SeamlessM4T Medium

SeamlessM4T is a collection of models designed to provide high-quality translation, allowing people from different linguistic communities to communicate effortlessly through speech and text.

SeamlessM4T covers:

📥 101 languages for speech input
⌨️ 96 languages for text input/output
🗣️ 35 languages for speech output

This is the "medium" variant of the unified model, which enables multiple tasks without relying on multiple separate models:

Speech-to-speech translation (S2ST)
Speech-to-text translation (S2TT)
Text-to-speech translation (T2ST)
Text-to-text translation (T2TT)
Automatic speech recognition (ASR)

SeamlessM4T models

Model Name	#params	checkpoint	metrics
SeamlessM4T-Large	2.3B	Model card - checkpoint	metrics
SeamlessM4T-Medium	1.2B	Model card - checkpoint	metrics

We provide extensive evaluation results for SeamlessM4T-Medium and SeamlessM4T-Large in the SeamlessM4T paper (as averages) and in the metrics files above.

Instructions to run inference with SeamlessM4T models

The SeamlessM4T models are currently available through the seamless_communication package, which can be installed by following its Installation instructions.

Once installed, a Translator object can be instantiated to perform all five of the spoken language tasks. The Translator is instantiated with three arguments:

model_name_or_card: SeamlessM4T checkpoint. Can be either seamlessM4T_medium for the medium model, or seamlessM4T_large for the large model
vocoder_name_or_card: vocoder checkpoint (vocoder_36langs)
device: Torch device

import torch
from seamless_communication.models.inference import Translator


# Initialize a Translator object with a multitask model, vocoder on the GPU.
translator = Translator("seamlessM4T_medium", vocoder_name_or_card="vocoder_36langs", device=torch.device("cuda:0"))
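
The example above hard-codes cuda:0. On a machine without a GPU, a fallback along the following lines may help. This is only a minimal sketch: select_device_name is a hypothetical helper (it is not part of seamless_communication), and it assumes the Translator also accepts a CPU device, where inference is likely to be much slower.

```python
# Hypothetical helper, not part of seamless_communication.
try:
    import torch
except ImportError:  # allows the helper to run even if PyTorch is missing
    torch = None


def select_device_name(preferred: str = "cuda:0") -> str:
    """Return `preferred` when CUDA is usable, otherwise fall back to "cpu"."""
    if torch is not None and torch.cuda.is_available():
        return preferred
    return "cpu"


device_name = select_device_name()
print(f"Selected device: {device_name}")

# Intended use with the Translator (requires torch and seamless_communication):
# translator = Translator(
#     "seamlessM4T_medium",
#     vocoder_name_or_card="vocoder_36langs",
#     device=torch.device(device_name),
# )
```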

Once instantiated, the predict() method can be used to run inference as many times as needed on any of the supported tasks.
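
To illustrate reusing a single instance, the sketch below loops over several target languages for text-to-text translation. The helper name translate_to_many is hypothetical; it relies only on the predict() call shape used in the T2TT example in this document (positional input, task, and target language, plus a src_lang keyword, returning a three-element tuple).

```python
from typing import Any, Dict, Iterable


def translate_to_many(
    translator: Any,
    text: str,
    src_lang: str,
    tgt_langs: Iterable[str],
) -> Dict[str, str]:
    """Translate `text` into each target language, reusing one translator."""
    results: Dict[str, str] = {}
    for tgt_lang in tgt_langs:
        # For T2TT only the first element (the translated text) is of interest.
        translated_text, _, _ = translator.predict(
            text, "t2tt", tgt_lang, src_lang=src_lang
        )
        results[tgt_lang] = str(translated_text)
    return results
```

A call might look like translate_to_many(translator, "Hello", "eng", ["fra", "deu"]). The language codes here are illustrative and should be checked against the list of supported languages.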

Given an input audio with <path_to_input_audio> or an input text <input_text> in <src_lang>, we can translate into <tgt_lang> as follows.

S2ST and T2ST:

# S2ST
translated_text, wav, sr = translator.predict(<path_to_input_audio>, "s2st", <tgt_lang>)

# T2ST
translated_text, wav, sr = translator.predict(<input_text>, "t2st", <tgt_lang>, src_lang=<src_lang>)

Note that <src_lang> must be specified for T2ST.

The generated units are synthesized and the output audio file is saved with:

import torchaudio

wav, sr = translator.synthesize_speech(<speech_units>, <tgt_lang>)

# Save the translated audio generation.
torchaudio.save(
    <path_to_save_audio>,
    wav[0].cpu(),
    sample_rate=sr,
)

S2TT, T2TT and ASR:

# S2TT
translated_text, _, _ = translator.predict(<path_to_input_audio>, "s2tt", <tgt_lang>)

# ASR
# This is equivalent to S2TT with `<tgt_lang>=<src_lang>`.
transcribed_text, _, _ = translator.predict(<path_to_input_audio>, "asr", <src_lang>)

# T2TT
translated_text, _, _ = translator.predict(<input_text>, "t2tt", <tgt_lang>, src_lang=<src_lang>)

Note that <src_lang> must be specified for T2TT.
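
Because the source language is required only for the text-input tasks, it can be convenient to validate arguments before calling predict(). The following sketch shows one way to do that. build_predict_call is a hypothetical helper, not part of the package; the task codes are the ones used in the examples above.

```python
from typing import Any, Dict, Optional, Tuple

# Task codes as they appear in the predict() examples in this document.
TEXT_INPUT_TASKS = {"t2st", "t2tt"}
SPEECH_INPUT_TASKS = {"s2st", "s2tt", "asr"}


def build_predict_call(
    task: str,
    model_input: str,
    tgt_lang: str,
    src_lang: Optional[str] = None,
) -> Tuple[Tuple[str, str, str], Dict[str, Any]]:
    """Return (args, kwargs) for predict(), checking the src_lang requirement."""
    task = task.lower()
    if task not in TEXT_INPUT_TASKS | SPEECH_INPUT_TASKS:
        raise ValueError(f"Unsupported task: {task!r}")
    if task in TEXT_INPUT_TASKS:
        if not src_lang:
            raise ValueError(f"src_lang must be specified for task {task!r}")
        return (model_input, task, tgt_lang), {"src_lang": src_lang}
    # Speech-input tasks: the examples above pass no src_lang.
    # For ASR, tgt_lang is the language of the audio itself.
    return (model_input, task, tgt_lang), {}
```

The result can then be unpacked into the real call, for example args, kwargs = build_predict_call("t2tt", "Hello", "fra", src_lang="eng") followed by translator.predict(*args, **kwargs). The language codes are illustrative.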

Citation

If you plan to use SeamlessM4T in your work or any models/datasets/artifacts published in SeamlessM4T, please cite:

@article{seamlessm4t2023,
  title={"SeamlessM4T—Massively Multilingual \& Multimodal Machine Translation"},
  author={{Seamless Communication}, Lo\"{i}c Barrault, Yu-An Chung, Mariano Cora Meglioli, David Dale, Ning Dong, Paul-Ambroise Duquenne, Hady Elsahar, Hongyu Gong, Kevin Heffernan, John Hoffman, Christopher Klaiber, Pengwei Li, Daniel Licht, Jean Maillard, Alice Rakotoarison, Kaushik Ram Sadagopan, Guillaume Wenzek, Ethan Ye, Bapi Akula, Peng-Jen Chen, Naji El Hachem, Brian Ellis, Gabriel Mejia Gonzalez, Justin Haaheim, Prangthip Hansanti, Russ Howes, Bernie Huang, Min-Jae Hwang, Hirofumi Inaguma, Somya Jain, Elahe Kalbassi, Amanda Kallet, Ilia Kulikov, Janice Lam, Daniel Li, Xutai Ma, Ruslan Mavlyutov, Benjamin Peloquin, Mohamed Ramadan, Abinesh Ramakrishnan, Anna Sun, Kevin Tran, Tuan Tran, Igor Tufanov, Vish Vogeti, Carleigh Wood, Yilin Yang, Bokai Yu, Pierre Andrews, Can Balioglu, Marta R. Costa-juss\`{a}, Onur \c{C}elebi, Maha Elbayad, Cynthia Gao, Francisco Guzm\'an, Justine Kao, Ann Lee, Alexandre Mourachko, Juan Pino, Sravya Popuri, Christophe Ropers, Safiyyah Saleem, Holger Schwenk, Paden Tomasello, Changhan Wang, Jeff Wang, Skyler Wang},
  journal={ArXiv},
  year={2023}
}

License

The Seamless Communication code and weights are CC-BY-NC 4.0 licensed.