SeamlessM4T, a single model for speech and text translation

Meta introduced SeamlessM4T on August 22, 2023, describing it as a single foundational model that performs translation and transcription across both speech and text. The "M4T" stands for Massively Multilingual and Multimodal Machine Translation. Where earlier systems chained together separate models (one to transcribe speech, one to translate, one to synthesize audio), SeamlessM4T did the whole job in one model, which Meta argued reduced the errors and delays that accumulate across a multi-stage pipeline.

The model supported nearly 100 languages and combined five capabilities: speech-to-speech translation, speech-to-text translation, text-to-speech translation, text-to-text translation, and automatic speech recognition. Coverage varied by task: speech recognition and text translation spanned close to 100 languages, while speech output was available in only about 36. Even so, the breadth of a unified speech-and-text translator was new.

To train it, Meta built on its earlier multilingual speech work and assembled a large corpus of automatically aligned speech translations. Alongside the model, the company released the metadata of SeamlessAlign, which it called the largest open multimodal translation dataset to date, totaling 270,000 hours of mined speech and text alignments. The model itself was released under a noncommercial research license (CC BY-NC 4.0) so that others could build on it.

SeamlessM4T pointed toward the long-imagined universal speech translator, the science-fiction device that lets people speaking different languages understand each other in real time. By folding transcription, translation, and synthesis into one multimodal model, it foreshadowed the increasingly unified, multimodal systems that would define the next phase of AI.
