Massively Multilingual Speech: recognition for 1,100+ languages

On May 22, 2023, Meta released the Massively Multilingual Speech (MMS) project, which extended speech technology to a vast number of languages most systems ignore. MMS provided automatic speech recognition (speech-to-text) and speech synthesis (text-to-speech) for more than 1,100 languages, roughly ten times the coverage of earlier systems, and language identification for more than 4,000 languages.

The hard part of covering so many languages is data: most of the world’s roughly 7,000 languages have little or no transcribed audio to train on. The team’s solution was to use readings of religious texts, specifically the New Testament, which has been recorded in more than 1,100 languages. These recordings gave, on average, about 32 hours of audio per language with aligned text. The models were built on the self-supervised wav2vec 2.0 approach: they first learned general speech representations from large amounts of unlabeled audio and were then fine-tuned on this aligned data.

The quality was strong despite the breadth. Meta reported that, on the languages both systems covered, MMS models achieved half the word error rate of OpenAI’s Whisper while covering 11 times as many languages. Meta open-sourced the models and code so that researchers could build on the work and help preserve linguistic diversity.
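Word error rate, the metric behind that comparison, is conventionally the number of word-level substitutions, deletions, and insertions needed to turn a system's transcript into the reference transcript, divided by the number of words in the reference. The sketch below is a minimal illustration of that standard definition, not Meta's evaluation code. The function name `word_error_rate` and the toy sentences are assumptions made for the example, and real evaluations usually normalize text (casing, punctuation) before scoring.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length.

    Illustrative helper (hypothetical name), using plain whitespace
    tokenization and no text normalization.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words matches an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# Toy data, invented for illustration; these are not MMS or Whisper outputs.
reference = "the cat sat on the mat"
system_a = "the cat sat on a mat"   # one substitution
system_b = "the cat sit on a mat"   # two substitutions

print(f"{word_error_rate(reference, system_a):.3f}")  # 0.167
print(f"{word_error_rate(reference, system_b):.3f}")  # 0.333
```

In this toy case system A makes one error in six words and system B makes two, so A has half the word error rate of B. That is the kind of relationship the reported comparison describes, although published figures are averaged over full test sets in many languages.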

MMS reframed what speech AI was for. Rather than perfecting a handful of high-resource languages, it aimed at the long tail of human language, many of whose speakers had never had usable speech technology. That goal, keeping endangered and under-served languages inside the AI revolution rather than outside it, became a defining theme of multilingual AI research.