No Language Left Behind: one model translating 200 languages

No Language Left Behind (NLLB) is Meta’s effort to build a single machine translation model that handles 200 languages, with a deliberate emphasis on low-resource languages that commercial systems had long ignored. The flagship model, NLLB-200, was open-sourced in July 2022, and the underlying research was published in the journal Nature in 2024 as “Scaling neural machine translation to 200 languages.”

The project’s distinguishing choice was to invest heavily in languages with little available data. NLLB-200 covers three times as many low-resource languages as high-resource ones and supports 55 African languages with high-quality results, where Meta counted fewer than 25 in widely used translation tools at the time. Meta reported that the model achieved an average 44% improvement in translation quality, measured by BLEU, over the prior state of the art.

Technically, NLLB-200 is a large model, with roughly 54 billion parameters according to Meta, built on a Sparsely Gated Mixture-of-Experts architecture. In this design a gating network sends each input token to a small subset of expert sub-networks, so only a fraction of the parameters is active for any given input. The intent is that low-resource languages, which have little training data, draw on capacity shared with related languages without the model overfitting on their scarce examples. To evaluate so many languages, the team built FLORES-200, a benchmark of professionally translated sentences covering all 200 languages, and mined bilingual text from web archives to expand the training data.
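
The routing idea behind a sparsely gated mixture of experts can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not NLLB-200's implementation: the function names (softmax, top_k_gate, moe_layer, make_expert), the random gating weights, the "experts" that merely scale their input, and the choice of two experts per token are all invented here for clarity.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_gate(token, gate_weights, k=2):
    """Return [(expert_index, weight), ...] for the k best-scoring experts.

    gate_weights holds one gating vector per expert (hypothetical values).
    """
    # One logit per expert: dot product of the token with that expert's vector.
    logits = [sum(t * w for t, w in zip(token, row)) for row in gate_weights]
    probs = softmax(logits)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    # Renormalise so the selected experts' weights sum to 1.
    norm = sum(probs[i] for i in chosen)
    return [(i, probs[i] / norm) for i in chosen]


def moe_layer(token, gate_weights, experts, k=2):
    """Combine the outputs of only the selected experts, weighted by the gate."""
    output = [0.0] * len(token)
    for idx, weight in top_k_gate(token, gate_weights, k):
        expert_out = experts[idx](token)  # experts that were not chosen never run
        output = [o + weight * e for o, e in zip(output, expert_out)]
    return output


def make_expert(scale):
    """Toy stand-in for an expert feed-forward network: scales its input."""
    return lambda token: [scale * t for t in token]


random.seed(0)
dim, num_experts = 4, 8
gate_weights = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(num_experts)]
experts = [make_expert(s) for s in range(1, num_experts + 1)]

token = [0.5, -1.0, 0.25, 2.0]
routing = top_k_gate(token, gate_weights)
output = moe_layer(token, gate_weights, experts)
print("selected experts and weights:", routing)
print("layer output:", output)
```

Only two of the eight toy experts run for this token, which is the sense in which the model is "sparse": total parameter count can grow with the number of experts while the compute per token stays roughly constant.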

Alongside the open-source release of the models and data, Meta reported that techniques from the project help support more than 25 billion translations a day on Facebook, Instagram, and its other platforms, and that the model is available to Wikipedia editors through the Wikimedia Foundation’s Content Translation tool. NLLB is a landmark in pushing language technology toward the long tail of the world’s roughly 7,000 languages, most of which had been excluded from the AI translation boom.
