FLORES-200 is the evaluation benchmark Meta built to measure machine translation quality across 200 languages, with a particular focus on low-resource languages that earlier benchmarks could not assess. It was released in 2022 alongside the No Language Left Behind (NLLB) project and described in the paper “No Language Left Behind: Scaling Human-Centered Machine Translation” (arXiv 2207.04672).
The core problem FLORES-200 addresses is that you cannot improve what you cannot measure. For most of the world’s languages there had been no reliable, professionally translated test set, so researchers had no trustworthy way to know whether a system translated, say, Luganda or Asturian well or badly. FLORES-200 provides the same set of sentences translated into each of its 200-plus languages, which makes it a many-to-many benchmark: because every sentence exists in every language, any language can serve as the source or the target, so one shared test set supports evaluation across more than 40,000 translation directions.
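The direction count follows from simple arithmetic: with n languages sharing one sentence set, every ordered (source, target) pair of distinct languages is one direction, which gives n × (n − 1). The sketch below illustrates this in Python. The function name and the three language codes are illustrative, and the language counts passed in are example inputs rather than official figures.

```python
from itertools import permutations


def count_directions(num_languages: int) -> int:
    """Ordered (source, target) pairs with source != target."""
    return num_languages * (num_languages - 1)


# Tiny illustration with three hypothetical language codes.
langs = ["eng", "lug", "ast"]
pairs = list(permutations(langs, 2))  # every ordered pair of distinct languages
print(len(pairs), count_directions(len(langs)))  # 6 6

# Example inputs only: exactly 200 languages falls just short of 40,000,
# while a couple more languages pushes the count past it.
print(count_directions(200))  # 39800
print(count_directions(202))  # 40602
```

Because the count grows quadratically, each added language contributes roughly 2n new directions, which is why a single multi-parallel test set is far more economical than building a separate test set per language pair.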
The source sentences were drawn from English-language Wikimedia projects (Wikinews, Wikijunior, and Wikivoyage), then translated and reviewed by professional translators, with quality control to keep the reference translations accurate. Scores are reported with standard metrics, including BLEU computed over SentencePiece subword tokens (spBLEU) and the character n-gram metric chrF++, both of which are better suited than word-level BLEU to languages whose word boundaries differ from English. Meta used FLORES-200 to show that its NLLB-200 model improved BLEU by an average of 44% relative to the previous state of the art.
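To make the idea of a character-level metric concrete, here is a deliberately simplified character n-gram F-score in Python. It is a sketch in the spirit of chrF, not the official chrF++ computation: the function names, the whitespace stripping, and the defaults (`max_n=6`, `beta=2.0`) are assumptions made for illustration. For real evaluation, use an established implementation such as sacreBLEU.

```python
from collections import Counter


def char_ngrams(text: str, n: int) -> Counter:
    """Count character n-grams, ignoring spaces (a simplifying assumption)."""
    chars = text.replace(" ", "")
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def char_fscore(hypothesis: str, reference: str,
                max_n: int = 6, beta: float = 2.0) -> float:
    """Simplified character n-gram F-beta score in [0, 1]."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp = char_ngrams(hypothesis, n)
        ref = char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # one string is too short to have n-grams of this order
        overlap = sum((hyp & ref).values())  # clipped matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    # beta > 1 weights recall more heavily than precision.
    return (1 + beta ** 2) * p * r / (beta ** 2 * p + r)


reference = "the cat sat on the mat"
print(char_fscore(reference, reference))            # 1.0
print(char_fscore("abc", "xyz"))                    # 0.0
print(char_fscore("the cat sits on a mat", reference))
```

Because the score is built from character sequences rather than whitespace-delimited words, a near-miss such as a different inflection still earns partial credit, which matters for morphologically rich languages and for scripts that do not separate words with spaces.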
Released under an open license, FLORES-200 became a de facto standard for low-resource translation research and for machine translation competitions. It is a good example of how a carefully built benchmark, more than any single model, can redirect a research field: in this case, toward the thousands of languages that the AI translation boom had left out.