XLM-R (XLM-RoBERTa) is the model introduced in “Unsupervised Cross-lingual Representation Learning at Scale,” posted to arXiv on November 5, 2019 by Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov of Facebook AI. It is a Transformer-based masked language model trained on more than 2 terabytes of filtered CommonCrawl text spanning 100 languages, using only monolingual text in each language and no parallel translation data.
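To make the masked language modeling objective concrete, here is a minimal sketch of how training inputs are corrupted. It is an illustration under stated assumptions, not the authors' training code: it assumes the BERT-style scheme (select about 15% of tokens; replace 80% of those with a mask symbol, 10% with a random token, and leave 10% unchanged), it splits on whitespace instead of using the subword vocabulary a real model would use, and the names `mask_tokens`, `MASK`, and `toy_vocab` are hypothetical.

```python
import random

MASK = "<mask>"  # hypothetical mask symbol for this sketch


def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Corrupt a token list for masked language modeling.

    Returns (corrupted, labels). labels[i] holds the original token at
    positions the model must predict, and None everywhere else.
    The 15% / 80-10-10 split is an assumption borrowed from BERT.
    """
    rng = random.Random(seed)
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)  # the model is trained to recover this token
            roll = rng.random()
            if roll < 0.8:
                corrupted.append(MASK)               # usually: hide the token
            elif roll < 0.9:
                corrupted.append(rng.choice(vocab))  # sometimes: random token
            else:
                corrupted.append(tok)                # sometimes: leave as-is
        else:
            labels.append(None)  # position not selected, no prediction target
            corrupted.append(tok)
    return corrupted, labels


if __name__ == "__main__":
    # Monolingual sentences only: no aligned translations are needed.
    sentences = [
        "the model learns from raw text in every language".split(),
        "habari ya asubuhi rafiki yangu".split(),  # Swahili
    ]
    toy_vocab = sorted({tok for sent in sentences for tok in sent})
    for sent in sentences:
        corrupted, labels = mask_tokens(sent, toy_vocab, mask_prob=0.3, seed=1)
        print(corrupted, labels)
```

Because the objective needs nothing beyond raw text, the same function applies unchanged to a sentence in any of the 100 languages, which is why no parallel data is required.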
The central result was that a single multilingual model could substantially outperform multilingual BERT (mBERT): the paper reported +14.6% average accuracy on the XNLI cross-lingual inference benchmark and +13% average F1 on the MLQA question-answering benchmark. Just as important, the authors showed this could be done “without sacrificing per-language performance,” meaning the multilingual model stayed competitive with strong monolingual models on the GLUE and XNLI benchmarks. Gains were especially large for low-resource languages: relative to earlier XLM models, XNLI accuracy improved by 15.7% for Swahili and 11.4% for Urdu.
XLM-R addressed a persistent worry in multilingual NLP, which the paper calls the “curse of multilinguality”: at a fixed model capacity, adding more languages eventually costs quality in each of them. By scaling both data and model capacity, the team showed the tradeoff could be pushed back far enough that one shared model became a practical default for cross-lingual tasks.
For a business, the lesson is that a single multilingual model can serve many markets at once instead of paying to build and maintain a separate model per language, and the benefit is largest exactly where data is scarcest.