Glot500: Scaling Multilingual Models to 500 Languages

Glot500 is described in “Glot500: Scaling Multilingual Corpora and Language Models to 500 Languages,” posted to arXiv on May 20, 2023 by Ayyoob Imani, Peiqin Lin, Amir Hossein Kargaran, Hinrich Schütze, and colleagues, and published at ACL 2023. Where most multilingual models concentrate on improving quality for around 100 languages, Glot500 deliberately scales in the other direction, breadth: it covers 511 languages, most of them low-resource.

The team built two things: Glot500-c, a cleaned multilingual corpus assembled for these hundreds of languages, and Glot500-m, a model created by continuing to pre-train an existing multilingual model (XLM-R) on that corpus. They reported large improvements over the XLM-R baseline for both high-resource and low-resource languages. Their analysis found that no single factor explains quality; it depends on several together: the size of available text, the writing script, and how much a language can benefit from transfer from related languages.

Glot500 is significant because it tackles the long tail of languages directly. Adding the 401st or 500th language is far harder than adding the 50th, since data is scarcer and often messier, and the paper is one of the clearest demonstrations that language coverage can be pushed much further with careful corpus building.

For anyone serving global or underserved communities, Glot500 shows that the set of languages a model can plausibly support is wider than commercial systems usually assume, if someone does the work of gathering and cleaning the text.
