MGSM, the Multilingual Grade School Math benchmark, was introduced in the October 2022 paper “Language Models are Multilingual Chain-of-Thought Reasoners” by Freda Shi, Jason Wei, and colleagues. It was built by taking 250 problems from the English GSM8K math dataset and translating them by hand into ten typologically diverse languages, including underrepresented ones such as Bengali and Swahili. The goal was to test whether the chain-of-thought reasoning that boosts math performance in English carries over to other languages.
The study found that the ability to solve MGSM problems with chain-of-thought prompting emerges as model scale increases, and that large models displayed surprisingly strong multilingual reasoning even in languages that are scarce in their training data. The result mattered because it suggested reasoning ability is not tightly bound to English, though performance still varied by language. MGSM went on to become a standard component of multilingual evaluation suites and is frequently cited when labs report how their models perform outside English.
For global organizations, MGSM addresses a practical question that English-only benchmarks ignore: does the model reason as well for a customer writing in Swahili or Bengali as it does for one writing in English? Measuring that gap is a prerequisite for deploying AI fairly across markets.
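One way to make that gap concrete is to compute accuracy per language and subtract each from the English score. The sketch below is illustrative only: the function names `accuracy_by_language` and `gap_to_reference` are hypothetical, the record format (language code, predicted answer, gold answer) is an assumption, and the toy records are made up for the example rather than drawn from MGSM or any reported result.

```python
from collections import defaultdict


def accuracy_by_language(records):
    """Return {language_code: fraction of exact-match answers}.

    `records` is an iterable of (language_code, predicted, gold) tuples.
    This record layout is an assumption for the sketch.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for lang, predicted, gold in records:
        total[lang] += 1
        if predicted == gold:
            correct[lang] += 1
    return {lang: correct[lang] / total[lang] for lang in total}


def gap_to_reference(accuracy, reference="en"):
    """Return each language's accuracy shortfall relative to `reference`."""
    ref = accuracy[reference]
    return {lang: ref - acc for lang, acc in accuracy.items() if lang != reference}


# Toy records invented for illustration; these are NOT real benchmark results.
records = [
    ("en", 18, 18), ("en", 5, 5), ("en", 7, 7), ("en", 3, 4),
    ("sw", 18, 18), ("sw", 5, 6), ("sw", 7, 7), ("sw", 3, 4),
    ("bn", 18, 18), ("bn", 5, 5), ("bn", 7, 7), ("bn", 2, 4),
]

accuracy = accuracy_by_language(records)
gaps = gap_to_reference(accuracy)

for lang, gap in sorted(gaps.items()):
    print(f"{lang}: accuracy={accuracy[lang]:.2f}, gap vs en={gap:+.2f}")
```

A positive gap means the model scores lower in that language than in English. Because each language in MGSM has only 250 problems, small differences between languages should be read with caution.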