Global-MMLU is a multilingual evaluation benchmark introduced in a December 2024 paper from Cohere and collaborators. It addresses a problem in the common practice of machine-translating the English MMLU exam into other languages to measure multilingual ability: the translations carry both translation artifacts and a deeper cultural bias, because many MMLU questions assume Western, English-language knowledge. The team rebuilt the benchmark across 42 languages using paid professional and community annotators to verify translation quality and to label questions as either culturally sensitive or culturally agnostic.
Their analysis exposed how skewed the original MMLU was. They found that 28 percent of all questions require culturally sensitive knowledge, and that among questions needing geographic knowledge, roughly 85 percent center on North America or Europe. Crucially, model rankings shifted depending on whether models were scored on the full set or only the culturally agnostic subset, showing that progress measured on translated MMLU can be distorted by which culture the questions implicitly assume.
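To make the ranking shift concrete, here is a minimal Python sketch of how scoring the same models on the full set versus only the culturally agnostic subset can reorder them. Everything in it is an illustrative assumption: the model names, question IDs, sensitivity labels, and record fields are invented toy data, not the paper's evaluation code, dataset schema, or results.

```python
# Toy illustration only: all names and values below are hypothetical.
QUESTIONS = ["q1", "q2", "q3", "q4", "q5"]
SENSITIVE = {"q4", "q5"}  # questions labeled culturally sensitive (assumed)
CORRECT = {
    "model_a": {"q1", "q4", "q5"},  # does well on the sensitive questions
    "model_b": {"q1", "q2"},        # does well on the agnostic questions
}

# One record per (model, question) pair.
records = [
    {
        "model": model,
        "question": q,
        "culturally_sensitive": q in SENSITIVE,
        "correct": q in right,
    }
    for model, right in CORRECT.items()
    for q in QUESTIONS
]


def accuracy(rows):
    """Fraction of rows answered correctly."""
    return sum(r["correct"] for r in rows) / len(rows)


def rank_models(rows, keep=lambda r: True):
    """Return (models sorted by accuracy, best first; per-model accuracy)."""
    by_model = {}
    for r in rows:
        if keep(r):
            by_model.setdefault(r["model"], []).append(r)
    scores = {m: accuracy(rs) for m, rs in by_model.items()}
    return sorted(scores, key=scores.get, reverse=True), scores


full_rank, full_scores = rank_models(records)
agnostic_rank, agnostic_scores = rank_models(
    records, keep=lambda r: not r["culturally_sensitive"]
)

print("full set:       ", full_rank, full_scores)
print("agnostic subset:", agnostic_rank, agnostic_scores)
```

On this toy data, model_a leads on the full set (3 of 5 correct versus 2 of 5), while model_b leads on the agnostic subset (2 of 3 versus 1 of 3). The sketch shows only the mechanism: a single aggregate score can hide which kind of question a model is winning on.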
For anyone comparing models for global use, Global-MMLU is a caution against taking a single translated leaderboard at face value. A model that tops an English-derived exam may not be the best choice for users whose knowledge and context differ from the test’s hidden assumptions.