MMLU-Pro is a tougher successor to the popular MMLU knowledge test. As leading models began scoring above 85 percent on MMLU, the original benchmark stopped separating strong models from weaker ones. MMLU-Pro responds by expanding each question from four answer choices to ten, removing trivial or noisy items, and adding more questions that require multi-step reasoning rather than recall.
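The effect of the larger answer set on the guessing floor can be shown with a little arithmetic. The sketch below assumes every question has exactly one correct answer and, as described above, the full set of options (four for MMLU, ten for MMLU-Pro). The function name `chance_accuracy` is made up for illustration and does not come from the benchmark's code.

```python
def chance_accuracy(num_choices: int) -> float:
    """Expected accuracy of uniform random guessing on a question
    with exactly one correct answer among `num_choices` options."""
    if num_choices < 1:
        raise ValueError("num_choices must be at least 1")
    return 1.0 / num_choices


# Assumption: four options per MMLU question, ten per MMLU-Pro question.
mmlu_floor = chance_accuracy(4)
mmlu_pro_floor = chance_accuracy(10)

print(f"MMLU guessing floor:     {mmlu_floor:.0%}")
print(f"MMLU-Pro guessing floor: {mmlu_pro_floor:.0%}")
```

With ten options, a model that guesses blindly lands near 10 percent rather than 25 percent, so less of any reported score can be put down to luck.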
The benchmark was introduced by Yubo Wang, Xueguang Ma, Ge Zhang, Wenhu Chen, and colleagues in “MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark,” posted in June 2024 and accepted to NeurIPS 2024 as a Datasets and Benchmarks spotlight. The authors reported that model accuracy fell by 16 to 33 percentage points relative to the original MMLU, and that score sensitivity to prompt wording dropped from 4 to 5 percent on MMLU to about 2 percent on MMLU-Pro, making results more stable.
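One simple way to picture prompt sensitivity is as the spread in a model's score when the same questions are asked under different prompt wordings. The sketch below uses the range (highest score minus lowest) as that spread. This is an illustrative choice, not necessarily the statistic the authors computed, and both the function name and the scores are hypothetical.

```python
def prompt_sensitivity(scores: list[float]) -> float:
    """Range (max minus min) of accuracy across prompt variants.

    Assumption: the range is used here as a simple proxy for sensitivity;
    the paper's exact statistic may differ.
    """
    if not scores:
        raise ValueError("need at least one score")
    return max(scores) - min(scores)


# Hypothetical accuracies (in percent) for one model under three prompt wordings.
# These numbers are invented for illustration, not taken from the paper.
made_up_scores = [71.0, 72.5, 73.0]

spread = prompt_sensitivity(made_up_scores)
print(f"Spread across prompts: {spread:.1f} percentage points")
```

A smaller spread means a model's reported score depends less on how the evaluator happened to phrase the prompt.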
A notable finding was that chain-of-thought reasoning meaningfully improved scores on MMLU-Pro, whereas it offered little benefit on the original MMLU. That makes the benchmark a better instrument for studying how reasoning techniques affect knowledge tasks.
For a business reader, MMLU-Pro illustrates a recurring pattern in AI evaluation: as soon as a benchmark is “solved,” the field rebuilds it harder so that headline accuracy numbers stay meaningful.