MMMU was introduced in “MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI,” submitted to arXiv on November 27, 2023 by a team led by Xiang Yue. It was built to test multimodal models the way MMLU tests text models, at the level of difficult college coursework, but with questions that mix images and text.
The benchmark contains about 11,500 questions collected from college exams, quizzes, and textbooks. They span 6 core disciplines (Art and Design, Business, Science, Health and Medicine, Humanities and Social Science, and Tech and Engineering), 30 subjects, and 183 subfields. Crucially, the questions are heterogeneously multimodal: they include charts, diagrams, chemical structures, medical images, musical scores, and circuit drawings, so a model must actually read and reason over the visual content, not just the words.
The results showed how far multimodal models had to go. The strongest systems at release, GPT-4V and Gemini Ultra, scored only about 56 and 59 percent respectively, far below expert human performance. In the authors' analysis of 150 GPT-4V errors, the largest share (about 35 percent) were perceptual, meaning the model failed to perceive or interpret the image correctly, ahead of gaps in knowledge (about 29 percent) and flawed reasoning (about 26 percent). That gap made MMMU a leading yardstick for progress toward what its authors called “Expert AGI.”
MMMU quickly became one of the headline numbers labs report when launching new multimodal models, and spawned harder follow-ups such as MMMU-Pro as top scores climbed.