MMMU-Pro

MMMU-Pro is a more robust version of the MMMU multimodal benchmark, introduced in a September 2024 paper. Its authors noticed that strong models could answer a large share of the original MMMU questions without really using the images, either by exploiting text-only shortcuts or by guessing among the answer choices. MMMU-Pro addresses this in three steps. First, it filters out questions that text-only models can already answer. Second, it increases the number of candidate answer options so that lucky guessing pays off less often. Third, it introduces a vision-only setting in which the entire question is embedded in a screenshot or photo, so the model has to read the text and interpret the image from the same visual input.
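The first two steps can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the `Question` record, the `text_only_correct` field, the 50% cutoff, and the option counts 4 and 10 are all hypothetical values chosen for illustration. The sketch drops questions that most text-only models already solve, then shows how the random-guess baseline shrinks as options are added.

```python
from dataclasses import dataclass


@dataclass
class Question:
    qid: str
    num_options: int
    # Hypothetical field: how many text-only models answered correctly
    # without ever seeing the image.
    text_only_correct: int


def filter_text_answerable(questions, num_models, max_fraction=0.5):
    """Keep questions solved by at most `max_fraction` of text-only models.

    The 0.5 cutoff is an assumption for this sketch, not the paper's rule.
    """
    if num_models <= 0:
        raise ValueError("num_models must be positive")
    return [
        q for q in questions
        if q.text_only_correct / num_models <= max_fraction
    ]


def random_guess_accuracy(num_options):
    """Expected accuracy of uniform random guessing on one question."""
    return 1.0 / num_options


# Toy data: q1 is solvable from text alone by all four models, so it is dropped.
questions = [
    Question("q1", num_options=4, text_only_correct=4),
    Question("q2", num_options=4, text_only_correct=1),
    Question("q3", num_options=4, text_only_correct=0),
]

kept = filter_text_answerable(questions, num_models=4)
kept_ids = [q.qid for q in kept]
print(kept_ids)

# More options means a lower floor for lucky guessing.
print(random_guess_accuracy(4), random_guess_accuracy(10))
```

With these toy numbers, only the questions that text-only models mostly fail survive the filter, and moving from four to ten options lowers the chance-level score from 25% to 10%.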

The result is a substantially harder test. The paper reports that model scores on MMMU-Pro are roughly 17 to 27 percentage points lower than on the original MMMU, depending on the model, which indicates how much of the earlier performance came from shortcuts rather than genuine visual understanding. The authors also found that chain-of-thought prompting generally improved results on MMMU-Pro, while optical character recognition (OCR) prompts had minimal effect. This suggests the remaining difficulty lies in reasoning over combined text and images, not just in reading text out of pictures.

For anyone evaluating multimodal AI, MMMU-Pro is a reminder that a high benchmark score can be inflated by question design. Tightening the test changed the rankings and showed that “the model can see” is a claim worth checking carefully.
