MultiMedQA

MultiMedQA is a medical question-answering benchmark introduced in the 2023 Nature paper “Large language models encode clinical knowledge,” the work that also introduced Med-PaLM. It was assembled to give a single, broad yardstick for how well large language models handle medical knowledge rather than relying on any one narrow test.

The benchmark combines six existing datasets spanning different audiences: professional medical licensing questions (MedQA, drawn from US-style board exams), Indian medical entrance questions (MedMCQA), research literature questions (PubMedQA), the clinical-topics subset of the MMLU exam set, and consumer health queries (LiveQA and MedicationQA). The authors also added a new dataset, HealthSearchQA, built from commonly searched health questions, so the benchmark covers both expert exams and the kind of questions ordinary people ask.

Crucially, MultiMedQA pairs multiple-choice scoring with a framework for human evaluation of long-form answers, in which raters judge axes including factuality, comprehension, reasoning, potential for harm, and bias. That matters in medicine, where a confidently worded but wrong answer can be dangerous: a benchmark that counted only multiple-choice accuracy would not capture that risk.

For a general reader, MultiMedQA is worth knowing because it became a widely used reference benchmark for medical LLMs: progress claims for systems like Med-PaLM and its successors are typically stated as scores on its component tests, especially MedQA.
