Med-PaLM and large language models encode clinical knowledge

In July 2023 Nature published “Large language models encode clinical knowledge,” a paper from Google and DeepMind researchers led by Karan Singhal and Shekoofeh Azizi. It introduced MultiMedQA, a benchmark stitching together six existing medical question-answering datasets - spanning professional exams, research, and consumer queries - plus a new dataset called HealthSearchQA built from 3,173 commonly searched consumer health questions. The work was a systematic attempt to measure how much clinical knowledge a general-purpose language model actually carried.

Using a combination of prompting strategies (few-shot, chain-of-thought, and self-consistency prompting), the team’s Flan-PaLM model reached 67.6% accuracy on MedQA, the dataset of US Medical Licensing Examination-style questions, surpassing the prior state of the art by more than 17 percentage points. But the authors were careful to separate exam accuracy from clinical usefulness. When clinicians evaluated the model’s long-form answers to consumer questions, they found gaps that a multiple-choice score does not reveal.

To close those gaps the team introduced instruction prompt tuning, a parameter-efficient method that aligns a frozen model to a new domain using a handful of exemplars. The resulting model, Med-PaLM, produced answers to consumer questions that a panel of clinicians judged comparable to clinician-written ones on several axes: 92.6% of Med-PaLM answers were rated as aligned with scientific consensus, against 92.9% for clinicians. The paper was explicit that the model still fell short of clinicians overall and that “many limitations must be overcome before these models become viable for use in clinical applications.”

Med-PaLM mattered because it set the template for evaluating medical LLMs: not just whether a model can pass a test, but whether its free-text answers are accurate, complete, and safe under expert human review. That distinction - between scoring well on an exam and being trustworthy at the bedside - became the central question for every medical language model that followed.