In May 2023, Google researchers posted “Towards Expert-Level Medical Question Answering with Large Language Models,” describing Med-PaLM 2. On MedQA, the dataset of US Medical Licensing Examination-style multiple-choice questions, the paper reports that “Med-PaLM 2 scored up to 86.5% on the MedQA dataset, improving upon Med-PaLM by over 19%.” That result set a new state of the art at the time.
The jump came from a stronger base model (PaLM 2), medical-domain fine-tuning, and refined prompting strategies, including an approach the authors call ensemble refinement. The authors also ran human evaluations: in pairwise comparisons of answers to consumer medical questions, physicians preferred Med-PaLM 2’s answers to physician-written answers on eight of nine axes related to clinical utility. As with the first Med-PaLM, the team framed these results as progress toward, not arrival at, clinical readiness: a high exam score does not by itself make a model safe to deploy.