Performance of ChatGPT on USMLE (Kung et al., 2023)

In February 2023, PLOS Digital Health published “Performance of ChatGPT on USMLE: Potential for AI-assisted medical education using large language models,” by Tiffany Kung and colleagues at AnsibleHealth and several US medical schools. The study tested ChatGPT, the general-purpose model OpenAI had released only weeks earlier, on questions drawn from the three-step United States Medical Licensing Examination, the sequence that MD physicians must pass to be licensed in the United States.

Across the three steps and several question formats, ChatGPT’s accuracy ranged roughly from the mid-40s to the mid-70s in percent: about 45 to 75 percent on Step 1, 54 to 62 percent on Step 2CK, and 62 to 69 percent on Step 3. The USMLE passing threshold sits at approximately 60 percent. The authors concluded that ChatGPT “performed at or near the passing threshold for all three exams without any specialized training or reinforcement.”
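
To make the "at or near the passing threshold" reading concrete, the following minimal sketch compares each quoted range with the approximate 60 percent threshold. The figures are the rounded ones given above, not values taken from the paper's tables, and the names REPORTED_RANGES, classify, and summarize are hypothetical helpers introduced only for illustration.

```python
# Approximate accuracy ranges (percent) as quoted in the text above;
# these are rounded figures, not the paper's exact table values.
REPORTED_RANGES = {
    "Step 1": (45, 75),
    "Step 2CK": (54, 62),
    "Step 3": (62, 69),
}

# Approximate passing threshold (percent), as stated in the text.
PASSING_THRESHOLD = 60


def classify(low, high, threshold=PASSING_THRESHOLD):
    """Label a range relative to the threshold (hypothetical helper)."""
    if low >= threshold:
        return "above"
    if high < threshold:
        return "below"
    return "straddles"


def summarize(ranges=REPORTED_RANGES, threshold=PASSING_THRESHOLD):
    """Map each exam step to its label relative to the threshold."""
    return {step: classify(lo, hi, threshold) for step, (lo, hi) in ranges.items()}


if __name__ == "__main__":
    for step, label in summarize().items():
        lo, hi = REPORTED_RANGES[step]
        print(f"{step}: {lo}-{hi}% ({label} the ~{PASSING_THRESHOLD}% threshold)")
```

Under these assumed figures, only the Step 3 range sits entirely above the threshold, while the Step 1 and Step 2CK ranges straddle it. That is consistent with the authors' wording of "at or near" rather than a clear pass on every step.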

The result drew wide attention because the model was not built for medicine. Human candidates typically spend hundreds of hours preparing for each step; a general chatbot landed near passing with none of that domain training. The paper also noted that ChatGPT produced coherent explanations for its answers, which the authors saw as a potential aid to medical education.

The study is best read alongside the cautions that accompanied it and the medical-LLM work that followed, such as Med-PaLM. A licensing exam measures recall and reasoning on structured questions; it does not measure whether a model’s free-text clinical advice is accurate, complete, or safe. The exam result was a striking benchmark, not a clearance to practice.