GPQA (Graduate-Level Google-Proof Q&A)

GPQA measures whether an AI system can answer genuinely hard science questions that resist shortcuts. It contains 448 multiple-choice questions in biology, physics, and chemistry, written and validated by domain experts to be “Google-proof”: a skilled non-expert with unrestricted web access still struggles to answer them. The benchmark therefore tests deep reasoning rather than information retrieval.

The benchmark was introduced by David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R. Bowman in “GPQA: A Graduate-Level Google-Proof Q&A Benchmark,” posted to arXiv in November 2023. The authors reported that experts who hold or are pursuing a PhD in the relevant field reached about 65 percent accuracy, while skilled non-experts reached only about 34 percent despite spending more than 30 minutes on average with unrestricted web access.

GPQA became a standard reference for frontier reasoning because its difficulty stays meaningful even as models improve, and because the human baselines make the scores easy to interpret. It is frequently cited when labs claim progress on expert-level reasoning.
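
As a rough illustration of how the human baselines help in reading a score, the sketch below computes accuracy over a set of multiple-choice answers and places it relative to the reported expert and non-expert figures. The function names and the toy data are hypothetical, and the 25 percent chance level assumes four answer options per question; this is not the official evaluation code.

```python
# Human baselines as reported by the GPQA authors (rounded).
EXPERT_ACCURACY = 0.65
NON_EXPERT_ACCURACY = 0.34
# Assumption: four answer options per question, so random guessing gives 0.25.
CHANCE_ACCURACY = 0.25


def accuracy(predictions, answers):
    """Return the fraction of questions answered correctly."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("at least one question is required")
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)


def interpret(score):
    """Place an accuracy score relative to the reference points above."""
    if score <= CHANCE_ACCURACY:
        return "at or below chance"
    if score < NON_EXPERT_ACCURACY:
        return "above chance, below the non-expert baseline"
    if score < EXPERT_ACCURACY:
        return "at or above the non-expert baseline, below the expert baseline"
    return "at or above the expert baseline"


if __name__ == "__main__":
    # Hypothetical toy data: four questions, three answered correctly.
    preds = ["A", "B", "C", "D"]
    gold = ["A", "B", "D", "D"]
    score = accuracy(preds, gold)
    print(score, interpret(score))
```

A score on a handful of questions says little on its own; the comparison is only meaningful when accuracy is computed over the full question set under a consistent prompting setup.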

Current model scores live on official and community leaderboards and shift as models are updated, so they are not listed here.

