SimpleQA

SimpleQA is a factuality benchmark introduced by OpenAI researchers Jason Wei and colleagues in a November 2024 paper titled “Measuring short-form factuality in large language models.” It consists of short, fact-seeking questions, each with a single, indisputable answer, written so that every response can be graded automatically as correct, incorrect, or not attempted. The questions were collected adversarially against GPT-4, meaning they were deliberately chosen because GPT-4 tended to get them wrong, which is intended to keep the benchmark challenging for newer frontier models.

The design goal is narrow on purpose. Rather than testing reasoning or long-form writing, SimpleQA isolates short-form factuality: whether the model states true facts, and whether it recognizes when it does not know. By tracking the rate of “not attempted” answers alongside correct and incorrect ones, the benchmark also gives a signal about calibration, that is, whether a model abstains when it is uncertain instead of confidently making things up. The authors report that even frontier models answer many questions wrong while still attempting them, which is the behavior most associated with hallucination.
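
The three-way grading can be turned into summary numbers in a few lines. The sketch below is illustrative only and is not the benchmark's official scoring code. The function name summarize, the label strings, and the two example models are assumptions made for this example. It assumes the summary metrics are the overall correct rate, the correct rate among attempted questions, and an F-score taken as the harmonic mean of those two.

```python
from collections import Counter

# Hypothetical label strings for the three grades described above.
GRADES = ("correct", "incorrect", "not_attempted")


def summarize(grades):
    """Summarize a list of per-question grade labels into rates."""
    counts = Counter(grades)
    unknown = set(counts) - set(GRADES)
    if unknown:
        raise ValueError(f"unexpected grade labels: {sorted(unknown)}")
    total = len(grades)
    if total == 0:
        raise ValueError("no grades to summarize")

    correct = counts["correct"]
    attempted = correct + counts["incorrect"]

    overall_correct = correct / total
    # Accuracy restricted to questions the model chose to answer.
    correct_given_attempted = correct / attempted if attempted else 0.0

    # Assumed single-number summary: harmonic mean of the two rates.
    denom = overall_correct + correct_given_attempted
    f_score = 2 * overall_correct * correct_given_attempted / denom if denom else 0.0

    return {
        "correct_rate": overall_correct,
        "incorrect_rate": counts["incorrect"] / total,
        "not_attempted_rate": counts["not_attempted"] / total,
        "correct_given_attempted": correct_given_attempted,
        "f_score": f_score,
    }


# Two made-up models with the same number of correct answers.
# The first always guesses; the second abstains on four questions.
always_guesses = summarize(["correct"] * 4 + ["incorrect"] * 6)
abstains = summarize(["correct"] * 4 + ["incorrect"] * 2 + ["not_attempted"] * 4)

for name, result in (("always_guesses", always_guesses), ("abstains", abstains)):
    print(name, {k: round(v, 3) for k, v in result.items()})
```

Both made-up models get the same number of questions right, but the one that abstains gives fewer wrong answers and has a higher correct-given-attempted rate. The three-way grading exists to show exactly this difference.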

For a business reader, SimpleQA matters because it puts a number on a risk that shows up constantly in real deployments: a model that sounds authoritative but is factually wrong. A benchmark that distinguishes an honest “I do not know” from a confident wrong answer is closer to what enterprises actually need than one that only rewards confident answers.
