SQuAD (Stanford Question Answering Dataset)

SQuAD, the Stanford Question Answering Dataset, was introduced in the June 2016 paper “SQuAD: 100,000+ Questions for Machine Comprehension of Text” by Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. It is a reading-comprehension dataset: crowdworkers posed more than 100,000 questions about a set of Wikipedia articles, and the answer to each question is a span of text taken directly from the corresponding passage. Because every answer is a span of the passage, predictions can be scored automatically against the crowdworkers' reference answers using exact match and token-overlap F1, which made SQuAD a clean, repeatable yardstick.
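The span-based scoring described above can be illustrated with a minimal sketch of the two usual SQuAD metrics, exact match and token-level F1. The normalization steps (lowercasing, stripping punctuation and English articles, collapsing whitespace) follow the commonly used evaluation recipe, but the function names here are illustrative assumptions rather than the official script.

```python
import re
import string
from collections import Counter
from typing import Callable, List


def normalize_answer(text: str) -> str:
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    text = text.lower()
    punctuation = set(string.punctuation)
    text = "".join(ch for ch in text if ch not in punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize_answer(prediction) == normalize_answer(reference))


def f1_score(prediction: str, reference: str) -> float:
    """Harmonic mean of token precision and recall after normalization."""
    pred_tokens = normalize_answer(prediction).split()
    ref_tokens = normalize_answer(reference).split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as agreement; one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def best_over_references(
    metric: Callable[[str, str], float], prediction: str, references: List[str]
) -> float:
    """Score against each reference answer and keep the best result."""
    return max(metric(prediction, ref) for ref in references)
```

For example, predicting "Broncos" when the reference is "Denver Broncos" earns no exact-match credit but a partial F1 of about 0.67, which is why F1 is the more forgiving of the two numbers.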

When it launched, the gap between machines and people was wide. The paper reported human performance of 86.8 percent F1 against a logistic-regression baseline of 51.0 percent, framing the dataset as “a good challenge problem.” Over the next two years that gap closed dramatically, and SQuAD became the headline leaderboard on which attention-based and then pre-trained models competed. The arrival of BERT in 2018 pushed scores to and past the human baseline.

SQuAD 2.0, released in 2018 by Rajpurkar, Robin Jia, and Liang, made the task harder by adding over 50,000 unanswerable questions written adversarially to look answerable. To do well, a model now has to recognize when the passage does not support any answer and abstain rather than guess. This directly targets the overconfident-fabrication failure mode that still matters for real systems. SQuAD sits alongside GLUE as one of the benchmarks that defined progress in natural language processing just before large language models took over.
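One common way to implement the abstention described above is to compare the score of the best candidate span with a separate "no answer" score and return an empty string when the latter wins by more than a tuned margin. The sketch below assumes a hypothetical model that already produces both scores; the function names, the score inputs, and the default threshold are illustrative and not part of the dataset itself.

```python
from typing import List


def answer_or_abstain(
    best_span: str,
    span_score: float,
    null_score: float,
    threshold: float = 0.0,
) -> str:
    """Return the best span, or "" to abstain.

    span_score and null_score are assumed outputs of a hypothetical model;
    threshold is a margin that would be tuned on development data.
    """
    if null_score - span_score > threshold:
        return ""  # abstain: the passage likely does not contain an answer
    return best_span


def exact_match_with_abstention(prediction: str, references: List[str]) -> float:
    """Exact match where an empty reference list marks an unanswerable question."""
    if not references:
        # Unanswerable: only an abstention (empty string) is correct.
        return float(prediction == "")
    return float(prediction in references)
```

Under this scheme a model that always guesses scores zero on every unanswerable question, so raising the threshold trades missed abstentions for more confident answers.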

Last verified June 7, 2026