DROP

DROP, short for Discrete Reasoning Over Paragraphs, raised the bar for reading-comprehension tests. Earlier benchmarks like SQuAD mostly required finding a span of text that answered a question. DROP instead demands that a system combine information from several places in a passage and then perform a discrete operation on it, such as adding numbers, counting events, or sorting items.
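To make the idea of a discrete operation concrete, here is a minimal sketch in Python. The passage is invented for illustration and is not taken from the DROP dataset, and the regular expression stands in for the reading step that a real system would have to learn. The sketch only shows the second half of the task: once the relevant numbers have been gathered from different sentences, the answer comes from counting, adding, sorting, or subtracting them.

```python
import re

# Hypothetical passage, invented for illustration; not a real DROP example.
passage = (
    "In the first quarter the Lions kicked a 23-yard field goal. "
    "The Bears answered with a 41-yard field goal in the second quarter. "
    "The Lions added a 35-yard field goal before halftime."
)

# Gather every field-goal distance mentioned anywhere in the passage.
# A real system must learn this step; a regex is only a stand-in here.
distances = [int(n) for n in re.findall(r"(\d+)-yard field goal", passage)]

# "How many field goals were kicked?" -> counting
count = len(distances)

# "How many total yards of field goals were kicked?" -> addition
total = sum(distances)

# "Which field goals were longest, in order?" -> sorting
longest_first = sorted(distances, reverse=True)

# "How many yards longer was the longest than the shortest?" -> subtraction
difference = longest_first[0] - longest_first[-1]

print(count, total, longest_first, difference)
```

None of the four answers appears verbatim in the passage, which is why a system that can only copy a span of text cannot answer these questions.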

The benchmark was introduced by Dheeru Dua, Sameer Singh, Matt Gardner, and colleagues in “DROP: A Reading Comprehension Benchmark Requiring Discrete Reasoning Over Paragraphs,” posted in March 2019. It contains roughly 96,000 crowdsourced questions. At release the gap between machines and people was stark: the best existing systems reached only 32.7 percent F1, while expert humans scored 96.0 percent. The authors’ own model, which combined reading-comprehension methods with simple numerical reasoning, reached 47.0 percent F1.
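The F1 figures above measure how closely a predicted answer overlaps with the reference answer. The sketch below shows a simplified token-overlap F1 of the kind commonly used for reading-comprehension benchmarks. It is an illustration under stated assumptions, not the benchmark's official scorer, which reportedly applies further answer normalization and special handling for numbers and multi-span answers. The function name `token_f1` is made up for this example.

```python
from collections import Counter


def token_f1(prediction: str, gold: str) -> float:
    """Simplified token-overlap F1 between a predicted and a reference answer.

    Assumption: lowercased whitespace tokens only. This is not the official
    DROP scorer; it only illustrates the precision/recall trade-off.
    """
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()

    # If either side is empty, score 1.0 only when both are empty.
    if not pred_tokens or not gold_tokens:
        return float(pred_tokens == gold_tokens)

    # Multiset intersection counts each shared token at most as often
    # as it appears on both sides.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0

    precision = overlap / len(pred_tokens)  # share of predicted tokens that are correct
    recall = overlap / len(gold_tokens)     # share of reference tokens that were found
    return 2 * precision * recall / (precision + recall)


print(token_f1("41", "41"))            # exact match
print(token_f1("the Lions", "Lions"))  # partial credit for an extra token
print(token_f1("Bears", "Lions"))      # no overlap
```

Under this kind of metric a wrong number earns no partial credit, since a single-token numeric answer either matches or it does not.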

DROP became a standard test for whether models can do arithmetic and logical operations grounded in text rather than just retrieve facts.

For a general reader, DROP reflects a practical need: documents like reports and contracts often require not just locating a figure but computing with several of them, and that combination of reading and calculation is the skill this benchmark measures.

Last verified June 7, 2026