WinoGrande

WinoGrande was introduced in “WinoGrande: An Adversarial Winograd Schema Challenge at Scale,” submitted to arXiv on July 24, 2019 by Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi of the Allen Institute for AI and the University of Washington. The paper won an Outstanding Paper award at AAAI 2020. The dataset is a large-scale, debiased successor to the Winograd Schema Challenge.

The original Winograd Schema Challenge, proposed in 2011, consists of expert-crafted sentence pairs that differ by only a word or two, where resolving an ambiguous pronoun requires commonsense reasoning. For instance, in “The city councilmen refused the demonstrators a permit because they feared violence,” the reader must decide whether “they” refers to the city councilmen or the demonstrators. The challenge had only 273 problems. WinoGrande scaled this to about 44,000 problems gathered through crowdsourcing, making it large enough to use for training as well as testing.
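
The twin-sentence structure can be illustrated with a small sketch. The class name SchemaProblem, its fields, and the helper functions are hypothetical, chosen for illustration only; they are not the format of any released dataset file. The example shows why the pairing matters: a resolver that always picks the same referent can get at most one sentence of each pair right.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SchemaProblem:
    """Hypothetical container for one pronoun-resolution problem."""
    sentence: str      # sentence containing the ambiguous pronoun
    pronoun: str       # the pronoun to resolve
    candidates: tuple  # the two possible referents
    answer: int        # index of the correct referent in `candidates`


# The classic twin pair: swapping one word flips the correct referent.
PAIR = [
    SchemaProblem(
        "The city councilmen refused the demonstrators a permit "
        "because they feared violence.",
        "they", ("the city councilmen", "the demonstrators"), 0),
    SchemaProblem(
        "The city councilmen refused the demonstrators a permit "
        "because they advocated violence.",
        "they", ("the city councilmen", "the demonstrators"), 1),
]


def accuracy(problems, resolver):
    """Fraction of problems where resolver(problem) returns the right index."""
    correct = sum(resolver(p) == p.answer for p in problems)
    return correct / len(problems)


def always_first(problem):
    """A trivial baseline that ignores the sentence entirely."""
    return 0


print(accuracy(PAIR, always_first))
```

Because the two sentences share everything except one word, a constant guess lands on exactly half of the pair, which is why chance performance on such problems is 50 percent.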

To stop models from exploiting statistical shortcuts, the authors ran the data through a new algorithm called AfLite, which strips out examples whose answers can be guessed from superficial word associations rather than reasoning. On the resulting harder set, the best models of the day scored roughly 59 to 79 percent, depending on how much training data they were allowed, against 94 percent for humans. The debiasing was the point: it forced systems to do more than exploit dataset artifacts.
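
The filtering idea can be sketched in a few lines. What follows is a simplified illustration of adversarial filtering in the spirit of AfLite, not the authors' implementation: a nearest-centroid classifier stands in for whatever classifiers the paper uses, the feature vectors are assumed to be precomputed, and the function names, parameter names, and default values are all assumptions. Each round trains many weak classifiers on random subsets, measures how often each held-out example is predicted correctly, and drops the most predictable ones.

```python
import random


def centroid_classifier(train):
    """Fit a nearest-centroid classifier on (features, label) pairs."""
    sums, counts = {}, {}
    for x, y in train:
        acc = sums.setdefault(y, [0.0] * len(x))
        for i, v in enumerate(x):
            acc[i] += v
        counts[y] = counts.get(y, 0) + 1
    centroids = {y: [v / counts[y] for v in s] for y, s in sums.items()}

    def predict(x):
        # Pick the label whose centroid is closest in squared distance.
        return min(centroids, key=lambda y: sum(
            (a - b) ** 2 for a, b in zip(x, centroids[y])))
    return predict


def aflite_sketch(data, n_models=20, train_frac=0.5, tau=0.75,
                  k=2, min_size=4, seed=0):
    """Return indices of examples that survive iterative filtering.

    data: list of (feature_vector, label) pairs.
    All parameters are illustrative assumptions, not published settings.
    """
    rng = random.Random(seed)
    keep = list(range(len(data)))
    while len(keep) > min_size:
        hits = {i: 0 for i in keep}   # correct held-out predictions
        seen = {i: 0 for i in keep}   # times the example was held out
        for _ in range(n_models):
            shuffled = keep[:]
            rng.shuffle(shuffled)
            cut = max(1, int(len(shuffled) * train_frac))
            predict = centroid_classifier([data[i] for i in shuffled[:cut]])
            for i in shuffled[cut:]:
                seen[i] += 1
                hits[i] += predict(data[i][0]) == data[i][1]
        # Predictability score: share of held-out predictions that were right.
        easy = [i for i in keep if seen[i] and hits[i] / seen[i] >= tau]
        easy.sort(key=lambda i: hits[i] / seen[i], reverse=True)
        drop = set(easy[:min(k, len(keep) - min_size)])
        keep = [i for i in keep if i not in drop]
        if len(drop) < k:             # too few easy examples left: stop
            break
    return keep
```

An example whose label is given away by its features is predicted correctly almost every time it is held out, so it scores above the threshold and is removed; an example that the classifiers get right only about half the time stays in the dataset.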

WinoGrande became a fixture in language model evaluation suites and, like its peers, a case study in the benchmark-contamination arms race: once a dataset is public, its examples leak into training corpora and scores stop measuring true generalization.
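
One common way to look for such leakage is to check how many of a benchmark example's word n-grams also occur in a training corpus. The sketch below is a minimal illustration of that idea under simple assumptions (whitespace tokenization, lowercasing, an in-memory corpus); the function names and the default n-gram length are hypothetical and do not describe any particular lab's procedure.

```python
def ngrams(text, n):
    """Set of lowercase word n-grams in `text` (whitespace tokenization)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_ratio(example, corpus_docs, n=8):
    """Share of the example's n-grams that appear anywhere in the corpus."""
    target = ngrams(example, n)
    if not target:
        return 0.0  # example shorter than n tokens: nothing to compare
    corpus = set()
    for doc in corpus_docs:
        corpus |= ngrams(doc, n)
    return len(target & corpus) / len(target)
```

A ratio near 1.0 suggests the example appears nearly verbatim in the corpus, while a ratio of 0.0 only shows that no exact n-gram match was found; paraphrased or translated copies would slip past a check this simple.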

Last verified June 7, 2026