HellaSwag

HellaSwag was introduced in “HellaSwag: Can a Machine Really Finish Your Sentence?”, submitted to arXiv on May 19, 2019 by Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi of the University of Washington and the Allen Institute for AI, and presented at ACL 2019. It is a commonsense reasoning benchmark in which a model reads the start of an everyday scenario and must pick the most plausible continuation from four choices.
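The four-way task format can be made concrete with a small scoring sketch. Everything in it is illustrative: the two items are invented rather than drawn from the dataset, and `toy_overlap_score` is a hypothetical stand-in for a real model's plausibility score (in practice this would typically be something like a language model's likelihood of the ending given the context).

```python
import re
from typing import Callable, Dict, List, Sequence

Scorer = Callable[[str, str], float]


def toy_overlap_score(context: str, ending: str) -> float:
    """Hypothetical stand-in scorer: number of distinct words shared by context and ending."""
    def words(text: str) -> set:
        return set(re.findall(r"[a-z']+", text.lower()))
    return float(len(words(context) & words(ending)))


def pick_ending(context: str, endings: Sequence[str], score: Scorer) -> int:
    """Return the index of the ending the scorer rates highest (ties go to the first)."""
    scores = [score(context, ending) for ending in endings]
    return max(range(len(endings)), key=lambda i: scores[i])


def accuracy(items: List[Dict], score: Scorer) -> float:
    """Fraction of items where the top-scored ending is the labelled one."""
    correct = sum(
        pick_ending(item["ctx"], item["endings"], score) == item["label"]
        for item in items
    )
    return correct / len(items)


# Invented examples in the benchmark's shape: one context, four endings, one label.
ITEMS = [
    {
        "ctx": "A woman fills a kettle with water and puts it on the stove.",
        "endings": [
            "She waits for the water to boil.",
            "She paints the fence blue.",
            "She dives into a pool.",
            "She folds a paper plane.",
        ],
        "label": 0,
    },
    {
        "ctx": "A man picks up a guitar and sits on a stool.",
        "endings": [
            "He eats lunch.",
            "He starts to play the guitar.",
            "He swims across the lake.",
            "He plants tomatoes.",
        ],
        "label": 1,
    },
]

if __name__ == "__main__":
    print("toy accuracy:", accuracy(ITEMS, toy_overlap_score))
```

With four choices per item, a scorer that guesses uniformly at random would be expected to land near 25 percent, which is the baseline the reported model and human scores should be read against.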

The clever part is how the wrong answers were made. The authors used adversarial filtering: machine-generated wrong endings were repeatedly screened by a series of discriminator models, keeping only those that fooled the machines while still looking obviously wrong to people. This pushed the data into what they called a “Goldilocks zone” of difficulty, easy for humans but hard for models.
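The filtering loop can be sketched in a few lines. This is a toy illustration under loud assumptions, not the authors' pipeline: each ending is reduced to a single made-up number standing in for whatever stylistic artifacts a discriminator might pick up, the "discriminator" is just a threshold between the real and machine averages, and all function names are hypothetical. The human check on the surviving endings is left out.

```python
import random


def make_toy_items(n=200, pool_size=20, seed=0):
    """Build toy data: one feature per ending (higher = more machine-like). Purely synthetic."""
    rng = random.Random(seed)
    return [
        {
            "real": rng.gauss(0.0, 1.0),                                # human-written ending
            "negatives": [rng.gauss(2.0, 1.0) for _ in range(3)],       # currently selected wrong endings
            "pool": [rng.gauss(2.0, 1.0) for _ in range(pool_size)],    # spare machine candidates
        }
        for _ in range(n)
    ]


def fit_threshold(items):
    """Toy discriminator: midpoint between the mean real and mean machine feature."""
    real_mean = sum(it["real"] for it in items) / len(items)
    negs = [f for it in items for f in it["negatives"]]
    return (real_mean + sum(negs) / len(negs)) / 2


def detection_rate(items, threshold):
    """Share of selected wrong endings the discriminator flags as machine-written."""
    negs = [f for it in items for f in it["negatives"]]
    return sum(f > threshold for f in negs) / len(negs)


def mean_negative_feature(items):
    negs = [f for it in items for f in it["negatives"]]
    return sum(negs) / len(negs)


def adversarial_filter(items, rounds=10, seed=0):
    """Repeatedly refit the discriminator on one split and swap out detected endings on the other."""
    rng = random.Random(seed)
    for _ in range(rounds):
        order = list(range(len(items)))
        rng.shuffle(order)
        half = len(order) // 2
        train = [items[i] for i in order[:half]]
        held_out = [items[i] for i in order[half:]]
        threshold = fit_threshold(train)
        for it in held_out:
            for i, feat in enumerate(it["negatives"]):
                if feat <= threshold:
                    continue  # this ending already fools the discriminator; keep it
                harder = [c for c in it["pool"] if c <= threshold]
                if harder:
                    pick = rng.choice(harder)
                    it["pool"].remove(pick)
                    it["negatives"][i] = pick  # replace the easily detected ending
    return items


if __name__ == "__main__":
    data = make_toy_items()
    t0 = fit_threshold(data)
    print("detected before filtering:", detection_rate(data, t0))
    adversarial_filter(data)
    print("detected after filtering: ", detection_rate(data, t0))
```

The design point the sketch tries to capture is that the discriminator is refit each round on a fresh split, so the surviving wrong endings are hard for a detector that has adapted to them, rather than for one fixed model.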

The result was a dramatic gap. Humans scored above 95 percent, while state-of-the-art models at the time, including the then-impressive BERT, scored below 48 percent. HellaSwag thereby exposed how shallow machine “commonsense” still was even as models were saturating earlier benchmarks.

The benchmark aged in an instructive way: within a few years large language models closed the gap and reported roughly human-level accuracy on it, and it became a standard line in model report cards. Its usefulness today is limited by contamination, since its examples now appear widely in training corpora, but it remains a landmark example of using adversarial filtering to build a hard evaluation.

Sources

Last verified June 7, 2026