Dynabench: Rethinking Benchmarking in NLP

“Dynabench: Rethinking Benchmarking in NLP” is an April 2021 paper led by Douwe Kiela with a large group from Facebook AI and academic collaborators. It responds to a recurring frustration in natural language processing: models race to the top of static benchmarks like GLUE yet still fail on examples that humans find easy, and once a benchmark is saturated it stops measuring progress. Dynabench proposes a moving target instead: an open platform for human-and-model-in-the-loop dataset creation and model benchmarking.

The mechanism is adversarial annotation. Human annotators try to write examples that a target model will get wrong but that another person would get right. Examples that fool the model are collected, the model is retrained on them, and the process repeats in rounds, so the benchmark keeps evolving to probe the current model’s weaknesses rather than sitting frozen. The paper demonstrated the approach on four NLP tasks (natural language inference, question answering, sentiment analysis, and hate speech detection) and argued that it produces harder, more revealing test sets that resist the saturation and contamination that plague static benchmarks.
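
The round structure described above can be sketched in a few lines of Python. This is a toy illustration only, not the Dynabench platform's code or API: the keyword-based sentiment "model", the `validate` callback standing in for a second human annotator, and the memorisation step standing in for retraining are all assumptions made for the example. In a real setting annotators would also write fresh examples against each new model, whereas the sketch reuses one candidate pool to keep it short.

```python
from typing import Callable, Dict, List, Tuple

Example = Tuple[str, str]  # (text, gold label); hypothetical format
Model = Callable[[str], str]


def make_model(memory: Dict[str, str]) -> Model:
    """Toy sentiment model: memorised answers first, then a naive keyword rule."""
    def predict(text: str) -> str:
        if text in memory:
            return memory[text]
        return "negative" if "bad" in text.lower().split() else "positive"
    return predict


def collect_round(model: Model, candidates: List[Example],
                  validate: Callable[[str, str], bool]) -> List[Example]:
    """Keep examples the model gets wrong and a second human agrees with."""
    return [(text, gold) for text, gold in candidates
            if model(text) != gold and validate(text, gold)]


def run_rounds(candidates: List[Example],
               validate: Callable[[str, str], bool],
               n_rounds: int) -> List[List[Example]]:
    """Alternate collection and (toy) retraining; return fooled examples per round."""
    memory: Dict[str, str] = {}
    history: List[List[Example]] = []
    for _ in range(n_rounds):
        model = make_model(memory)
        fooled = collect_round(model, candidates, validate)
        history.append(fooled)
        memory.update(dict(fooled))  # stand-in for retraining on the new data
    return history


candidates = [
    ("The plot was bad", "negative"),               # keyword rule gets this right
    ("Not my favourite film at all", "negative"),   # fools the rule: no "bad"
    ("Not bad at all", "positive"),                 # fools the rule: contains "bad"
]
always_agree = lambda text, gold: True  # assumed validator that accepts every label
history = run_rounds(candidates, always_agree, n_rounds=2)
```

In this toy run the first round collects the two examples that defeat the keyword rule, and the second round collects nothing because the "retrained" model has memorised them. That is the point at which annotators would need to find new weaknesses, which is what keeps a dynamic benchmark from going stale.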

Dynabench matters as an early, influential answer to a problem that has only grown: as models train on ever more of the internet, fixed benchmarks leak into training data and lose meaning. The idea of dynamic, regenerating evaluation, now seen in later benchmarks too, traces much of its momentum to this work.

Last verified June 7, 2026