Benchmark data contamination

Benchmark data contamination occurs when the test questions used to measure a model have leaked into the data the model was trained on. Modern large language models are trained on enormous web-scale corpora that inevitably sweep up the very benchmarks researchers later use to score them. When that happens, a high score may reflect the model recalling memorized answers rather than demonstrating genuine capability, and comparisons between models become unreliable.

The 2024 survey “Benchmark Data Contamination of Large Language Models” (arXiv:2406.04244) defines the problem as language models inadvertently incorporating evaluation benchmark information from their training data, leading to inaccurate or unreliable performance during evaluation. The survey and related work catalog several forms of contamination: the test text itself appearing in training data, test questions appearing together with their labels, and reworded or augmented copies that evade simple string matching. They also report that contamination rates on popular benchmarks can range from a few percent to well over half of the examples.
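The idea of simple string matching, and the way a reworded copy slips past it, can be illustrated with a minimal sketch. This is not the survey's method: the function names, the 8-word window, and the toy data are all assumptions chosen for illustration, and a real check would run against the actual training corpus.

```python
import re


def ngrams(text, n=8):
    """Return the set of word n-grams in text, lowercased and with punctuation dropped."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def contamination_rate(benchmark_items, training_docs, n=8):
    """Fraction of benchmark items that share at least one n-gram with any training document."""
    train_grams = set()
    for doc in training_docs:
        train_grams |= ngrams(doc, n)
    flagged = [item for item in benchmark_items if ngrams(item, n) & train_grams]
    return len(flagged) / len(benchmark_items) if benchmark_items else 0.0


# Toy data, invented for illustration only (hypothetical corpus and benchmark).
training_docs = [
    "Forum post: what is the capital of France and why did it become the capital? Answer: Paris.",
]
benchmark_items = [
    # Verbatim copy of text in the training document: caught by the overlap check.
    "What is the capital of France and why did it become the capital?",
    # Reworded version of the same question: no shared 8-gram, so it is missed.
    "Which city serves as France's seat of government, and how did that come about?",
]

rate = contamination_rate(benchmark_items, training_docs, n=8)
print(f"{rate:.0%} of items flagged")
```

On this toy data only the verbatim item is flagged, so exact-overlap checks of this kind give at best a lower bound on contamination; paraphrased or augmented copies need other detection approaches.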

Because contamination cannot be seen from a reported score alone, it is a quiet driver of the broader reproducibility problems in machine learning, and it makes leaderboard rankings easy to game, intentionally or not. Researchers have responded with detection methods and with dynamic or held-out benchmarks that are refreshed so models cannot have seen them.

For anyone choosing a model based on benchmark scores, the practical caution is that a number on a leaderboard is meaningful only if the model could not have trained on the questions. The most trustworthy evidence is performance on a fresh, private task that mirrors the real job at hand.
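As a rough illustration of that caution, the sketch below compares one model's accuracy on a public benchmark with its accuracy on a private task set. The helper name, the exact-match scoring, and all of the data are hypothetical. A large gap is consistent with contamination but does not prove it, since the two sets may also differ in difficulty.

```python
def accuracy(predictions, answers):
    """Share of predictions that exactly match the reference answers."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("need at least one example")
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)


# Invented outputs for one hypothetical model on two small task sets.
public_preds, public_gold = ["A", "C", "B", "D", "A"], ["A", "C", "B", "D", "B"]
private_preds, private_gold = ["A", "B", "B", "C", "A"], ["A", "C", "B", "D", "B"]

public_acc = accuracy(public_preds, public_gold)     # 4 of 5 correct
private_acc = accuracy(private_preds, private_gold)  # 2 of 5 correct
gap = public_acc - private_acc
print(f"public={public_acc:.2f} private={private_acc:.2f} gap={gap:.2f}")
```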
