The Leaderboard Illusion

“The Leaderboard Illusion” is an April 2025 paper, with authors including Sara Hooker and Sayash Kapoor, that scrutinizes Chatbot Arena, the widely cited platform where human voters compare anonymized model outputs to produce an Elo-style ranking. The paper argues that several undisclosed practices systematically distort the resulting scores, so the leaderboard reflects more than raw model quality.

The critique identifies specific mechanisms. Some providers can privately test many model variants on the Arena before deciding which to release publicly, effectively cherry-picking the best-scoring version, a kind of overfitting to the leaderboard. The authors also document unequal sampling rates, where some models are shown to voters far more often than others, and asymmetric access to the resulting battle data, which advantages large labs that can mine it to improve future models. Together these create what they call a self-reinforcing illusion of progress that favors a handful of well-resourced providers.
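
The cherry-picking effect can be illustrated with a small simulation. The sketch below is not the paper's code or the Arena's scoring method. It assumes every private variant has the same true score and that each measured score adds independent Gaussian noise. The function name `simulate_best_of_n` and all numbers (a true score of 1200, a noise standard deviation of 15, the variant counts) are hypothetical and chosen only for illustration.

```python
import random
import statistics


def simulate_best_of_n(true_score, noise_sd, n_variants, n_trials, seed=0):
    """Return the mean published score when a provider privately tests
    n_variants equally good models and publishes only the top scorer.

    Assumption (illustrative only): measured score = true score + Gaussian noise.
    """
    rng = random.Random(seed)
    published = []
    for _ in range(n_trials):
        # Every variant has the same underlying quality; only the noise differs.
        measured = [rng.gauss(true_score, noise_sd) for _ in range(n_variants)]
        # The provider keeps the best measured score and discards the rest.
        published.append(max(measured))
    return statistics.mean(published)


if __name__ == "__main__":
    # Hypothetical settings: none of these values come from the paper.
    for n in (1, 3, 10, 30):
        mean_score = simulate_best_of_n(
            true_score=1200.0, noise_sd=15.0, n_variants=n, n_trials=5000
        )
        print(f"variants tested: {n:2d}  mean published score: {mean_score:.1f}")
```

Under these assumptions the published score rises with the number of variants tested even though no variant is actually better, which is the selection bias the authors describe.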

The paper matters because Chatbot Arena had become one of the most influential public rankings of AI models. By showing that its scores can be gamed and that providers do not compete on a level playing field, it cautions against treating any single leaderboard as ground truth, and it sharpened the broader conversation about how to evaluate models fairly.
