Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference

“Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference” was submitted to arXiv on March 7, 2024 by a team led by Wei-Lin Chiang and Lianmin Zheng, with co-authors including Michael Jordan, Joseph Gonzalez, and Ion Stoica, from the LMSYS group spanning UC Berkeley, Stanford, and UCSD. It is the research paper behind the live ranking site originally called Chatbot Arena and later LMArena.

The platform’s design is simple: a visitor types a prompt, two anonymous models answer side by side, and the visitor votes for the better response. Only after voting are the model names revealed. This pairwise, blind setup turns ordinary users into evaluators on real, diverse questions rather than a fixed test set. By the time of the paper the platform had collected over 240,000 votes; it has since gathered millions.

To convert pairwise votes into a single leaderboard, the authors use statistical ranking methods: Elo-style ratings built on the Bradley-Terry model, which estimates each model’s latent quality and provides confidence intervals. The paper argues that the crowdsourced prompts are diverse and discriminating, and that aggregate human votes agree well with expert raters, defending the arena as a credible benchmark despite its uncontrolled nature.
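To make the ranking step concrete, here is a minimal sketch of fitting a Bradley-Terry model to pairwise votes. It is an illustration, not the paper's implementation: the function names (fit_bradley_terry, to_elo_scale), the toy vote data, the iterative update used for fitting, and the 400-point scale with a 1000-point anchor are all assumptions made for this example. Ties and confidence intervals are left out.

```python
import math
from collections import defaultdict


def fit_bradley_terry(battles, iters=200):
    """Estimate Bradley-Terry strengths from (winner, loser) pairs.

    Uses a simple iterative (minorization-maximization) update.
    Assumes every model wins at least once; ties are not handled.
    Returns a dict mapping model name to a strength, normalized to sum to 1.
    """
    models = sorted({m for pair in battles for m in pair})
    wins = defaultdict(int)    # total wins per model
    games = defaultdict(int)   # battles per unordered pair of models
    for winner, loser in battles:
        wins[winner] += 1
        games[frozenset((winner, loser))] += 1

    strength = {m: 1.0 / len(models) for m in models}
    for _ in range(iters):
        updated = {}
        for i in models:
            denom = sum(
                games[frozenset((i, j))] / (strength[i] + strength[j])
                for j in models
                if j != i
            )
            updated[i] = wins[i] / denom if denom > 0 else strength[i]
        total = sum(updated.values())
        strength = {m: s / total for m, s in updated.items()}
    return strength


def to_elo_scale(strength, anchor=1000.0):
    """Map strengths to an Elo-like scale centered on `anchor`.

    The factor 400 and the anchor are conventions assumed for this sketch.
    """
    mean_log = sum(math.log10(s) for s in strength.values()) / len(strength)
    return {
        m: anchor + 400.0 * (math.log10(s) - mean_log)
        for m, s in strength.items()
    }


# Hypothetical votes: each tuple is (winner, loser) for one blind battle.
battles = (
    [("A", "B")] * 7 + [("B", "A")] * 3
    + [("B", "C")] * 6 + [("C", "B")] * 4
    + [("A", "C")] * 8 + [("C", "A")] * 2
)

strength = fit_bradley_terry(battles)
ratings = to_elo_scale(strength)
for model in sorted(ratings, key=ratings.get, reverse=True):
    print(model, round(ratings[model]))
```

Under this model, the predicted chance that model A beats model B is strength["A"] / (strength["A"] + strength["B"]). One common way to attach an interval to each rating, not shown here, is to refit on resampled copies of the votes.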

The arena became influential precisely because it is hard to game with memorization: questions are fresh and human-judged, sidestepping the benchmark-contamination problem that plagues static multiple-choice tests. The main criticisms are that it rewards style and agreeableness, that the voter population is not representative, and that providers can optimize their models specifically for it.

Why business readers should care: the arena is the most-watched public signal of which chatbot people actually prefer, and it shaped how labs market model quality.
