Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena

“Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena” was submitted to arXiv on June 9, 2023 by a team led by Lianmin Zheng and Wei-Lin Chiang from the LMSYS group, with co-authors including Eric Xing, Joseph Gonzalez, and Ion Stoica. It established “LLM-as-a-judge” (using a strong model such as GPT-4 to grade other models’ open-ended answers) as a practical and reasonably trustworthy evaluation method.

The problem it addressed is that open-ended chat quality is expensive to measure. Multiple-choice benchmarks miss helpfulness and instruction-following, while human preference studies are slow and costly. The paper proposed two complementary tools: MT-Bench, a set of 80 challenging multi-turn questions across categories like writing, reasoning, math, and coding; and Chatbot Arena, a crowdsourced platform where users vote head-to-head between two models’ answers. It then tested whether GPT-4’s judgments matched human judgments on both.

The headline finding was that strong LLM judges such as GPT-4 agree with both controlled expert ratings and crowdsourced human votes more than 80 percent of the time, roughly the rate at which humans agree with each other. The paper was careful to document the method’s biases, including a tendency to favor longer answers (verbosity bias), the first answer presented (position bias), and answers the judge model itself generated (self-enhancement bias). It proposed mitigations such as judging each pair twice with the answer order swapped.
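For readers who want to see the mechanics, the order-swap mitigation and the agreement figure can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's code: the judge signature, the function names, and the two toy judges are all hypothetical stand-ins for a real model call. It judges each pair in both orders, counts a win only when the verdict holds both times, and measures agreement as the share of matching verdicts.

```python
from typing import Callable

# Hypothetical judge signature (an assumption for this sketch): it receives
# (question, first_answer, second_answer) and returns "first", "second", or "tie".
Judge = Callable[[str, str, str], str]


def judge_with_swap(judge: Judge, question: str, answer_a: str, answer_b: str) -> str:
    """Judge a pair in both presentation orders; return "A", "B", or "tie"."""
    forward = judge(question, answer_a, answer_b)   # A is shown first
    backward = judge(question, answer_b, answer_a)  # B is shown first
    if forward == "first" and backward == "second":
        return "A"  # A preferred in both orders
    if forward == "second" and backward == "first":
        return "B"  # B preferred in both orders
    return "tie"    # inconsistent or tied verdicts are not counted as a win


def agreement_rate(judge_verdicts: list[str], human_verdicts: list[str]) -> float:
    """Fraction of items on which the judge and the human gave the same verdict."""
    if not judge_verdicts or len(judge_verdicts) != len(human_verdicts):
        raise ValueError("verdict lists must be non-empty and of equal length")
    matches = sum(j == h for j, h in zip(judge_verdicts, human_verdicts))
    return matches / len(judge_verdicts)


# Toy stand-ins for a real LLM call, used only to show the behaviour.
def first_position_judge(question: str, first: str, second: str) -> str:
    """A maximally position-biased judge: always prefers the first answer."""
    return "first"


def longer_answer_judge(question: str, first: str, second: str) -> str:
    """A verbosity-biased judge: prefers whichever answer is longer."""
    if len(first) > len(second):
        return "first"
    if len(second) > len(first):
        return "second"
    return "tie"


if __name__ == "__main__":
    print(judge_with_swap(first_position_judge, "q", "x", "y"))
    print(judge_with_swap(longer_answer_judge, "q", "short", "a much longer answer"))
    print(agreement_rate(["A", "B", "tie", "A"], ["A", "B", "A", "A"]))
```

In this sketch the purely position-biased judge always ends in a tie, because its preference flips with the order. The length-biased toy judge still picks the longer answer in both orders, which illustrates why swapping addresses position bias only and the other biases need separate checks.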

Why business readers should care: automated LLM grading made it far cheaper to compare models and iterate on products, and it is now embedded in evaluation pipelines across the industry. Its biases, however, mean it supplements rather than replaces human review.

Sources

Last verified June 7, 2026