MT-Bench

MT-Bench is a benchmark for grading how well chat assistants handle open-ended, conversational questions, including follow-ups that require staying coherent across multiple turns. It consists of 80 multi-turn questions spread across eight categories, such as writing, reasoning, math, and coding. Because there is no single correct answer to questions like “write a travel itinerary,” MT-Bench uses a strong model such as GPT-4 as an automated judge that scores each response on a scale of 1 to 10.
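
As a rough sketch of the scoring step, the snippet below pulls a numeric rating out of a judge's written reply and averages the ratings across turns. It assumes the judge is instructed to end its reply with a verdict of the form "Rating: [[N]]" on a 1-to-10 scale; the function names parse_rating and average_score and the sample replies are hypothetical, chosen only for illustration.

```python
import re
from statistics import mean

# Assumed judge reply format: the verdict appears as "[[N]]", e.g. "Rating: [[7]]".
RATING_PATTERN = re.compile(r"\[\[(\d+(?:\.\d+)?)\]\]")


def parse_rating(judge_reply):
    """Return the rating found in a judge reply, or None if absent or out of range."""
    match = RATING_PATTERN.search(judge_reply)
    if match is None:
        return None
    rating = float(match.group(1))
    return rating if 1 <= rating <= 10 else None


def average_score(judge_replies):
    """Average the parseable ratings; return None if no reply could be parsed."""
    ratings = [r for r in map(parse_rating, judge_replies) if r is not None]
    return mean(ratings) if ratings else None


# Hypothetical judge replies for the two turns of one question.
replies = [
    "The itinerary is detailed and well organized. Rating: [[8]]",
    "The follow-up ignores the budget constraint. Rating: [[6]]",
]
print(average_score(replies))
```

Replies without a parseable rating are skipped rather than counted as zero, so a malformed judge reply does not silently drag a model's average down.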

MT-Bench was introduced by Lianmin Zheng, Wei-Lin Chiang, Ion Stoica, and colleagues in “Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena,” posted in June 2023. The same paper studied how reliable a model is as a judge. It found that GPT-4 agreed with human preferences more than 80 percent of the time, roughly the rate at which human evaluators agree with one another. It also identified biases: favoring longer answers (verbosity bias), preferring responses in certain positions (position bias), and rating a model’s own outputs too highly (self-enhancement bias).
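
For the bias toward answers in certain positions, one conservative mitigation described in the paper is to ask the judge twice with the two answers swapped and to count a win only when both verdicts agree. The sketch below illustrates that idea under stated assumptions: the judge is a hypothetical callable that returns "A", "B", or "tie" for the position it prefers, and the two toy judges are stand-ins for illustration, not real model calls.

```python
def consistent_verdict(judge, question, answer_1, answer_2):
    """Judge a pair in both orders; declare a winner only if both orders agree."""
    first = judge(question, answer_1, answer_2)   # answer_1 shown in position A
    second = judge(question, answer_2, answer_1)  # answer_1 shown in position B
    if first == "A" and second == "B":
        return "answer_1"
    if first == "B" and second == "A":
        return "answer_2"
    return "tie"  # inconsistent or tied verdicts are treated as a tie


# Toy stand-in (hypothetical): always prefers whatever is shown first.
def position_biased_judge(question, a, b):
    return "A"


# Toy stand-in (hypothetical): prefers the longer answer, regardless of position.
def length_judge(question, a, b):
    if len(a) == len(b):
        return "tie"
    return "A" if len(a) > len(b) else "B"


print(consistent_verdict(position_biased_judge, "q", "short", "a longer answer"))
print(consistent_verdict(length_judge, "q", "short", "a longer answer"))
```

A judge that always picks the first position ends up with a tie, so its position preference cannot decide the outcome. The swap does nothing for the length-preferring toy judge, which still picks the longer answer in both orders; verbosity bias needs a separate remedy.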

MT-Bench became a common quick check during model development because it is automated and fast, complementing slower, human-driven evaluations.

For a business reader, MT-Bench is the kind of evaluation behind many “our assistant is more helpful” claims, and understanding that a model is grading other models helps explain both the speed and the limits of such scores.
