Open LLM Leaderboard

The Open LLM Leaderboard was a public ranking run by Hugging Face that evaluated openly available language models on a fixed set of benchmarks, asking identical questions in the same order so that scores were directly comparable and reproducible. Hugging Face created it after finding that published model scores were often impossible to reproduce or had been tuned to flatter a particular model; it wanted a neutral place to separate marketing claims from real progress.

In its first version, models were scored with the EleutherAI LM Evaluation Harness on six benchmarks: ARC, HellaSwag, MMLU, TruthfulQA, Winogrande, and GSM8K. The leaderboard became a major community resource: it drew more than two million unique visitors over roughly ten months, and about 300,000 community members were active each month, submitting and discussing models. As top models began saturating these tests, Hugging Face archived the original version in June 2024 and replaced it with a second version built on harder benchmarks (including MMLU-Pro, GPQA, MuSR, MATH, IFEval, and BBH).

The leaderboard mattered because it turned a noisy field of self-reported numbers into an apples-to-apples comparison, helping practitioners choose models on evidence. The saturation of its original benchmarks also illustrated a recurring lesson in AI: benchmarks have a shelf life, and as models improve, the yardstick has to be made harder.
