LiveBench is a general-purpose benchmark designed to resist two common problems at once: test-set contamination and the unreliability of using a model to grade other models. It draws questions from recent sources such as math competitions, newly posted arXiv papers, news articles, and datasets, and it scores every answer against objective ground truth rather than a judge model.
The benchmark was introduced by Colin White, Tom Goldstein, Yann LeCun, and colleagues in “LiveBench: A Challenging, Contamination-Limited LLM Benchmark,” posted in June 2024 and accepted to ICLR 2025 as a spotlight paper. It spans six categories (math, coding, reasoning, language, instruction following, and data analysis) and adds new questions monthly so that the test set stays ahead of model training cutoffs. Several categories use harder variants of tasks from existing benchmarks such as BIG-Bench Hard and IFEval.
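To make the idea of objective scoring concrete, here is a minimal Python sketch of grading answers against stored ground truth and averaging per category, with no judge model involved. It is an illustration only, not LiveBench's actual scoring code: the function names, the normalization rule, and the strict exact-match criterion are all assumptions made for this example, and real tasks may use task-specific checks or partial credit.

```python
from collections import defaultdict


def normalize(answer: str) -> str:
    """Lowercase and collapse whitespace (hypothetical normalization rule)."""
    return " ".join(answer.strip().lower().split())


def score_answer(model_answer: str, ground_truth: str) -> float:
    """Return 1.0 on an exact match after normalization, else 0.0."""
    return 1.0 if normalize(model_answer) == normalize(ground_truth) else 0.0


def category_scores(results):
    """Average the scores within each category.

    `results` is an iterable of (category, model_answer, ground_truth) tuples.
    """
    per_category = defaultdict(list)
    for category, answer, truth in results:
        per_category[category].append(score_answer(answer, truth))
    return {cat: sum(vals) / len(vals) for cat, vals in per_category.items()}


# Toy data, invented for illustration; not real benchmark questions.
results = [
    ("math", "42", " 42 "),
    ("math", "7", "8"),
    ("reasoning", "Yes", "yes"),
]
scores = category_scores(results)
print(scores)
```

With the toy data above, the math category averages 0.5 and the reasoning category 1.0. Because each score comes from a deterministic comparison, rerunning the evaluation gives the same result every time, which is the property a judge model cannot guarantee.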
At release, the authors noted that top models scored below 70 percent, leaving clear room to distinguish stronger systems from weaker ones.
For a general reader, LiveBench shows where benchmark design is heading: toward constantly refreshed, objectively scored tests that are harder for a model to have memorized and that do not depend on another model’s judgment.