LiveCodeBench

LiveCodeBench was built to solve a growing problem with code benchmarks: once a test set is public, its problems can leak into the training data of newer models, inflating scores. LiveCodeBench avoids this by continuously collecting new problems and recording each one’s release date, so that a model can be evaluated only on problems published after its training cutoff.

The benchmark was introduced by Naman Jain, King Han, Ion Stoica, Koushik Sen, and colleagues in “LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code,” posted to arXiv in March 2024. It pulls new problems from three competitive-programming platforms: LeetCode, AtCoder, and Codeforces. A later revision of the paper reports problems published between May 2023 and May 2024. Beyond plain code generation, the benchmark also scores self-repair, code execution, and test-output prediction.

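By dating every problem, LiveCodeBench can show whether a model’s apparent skill holds up on questions it could not have seen during training. If a model scores markedly higher on problems released before its cutoff than on those released after it, that gap is a sign of contamination.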
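The date-based comparison can be sketched in a few lines of Python. This is a minimal illustration, not LiveCodeBench's actual code: the `Result` record, the `pass_rate` and `split_by_cutoff` helpers, the problem IDs, and the cutoff date are all hypothetical, and the benchmark's real scoring involves more than a single pass/fail flag per problem.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Result:
    """One model attempt on one dated problem (hypothetical record layout)."""
    problem_id: str
    released: date   # publication date of the problem
    passed: bool     # whether the model's solution passed all tests


def pass_rate(results):
    """Fraction of results that passed, or None if there are no results."""
    if not results:
        return None
    return sum(r.passed for r in results) / len(results)


def split_by_cutoff(results, cutoff):
    """Return (pass rate on/before cutoff, pass rate after cutoff)."""
    before = [r for r in results if r.released <= cutoff]
    after = [r for r in results if r.released > cutoff]
    return pass_rate(before), pass_rate(after)


# Made-up data: two problems released before the assumed cutoff, two after.
results = [
    Result("p1", date(2023, 6, 1), True),
    Result("p2", date(2023, 8, 15), True),
    Result("p3", date(2023, 11, 3), True),
    Result("p4", date(2024, 1, 20), False),
]
cutoff = date(2023, 9, 1)  # assumed training cutoff for the model

rate_before, rate_after = split_by_cutoff(results, cutoff)
print(f"before cutoff: {rate_before}, after cutoff: {rate_after}")
```

Only the post-cutoff rate counts as a contamination-free score under this scheme. The pre-cutoff rate is useful mainly as a point of comparison.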

For a general reader, LiveCodeBench captures an important shift in how the field measures progress: moving from fixed, one-time test sets toward continuously refreshed evaluations that are harder to game.
