EvalPlus (HumanEval+ and MBPP+)

EvalPlus addresses a quiet weakness in code benchmarks: if a test suite has too few test cases, plausible-looking but subtly wrong code can pass and inflate scores. EvalPlus rebuilds existing benchmarks with many more tests, generated by combining inputs written by a language model with mutation-based variations of those inputs, so that functional correctness is checked much more rigorously.
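The effect of a thin test suite can be illustrated with a minimal sketch. This is not EvalPlus code: the task (computing a median), the function names, and the three mutation operators are all hypothetical, chosen only to show how mutated inputs checked against a trusted reference solution can expose a bug that the original tests miss.

```python
def reference_median(xs):
    """Trusted solution, used here as the oracle for expected outputs."""
    s = sorted(xs)
    n = len(s)
    mid = n // 2
    return s[mid] if n % 2 else (s[mid - 1] + s[mid]) / 2


def buggy_median(xs):
    """Plausible-looking candidate that mishandles even-length lists."""
    s = sorted(xs)
    return s[len(s) // 2]


def mutations(xs):
    """Yield simple mutants of a list of ints (hypothetical operators)."""
    for i in range(len(xs)):
        yield xs[:i] + [xs[i] + 1] + xs[i + 1:]  # tweak one element
        if len(xs) > 1:
            yield xs[:i] + xs[i + 1:]            # drop one element
    yield xs + [0]                               # append one element


def expand(seeds, rounds=1):
    """Grow the input pool by mutating the most recently added inputs."""
    pool = [list(s) for s in seeds]
    frontier = pool
    for _ in range(rounds):
        frontier = [m for xs in frontier for m in mutations(xs)]
        pool.extend(frontier)
    return pool


def passes(candidate, inputs):
    """A candidate passes if it agrees with the reference on every input."""
    return all(candidate(x) == reference_median(x) for x in inputs)


# The original, sparse tests happen to use only odd-length lists.
seed_inputs = [[3, 1, 2], [5]]
expanded_inputs = expand(seed_inputs)

print("passes original tests:", passes(buggy_median, seed_inputs))
print("passes expanded tests:", passes(buggy_median, expanded_inputs))
```

In this sketch the buggy function passes the two original inputs but fails once mutation produces an even-length list such as [1, 2]. The same pattern, at a much larger scale, is what allows a strengthened suite to lower a model's measured pass rate.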

The framework was introduced by Jiawei Liu, Chunqiu Steven Xia, Yuyao Wang, and Lingming Zhang in “Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation,” posted in May 2023. Applying it produced HumanEval+, which expands HumanEval's test suite roughly 80-fold, and MBPP+. The stronger tests caught large amounts of previously undetected wrong code, lowering pass rates by as much as 19.3 to 28.9 percent, and even reshuffled the leaderboard so that some models that trailed on the original HumanEval came out ahead.

The result was a widely cited warning that benchmark scores can be artifacts of weak test coverage rather than evidence of true ability.

For a business reader, EvalPlus underscores that “passes the benchmark” is not the same as “writes correct code,” and that the quality of the test itself determines how much a score can be trusted.

Sources

Last verified June 7, 2026