HumanEval

HumanEval measures whether an AI model can turn a written description of a programming task into code that actually works. The benchmark consists of 164 hand-written Python problems. Each problem provides a function signature and a docstring, and the model’s generated code is run against unit tests that are not shown to the model in the prompt. A solution counts only if it passes all of those tests, so the benchmark rewards functional correctness rather than code that merely looks plausible.
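The grading loop described above can be sketched in a few lines of Python. This is an illustrative sketch, not the official harness: the problem `running_max`, the sample completion, and the helper `passes` are all hypothetical and are not taken from the dataset. The layout (a prompt, a test block defining `check(candidate)`, and an entry-point name) is modeled on the fields of the released dataset.

```python
# Hypothetical HumanEval-style problem. The model sees only PROMPT.
PROMPT = '''def running_max(numbers):
    """Return a list where element i is the largest value in numbers[:i + 1].

    >>> running_max([1, 3, 2, 5, 4])
    [1, 3, 3, 5, 5]
    """
'''

# An illustrative completion a model might return (function body only).
COMPLETION = '''    result = []
    best = None
    for n in numbers:
        best = n if best is None else max(best, n)
        result.append(best)
    return result
'''

# Unit tests that are kept out of the prompt.
TESTS = '''def check(candidate):
    assert candidate([]) == []
    assert candidate([1, 3, 2, 5, 4]) == [1, 3, 3, 5, 5]
    assert candidate([4, 4, 1]) == [4, 4, 4]
'''


def passes(prompt, completion, tests, entry_point):
    """Return True if prompt + completion passes every assertion in tests.

    Assumption: exec() in a fresh namespace stands in for real isolation.
    It offers no sandboxing and no timeout, so it must not be used on
    untrusted model output.
    """
    namespace = {}
    try:
        # Define the candidate function and the check() function.
        exec(prompt + completion + "\n" + tests, namespace)
        # Run the tests against the candidate.
        namespace["check"](namespace[entry_point])
        return True
    except Exception:
        # Syntax errors, runtime errors, and failed assertions all count as a fail.
        return False


if __name__ == "__main__":
    print(passes(PROMPT, COMPLETION, TESTS, "running_max"))
    print(passes(PROMPT, "    return numbers\n", TESTS, "running_max"))
```

The second call shows why plausible-looking code is not enough: a completion that simply returns its input runs without error but fails an assertion, so it is scored as incorrect.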

HumanEval was introduced by OpenAI in the 2021 paper “Evaluating Large Language Models Trained on Code,” which lists 58 authors, among them first author Mark Chen, Alec Radford, Ilya Sutskever, and Wojciech Zaremba. The paper describes Codex, the model family that underpinned early AI coding assistants such as GitHub Copilot. The dataset is released openly through OpenAI’s human-eval GitHub repository.

HumanEval became a standard because it offered a simple, reproducible, automatically graded measure of code generation at a time when that capability was rapidly emerging. For years it was the default headline number for “can this model code.”

Current pass rates are tracked on official and community leaderboards and change as models improve, so they are not reproduced here.

Sources

Related