pass@k

pass@k is the headline metric used to score AI models on code-generation benchmarks. The idea is simple: ask the model to produce k candidate solutions to a problem, run them against the test cases, and count the problem as solved if at least one candidate passes. pass@1 measures whether a single attempt works, while pass@10 or pass@100 measures whether the model can solve a problem given several tries.

The metric was introduced and formalized by Mark Chen, Jerry Tworek, Wojciech Zaremba, and colleagues at OpenAI in “Evaluating Large Language Models Trained on Code,” the 2021 paper behind Codex and the HumanEval benchmark. The authors noted that repeated sampling is a surprisingly effective strategy, reporting that their model solved 70.2 percent of problems when allowed 100 samples per problem, compared with 28.8 percent from a single sample. Because drawing exactly k samples per problem gives a noisy estimate, they instead generate a larger pool of n samples per problem (n = 200 in the paper), count the number c that pass the unit tests, and compute the unbiased estimate 1 - C(n - c, k) / C(n, k), where C denotes the binomial coefficient. The reported score is the average of that quantity over all problems in the benchmark.
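
The pooled estimate described above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's reference code: the function names pass_at_k and benchmark_pass_at_k are hypothetical, and the sketch assumes that for each problem you already know n, the number of samples generated, and c, the number that passed the tests.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem from n samples, c of which passed.

    Computes 1 - C(n - c, k) / C(n, k): one minus the probability that a
    random subset of k samples drawn from the pool contains no passing one.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # math.comb returns 0 when n - c < k, so the estimate is exactly 1.0
    # whenever every size-k subset must contain at least one passing sample.
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average the per-problem estimates over (n, c) pairs, one per problem."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


if __name__ == "__main__":
    # Hypothetical pool: 200 samples per problem, with 0, 3, and 40 passing.
    pool = [(200, 0), (200, 3), (200, 40)]
    for k in (1, 10, 100):
        print(f"pass@{k} = {benchmark_pass_at_k(pool, k):.3f}")
```

Exact integer binomials keep the sketch short. For very large n, a running product of the terms 1 - k / i can be used to avoid building huge intermediate integers.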

The choice of k matters when interpreting results. A high pass@100 paired with a low pass@1 means the model can produce a correct answer somewhere among its samples but usually does not on its first try. That gap matters most when there is no automatic check, such as a test suite, to tell which sample is the right one.
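
To see how wide the gap between low and high k can be, consider a simplified model in which every sample for a problem passes independently with the same probability p. That independence assumption and the function name pass_at_k_from_rate are illustrative choices made here and do not come from the paper. Under that assumption pass@k equals 1 - (1 - p)^k, so a model that succeeds on only 5 percent of individual attempts still succeeds more than 99 percent of the time when given 100 tries.

```python
def pass_at_k_from_rate(p: float, k: int) -> float:
    """pass@k under the idealized assumption that each of the k samples
    passes independently with probability p."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability between 0 and 1")
    if k < 1:
        raise ValueError("k must be at least 1")
    # The only way to fail is for all k independent samples to fail.
    return 1.0 - (1.0 - p) ** k


if __name__ == "__main__":
    # Hypothetical per-sample success rate of 5 percent.
    for k in (1, 10, 100):
        print(f"k = {k:>3}: pass@k = {pass_at_k_from_rate(0.05, k):.3f}")
```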

For a business reader, pass@k clarifies what a coding benchmark really claims. A strong pass@1 means the assistant tends to get it right immediately, whereas strong results only at high k mean a human or a test suite still has to sift the good answers from the bad.
