MBPP (Mostly Basic Programming Problems)

MBPP, short for Mostly Basic Programming Problems, is one of the standard benchmarks for whether a model can write working Python code from a plain-English description. Each task pairs a short problem statement with a reference solution and three test cases, and a generated function counts as correct only if it passes those tests.
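To make the pass/fail rule concrete, here is a minimal sketch of how one task might be checked. The task text, the tests, and the helper `passes_all_tests` are invented for illustration. The field names `text` and `test_list` are meant to mirror the released dataset's layout and should be treated as assumptions. This is not the authors' evaluation harness.

```python
# Hypothetical MBPP-style task: the description and tests are made up for illustration.
task = {
    "text": "Write a function to find the largest of three numbers.",
    "test_list": [
        "assert max_of_three(1, 2, 3) == 3",
        "assert max_of_three(10, 4, 7) == 10",
        "assert max_of_three(-5, -2, -9) == -2",
    ],
}


def passes_all_tests(candidate_source: str, test_list: list[str]) -> bool:
    """Return True only if every assert-style test passes against the candidate."""
    namespace: dict = {}
    try:
        # Define the candidate function in an isolated namespace.
        exec(candidate_source, namespace)
        # Run each test; a failed assert raises AssertionError.
        for test in test_list:
            exec(test, namespace)
    except Exception:
        # Syntax errors, runtime errors, and failed asserts all count as a failure.
        return False
    return True


# Two stand-ins for model output: one correct, one wrong.
good = "def max_of_three(a, b, c):\n    return max(a, b, c)\n"
bad = "def max_of_three(a, b, c):\n    return a\n"

print(passes_all_tests(good, task["test_list"]))  # True
print(passes_all_tests(bad, task["test_list"]))   # False
```

A benchmark score is then the share of tasks for which a generated solution passes. Because `exec` runs arbitrary code, a real harness would run candidates in a sandbox with time limits. The sketch leaves that out for brevity.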

The benchmark was introduced by Jacob Austin, Augustus Odena, Quoc Le, Charles Sutton, and colleagues at Google in “Program Synthesis with Large Language Models,” posted in August 2021. It contains 974 tasks meant to be solvable by entry-level programmers, covering common operations like string manipulation, list processing, and simple math.

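The paper showed that code-generation performance scales log-linearly with model size across models ranging from 244 million to 137 billion parameters. The largest model, without fine-tuning on a code dataset, solved 59.6 percent of MBPP problems using few-shot prompting. The authors also studied human-in-the-loop feedback, where a person gives natural-language feedback after a failed attempt, and found that it halved the error rate compared with the model's initial prediction.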

MBPP is usually reported together with HumanEval as a paired measure of coding skill. For a business reader, these benchmarks underpin many of the “AI can write code” claims and are the numbers vendors point to when comparing coding assistants.

Last verified June 7, 2026