PlanBench is a benchmark for evaluating whether large language models can plan and reason about change, introduced by Karthik Valmeekam and colleagues from Subbarao Kambhampati’s group in a June 2022 paper. Instead of using commonsense scenarios where a model might lean on memorized text, PlanBench draws on domains from the classical automated-planning community, such as Blocksworld, where a correct plan is a precise sequence of actions that can be verified mechanically against the rules of the domain. This lets the benchmark separate genuine plan generation from plausible-sounding text.
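To make "verified mechanically" concrete, here is a minimal sketch of a plan checker for a toy Blocksworld. It is not PlanBench's own tooling. The fact encoding, the four-operator action set (`pickup`, `putdown`, `unstack`, `stack`), and the function names `apply_action` and `validate_plan` are all assumptions chosen for illustration, following the standard textbook formulation of the domain.

```python
from typing import FrozenSet, Iterable, Optional, Tuple

Fact = Tuple[str, ...]    # e.g. ("on", "a", "b") or ("handempty",)
Action = Tuple[str, ...]  # e.g. ("unstack", "a", "b")
State = FrozenSet[Fact]


def apply_action(state: State, action: Action) -> Optional[State]:
    """Return the successor state, or None if a precondition is unmet.

    Assumed textbook Blocksworld operators (illustrative only). In this
    formulation every precondition is also deleted by the action.
    """
    name, *args = action
    if name == "pickup":
        (x,) = args
        pre = {("clear", x), ("ontable", x), ("handempty",)}
        add = {("holding", x)}
    elif name == "putdown":
        (x,) = args
        pre = {("holding", x)}
        add = {("clear", x), ("ontable", x), ("handempty",)}
    elif name == "unstack":
        x, y = args
        pre = {("on", x, y), ("clear", x), ("handempty",)}
        add = {("holding", x), ("clear", y)}
    elif name == "stack":
        x, y = args
        pre = {("holding", x), ("clear", y)}
        add = {("on", x, y), ("clear", x), ("handempty",)}
    else:
        raise ValueError(f"unknown action: {name}")

    if not pre <= state:  # some precondition does not hold
        return None
    return frozenset((state - pre) | add)


def validate_plan(init: Iterable[Fact], goal: Iterable[Fact],
                  plan: Iterable[Action]) -> bool:
    """A plan is valid if every step is applicable and the goal holds at the end."""
    state: Optional[State] = frozenset(init)
    for action in plan:
        state = apply_action(state, action)
        if state is None:
            return False
    return frozenset(goal) <= state


# Block a sits on block b; the goal is to have b on a instead.
init = [("on", "a", "b"), ("clear", "a"), ("ontable", "b"), ("handempty",)]
goal = [("on", "b", "a")]

good_plan = [("unstack", "a", "b"), ("putdown", "a"),
             ("pickup", "b"), ("stack", "b", "a")]
# Reads plausibly, but b is not clear at the start, so the first step fails.
bad_plan = [("pickup", "b"), ("stack", "b", "a")]

print(validate_plan(init, goal, good_plan))
print(validate_plan(init, goal, bad_plan))
```

The checker never looks at how a plan is worded, only at whether each action's preconditions hold in the current state and whether the goal facts hold at the end. That is the property that lets a benchmark of this kind tell a working plan from one that merely sounds right.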
The results were a sharp corrective to claims that LLMs can plan. The authors found that even state-of-the-art models fell well short on core tasks such as generating a valid plan from scratch, and that apparent successes often did not survive when the problems were perturbed or generalized. The benchmark and its follow-up studies became a central reference point in the debate over whether language models reason or merely retrieve and recombine patterns, and the group has continued to test newer reasoning models against it.
PlanBench matters because planning is exactly what many proposed AI agents are supposed to do: take a goal and produce a reliable, executable sequence of steps. A benchmark that grades plans by whether they actually work, not whether they sound right, is a needed check on agent hype.