Let's Verify Step by Step (process supervision)

“Let’s Verify Step by Step,” submitted to arXiv on May 31, 2023 by Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe of OpenAI, asked a sharp question about how to train models to reason reliably: should you reward only the final answer, or each step along the way?

The paper compares two ways of training a reward model. Outcome supervision gives feedback only on whether the final answer is correct. Process supervision gives feedback on every intermediate reasoning step. The authors found that process supervision significantly outperforms outcome supervision for training a reliable verifier, meaning a reward model used to rank candidate solutions and pick the best one. Using their process-supervised reward model to select among candidate solutions, they solved 78 percent of problems from a representative subset of the MATH test set.
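To make the ranking idea concrete, here is a minimal sketch of best-of-N selection with a process reward model. It assumes the verifier has already assigned each step a probability of being correct, and it scores a whole solution as the product of those step probabilities, in line with how the paper describes scoring a solution. The candidate data and the names `prm_solution_score` and `best_of_n` are hypothetical and used only for illustration.

```python
from math import prod


def prm_solution_score(step_probs):
    """Score a solution as the probability that every step is correct,
    taken here as the product of the per-step correctness probabilities."""
    return prod(step_probs)


def best_of_n(candidates):
    """Return the candidate solution with the highest solution-level score."""
    return max(candidates, key=lambda c: prm_solution_score(c["step_probs"]))


# Hypothetical candidates: the step probabilities would come from a trained PRM.
candidates = [
    {"answer": "42", "step_probs": [0.99, 0.98, 0.97]},
    {"answer": "41", "step_probs": [0.99, 0.30, 0.99]},  # one doubtful step
    {"answer": "42", "step_probs": [0.95, 0.90, 0.92]},
]

best = best_of_n(candidates)
print(best["answer"])
```

A single low-confidence step pulls the whole solution's score down, which is the practical difference from an outcome-only verifier that sees just the final answer.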

To support related research, the team released PRM800K, the complete dataset of 800,000 step-level human feedback labels used to train their best reward model. Process reward models (PRMs) of this kind became a core ingredient in later test-time-compute systems, where a verifier scores intermediate steps to guide search.
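As a rough illustration of a verifier guiding search, the sketch below runs a greedy step-by-step search that keeps whichever proposed next step the scorer rates highest. This is not the procedure from the paper, only one simple way such a system could be wired. The functions `propose`, `score_step`, and `is_done` are hypothetical stand-ins: in a real system a language model would propose steps and a PRM would score them, while here a toy arithmetic task takes their place so the example runs on its own.

```python
def guided_search(propose, score_step, is_done, max_steps=10):
    """Greedy search: at each round, keep the highest-scoring proposed step."""
    steps = []
    for _ in range(max_steps):
        options = propose(steps)
        if not options:
            break
        steps.append(max(options, key=lambda s: score_step(steps, s)))
        if is_done(steps):
            break
    return steps


# Toy stand-ins (assumptions): reach TARGET by adding 1, 2, or 3 per step.
TARGET = 7


def propose(steps):
    return [1, 2, 3]


def score_step(steps, step):
    # Prefer steps that move closer to the target; reject overshooting.
    total = sum(steps) + step
    return -abs(TARGET - total) if total <= TARGET else float("-inf")


def is_done(steps):
    return sum(steps) == TARGET


path = guided_search(propose, score_step, is_done)
print(path)
```

Wider variants keep several partial solutions alive at once instead of one, trading more compute for a better chance of recovering from an early bad step.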

Why business readers should care: this paper established that checking a model’s work step by step, not just its final answer, produces more trustworthy reasoning. The same idea underlies modern reasoning models and any system that needs to catch errors in a chain of logic, not just at the end.
