Process Reward Model (PRM)

A process reward model, or PRM, is a model trained to judge the quality of each individual step in a chain of reasoning rather than only the correctness of the final answer. It contrasts with an outcome reward model (ORM), which scores only the end result. PRMs became prominent through OpenAI’s 2023 paper “Let’s Verify Step by Step.”

The distinction matters because a model can reach a correct final answer through flawed reasoning, or make its only error late in an otherwise sound chain. By labeling each step rather than only the outcome, a PRM gives much denser feedback and shows where a solution went wrong. The “Let’s Verify Step by Step” authors found that process supervision produced a substantially more reliable verifier than outcome supervision, and released PRM800K, a dataset of 800,000 step-level human feedback labels, to train such models. Used to pick the best of many sampled solutions, their best PRM solved 78 percent of the problems in a representative subset of the MATH test set.
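To make the idea of step-level feedback concrete, here is a minimal Python sketch. It assumes a PRM has already produced a correctness probability for each step; the numbers, the function names, and the 0.5 threshold are all hypothetical. The solution-level score is taken as the product of the step probabilities, which is one common aggregation (taking the minimum is another) and treats the steps as independent.

```python
from math import prod
from typing import Optional, Sequence


def solution_score(step_probs: Sequence[float]) -> float:
    """Aggregate per-step probabilities into one score for the whole solution.

    Uses the product, i.e. the probability that every step is correct if the
    steps are treated as independent. This is an illustrative choice.
    """
    return prod(step_probs)


def first_suspect_step(step_probs: Sequence[float],
                       threshold: float = 0.5) -> Optional[int]:
    """Return the index of the first step scored below the threshold, if any."""
    for index, prob in enumerate(step_probs):
        if prob < threshold:
            return index
    return None


# Hypothetical PRM outputs for two three-step solutions.
sound_chain = [0.98, 0.95, 0.97]
flawed_chain = [0.98, 0.20, 0.97]  # the middle step looks wrong

print(solution_score(sound_chain) > solution_score(flawed_chain))
print(first_suspect_step(flawed_chain))
```

An outcome reward model would return a single number for each chain. The step-level scores additionally locate the suspect step, which is the denser feedback described above.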

PRMs are central to test-time-compute systems. When a model generates many candidate solutions, a PRM can score the reasoning steps and steer a search toward the most promising paths, or rank final candidates. This makes them a building block of modern reasoning models that explore multiple solution paths before committing to an answer.
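Ranking final candidates, often called best-of-N selection, can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a production design: `toy_scorer` is a hypothetical stand-in for a trained PRM, and the product aggregation is one choice among several.

```python
from math import prod
from typing import Callable, List, Sequence

Solution = List[str]                                  # one string per reasoning step
StepScorer = Callable[[Sequence[str]], List[float]]   # per-step probabilities


def best_of_n(candidates: Sequence[Solution], score_steps: StepScorer) -> Solution:
    """Return the candidate whose steps have the highest product of scores."""
    return max(candidates, key=lambda steps: prod(score_steps(steps)))


def toy_scorer(steps: Sequence[str]) -> List[float]:
    """Hypothetical stand-in for a PRM: distrust any step marked with '??'."""
    return [0.2 if "??" in step else 0.9 for step in steps]


candidates = [
    ["x + 2 = 5", "x = 5 + 2 ??", "x = 7"],
    ["x + 2 = 5", "x = 5 - 2", "x = 3"],
]
chosen = best_of_n(candidates, toy_scorer)
print(chosen[-1])
```

Because a product shrinks as steps are added, this aggregation tends to favor shorter solutions; a real system might normalize for length or use the minimum step score instead. Steering a search works on the same principle, except that partial solutions are scored and pruned as they are generated rather than after they are complete.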

Why business readers should care: a process reward model is how an AI system can catch a mistake in the middle of a chain of logic, not just at the end. For applications where the reasoning has to be auditable, such as finance, law, and medicine, step-level verification is a mechanism that helps make that possible.
