Richard Bellman developed dynamic programming at the RAND Corporation in the early 1950s. He set out the theory in the RAND paper “The Theory of Dynamic Programming” (P-550, 1954), the text of an address to the American Mathematical Society, and then in his 1957 book “Dynamic Programming,” published by Princeton University Press. Dynamic programming is a method for solving complex decision problems by breaking them into smaller subproblems and solving them in a recursive, stage-by-stage way. Bellman reportedly chose the deliberately bland name “dynamic programming” in part to keep RAND’s military funders from objecting to abstract mathematical research.
The core result, the Bellman equation, expresses the value of being in a state as the immediate reward plus the discounted value of the best state you can reach next. This recursive relationship is the mathematical foundation of optimal sequential decision-making under uncertainty, and it is exactly the structure that reinforcement learning algorithms try to solve. Bellman also coined the phrase “the curse of dimensionality” to describe how the number of states explodes as a problem gains variables.
Reinforcement learning can be understood as a set of techniques for approximately solving Bellman’s equations when the environment is too large or too unknown to solve exactly. Temporal-difference learning, Q-learning, and value iteration are all built on the Bellman update. Sutton and Barto’s textbook treats Markov decision processes and dynamic programming as the bedrock on which the rest of the field stands.