REINFORCE and policy-gradient methods (Williams, 1992)

Ronald J. Williams published “Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning” in Machine Learning in 1992 (Vol. 8, pages 229-256). The paper introduced a family of algorithms that Williams named REINFORCE, and it helped establish the policy-gradient approach to reinforcement learning.

Most early reinforcement learning methods estimated the value of states or actions and then chose actions greedily with respect to those estimates. Policy-gradient methods take a different route: they represent the policy itself as a parameterized function, often a neural network with stochastic outputs, and adjust its weights directly in the direction that increases expected reward. Williams showed that the REINFORCE updates move the weights, on average, along the gradient of expected reinforcement without ever explicitly computing or storing that gradient, and that the scheme integrates naturally with backpropagation.
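
As a rough illustration of the idea, the sketch below applies a REINFORCE-style update to a toy two-armed bandit. Each weight is nudged by a step size times (reward minus baseline) times the derivative of the log-probability of the sampled action, which Williams calls the characteristic eligibility. The bandit setup, the softmax policy, the running-average baseline, and every name and constant here (reinforce_bandit, reward_probs, alpha, the 0.05 baseline rate) are assumptions chosen for illustration and are not taken from the paper.

```python
import math
import random


def softmax(prefs):
    """Turn action preferences into probabilities (numerically stable)."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_bandit(reward_probs, steps=5000, alpha=0.1, seed=0):
    """Run a REINFORCE-style update on a Bernoulli bandit.

    reward_probs[i] is the hypothetical chance that arm i pays reward 1.
    Returns the learned action probabilities.
    """
    rng = random.Random(seed)
    theta = [0.0] * len(reward_probs)  # policy weights, one per arm
    baseline = 0.0                     # running average of past rewards

    for _ in range(steps):
        probs = softmax(theta)
        # Sample an action from the stochastic policy.
        action = rng.choices(range(len(theta)), weights=probs)[0]
        reward = 1.0 if rng.random() < reward_probs[action] else 0.0

        # For a softmax policy, d log pi(action) / d theta_k
        # equals 1[k == action] - pi(k).
        for k in range(len(theta)):
            eligibility = (1.0 if k == action else 0.0) - probs[k]
            theta[k] += alpha * (reward - baseline) * eligibility

        # Assumed design choice: track the baseline as a slow running mean.
        baseline += 0.05 * (reward - baseline)

    return softmax(theta)


if __name__ == "__main__":
    # Arm 1 pays off more often, so its probability should grow.
    print(reinforce_bandit([0.2, 0.8]))
```

The loop never forms the gradient of expected reward. It uses only the sampled action, the observed reward, and the log-probability derivative, and the average of those noisy steps points along that gradient.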

This work was the conceptual seed for a large branch of modern reinforcement learning. Policy-gradient methods are well suited to continuous action spaces and to large neural-network policies, and they underlie later algorithms such as Trust Region Policy Optimization and Proximal Policy Optimization, the latter of which is widely used to fine-tune large language models with human feedback.
