Proximal Policy Optimization (PPO)

Proximal Policy Optimization was introduced in “Proximal Policy Optimization Algorithms,” posted to arXiv on July 20, 2017 by John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov at OpenAI. PPO is a policy-gradient method: it improves an agent by alternating between collecting experience from the environment and optimizing a surrogate objective with stochastic gradient ascent.

PPO’s appeal is that it retains much of the stability of the earlier Trust Region Policy Optimization (TRPO) while being far simpler to implement and tune. TRPO solves a constrained optimization problem at each step to limit how far the policy may move. PPO instead maximizes a clipped surrogate objective, which removes the incentive to move the new policy too far from the old one in a single update and needs only first-order gradient methods. Because the clipping keeps updates conservative, PPO can reuse each batch of collected data for several epochs of minibatch updates, which improves sample efficiency, and it works well across a wide range of tasks with little hyperparameter tuning.
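The clipped objective can be made concrete with a small sketch. The function below is an illustration, not a reference implementation: the name `clipped_surrogate`, its argument names, and the default `clip_eps=0.2` are assumptions chosen for this example (0.2 is a commonly used clip range). It takes per-sample log-probabilities under the new and old policies plus advantage estimates, and returns the batch mean of min(r·A, clip(r, 1−ε, 1+ε)·A), where r is the probability ratio between the new and old policies.

```python
import math

def clipped_surrogate(new_logp, old_logp, advantages, clip_eps=0.2):
    """Mean clipped surrogate objective over a batch (a quantity to maximize).

    new_logp, old_logp: log-probabilities of the taken actions under the
    new and old policies. advantages: advantage estimates for those actions.
    All names and the default clip_eps are illustrative assumptions.
    """
    total = 0.0
    for lp_new, lp_old, adv in zip(new_logp, old_logp, advantages):
        # Probability ratio pi_new(a|s) / pi_old(a|s), computed from log-probs.
        ratio = math.exp(lp_new - lp_old)
        # Restrict the ratio to the interval [1 - eps, 1 + eps].
        clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        # Take the smaller of the two terms, so moving the ratio outside
        # the interval never increases the objective.
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)

# Two samples whose ratios (about 1.5 and 0.5) fall outside [0.8, 1.2].
old = [math.log(0.5), math.log(0.5)]
new = [math.log(0.75), math.log(0.25)]
adv = [1.0, -1.0]
objective = clipped_surrogate(new, old, adv)
```

In this example the first sample contributes 1.2 rather than 1.5, and the second contributes -0.8 rather than -0.5, so the objective is about 0.2. Pushing either ratio further from 1 would not raise it, which is what discourages large policy changes. In practice this quantity would be computed on tensors in an automatic-differentiation framework and maximized over several epochs of minibatches drawn from the same batch of collected data.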

That combination of stability, simplicity, and generality made PPO the default reinforcement learning algorithm for much of the field. Its most consequential application came later: PPO is the policy-optimization algorithm at the heart of reinforcement learning from human feedback (RLHF), the technique used to align ChatGPT and other large language models with human preferences.
