Trust Region Policy Optimization (TRPO)

Trust Region Policy Optimization was introduced in a paper of the same name posted to arXiv on February 19, 2015 and presented at ICML 2015, by John Schulman, Sergey Levine, Philipp Moritz, Michael I. Jordan, and Pieter Abbeel, then at UC Berkeley. TRPO addressed a long-standing problem with policy-gradient methods: training was unstable, because a single overly large update could collapse a policy that had taken a long time to learn.

TRPO’s fix is a trust region. At each step it improves the policy but constrains how much the new policy is allowed to differ from the old one, measured by the average KL divergence between them. The authors derived an iterative procedure with a guarantee of monotonic improvement and then turned it into a practical algorithm. They showed it could train large neural-network policies on hard tasks, including simulated robotic locomotion such as swimming, hopping, and walking, and Atari games from raw images, while staying stable with minimal tuning.
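The trust-region idea above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a toy policy whose parameters are a table of per-state action logits, and it takes a proposed update `direction` as given (in TRPO that direction comes from a conjugate-gradient solve, which is omitted here). The function names and the default values `max_kl=0.01` and `shrink=0.5` are illustrative assumptions. The sketch shrinks the step until the average KL divergence from the old policy is within the limit and the importance-weighted advantage estimate has improved.

```python
import numpy as np


def softmax(logits):
    """Row-wise softmax, shifted by the row max for numerical stability."""
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def mean_kl(old_probs, new_probs):
    """Average over states of KL(old || new) for categorical action distributions."""
    per_state = np.sum(old_probs * (np.log(old_probs) - np.log(new_probs)), axis=1)
    return float(np.mean(per_state))


def surrogate(old_probs, new_probs, actions, advantages):
    """Importance-weighted advantage estimate for the sampled (state, action) pairs."""
    idx = np.arange(len(actions))
    ratio = new_probs[idx, actions] / old_probs[idx, actions]
    return float(np.mean(ratio * advantages))


def trust_region_step(old_logits, direction, actions, advantages,
                      max_kl=0.01, shrink=0.5, max_tries=10):
    """Backtrack along `direction` until the KL limit holds and the surrogate improves.

    Assumption: the policy is a table of logits, one row per sampled state.
    Returns the accepted logits and step size, or the old logits and 0.0
    if no candidate step satisfies both conditions.
    """
    old_probs = softmax(old_logits)
    baseline = surrogate(old_probs, old_probs, actions, advantages)
    step = 1.0
    for _ in range(max_tries):
        new_logits = old_logits + step * direction
        new_probs = softmax(new_logits)
        within_region = mean_kl(old_probs, new_probs) <= max_kl
        improved = surrogate(old_probs, new_probs, actions, advantages) > baseline
        if within_region and improved:
            return new_logits, step
        step *= shrink  # proposed change was too large or unhelpful: try a smaller one
    return old_logits, 0.0
```

Calling `trust_region_step` with a deliberately oversized `direction` shows the intended behavior: the full step is rejected because it moves the policy too far, and a smaller multiple of the same direction is accepted instead.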

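TRPO was an important step toward making deep reinforcement learning stable enough to train large policies reliably, but its constrained update is complex to implement: it relies on second-order information, using a conjugate-gradient solve followed by a line search. Two years later, in 2017, the same lead author, John Schulman, distilled the core idea into the much simpler Proximal Policy Optimization (PPO), which uses only first-order optimization, kept the stability benefits, and became the field’s default method.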
