Generalized Advantage Estimation (GAE) was introduced in “High-Dimensional Continuous Control Using Generalized Advantage Estimation,” posted to arXiv on June 8, 2015 by John Schulman, Philipp Moritz, Sergey Levine, Michael Jordan, and Pieter Abbeel at UC Berkeley. It tackles a core difficulty in policy-gradient reinforcement learning: gradient estimates built from sampled returns are unbiased but have very high variance, which makes training slow and unstable.
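The advantage function measures how much better an action is than the policy’s average behavior in a given state. GAE provides an exponentially weighted estimator of that advantage. Alongside the usual discount factor γ, it is governed by a parameter λ between 0 and 1 that trades off bias against variance: values near 0 lean on the learned value function, giving low variance but more bias, while values near 1 lean on sampled returns, giving less bias but more variance. Choosing an intermediate value cuts variance substantially while adding only a tolerable amount of bias, so the agent can learn from fewer samples. The paper demonstrated the method on hard 3D locomotion tasks, with neural network policies that map raw kinematic state directly to joint torques.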
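The estimator described above can be sketched in a few lines. This is a minimal NumPy sketch under stated assumptions, not the paper's reference implementation: the function name compute_gae, its argument names, and the default values gamma=0.99 and lam=0.95 are illustrative choices, and the trajectory segment is assumed to contain no episode boundary. It forms the one-step temporal-difference residuals delta_t = r_t + gamma * V(s_{t+1}) - V(s_t) and accumulates them backward in time with weight gamma * lam per step.

```python
import numpy as np


def compute_gae(rewards, values, last_value, gamma=0.99, lam=0.95):
    """Estimate advantages for one trajectory segment.

    rewards[t]  : reward received after acting in state s_t
    values[t]   : value-function estimate V(s_t)
    last_value  : V of the state following the final step
                  (use 0.0 if the segment ended in a terminal state)
    gamma, lam  : discount and GAE parameter (illustrative defaults)
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    values = np.asarray(values, dtype=np.float64)

    # V(s_{t+1}) for every step; the final step bootstraps from last_value.
    next_values = np.append(values[1:], last_value)

    # One-step temporal-difference residuals.
    deltas = rewards + gamma * next_values - values

    # Backward recursion: A_t = delta_t + gamma * lam * A_{t+1}.
    advantages = np.zeros_like(deltas)
    running = 0.0
    for t in reversed(range(len(deltas))):
        running = deltas[t] + gamma * lam * running
        advantages[t] = running
    return advantages


if __name__ == "__main__":
    rewards = [1.0, 1.0, 1.0]
    values = [0.0, 0.0, 0.0]
    # lam=0 keeps only the one-step residual; lam=1 sums all future residuals.
    print(compute_gae(rewards, values, last_value=0.0, gamma=1.0, lam=0.0))
    print(compute_gae(rewards, values, last_value=0.0, gamma=1.0, lam=1.0))
```

Setting lam to 0 reduces the estimate to the single-step residual, which depends heavily on the accuracy of the value function. Setting lam to 1 yields the discounted sum of rewards minus the value baseline, which is noisier but does not inherit the value function's errors.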
GAE is a quiet but foundational piece of modern RL: it is routinely paired with Trust Region Policy Optimization, the optimizer used in the original paper, and with Proximal Policy Optimization, which became a workhorse for aligning large language models through reinforcement learning from human feedback. For a general reader, it is an example of how a statistical estimation trick can be as important to progress as any new architecture.