IMPALA was introduced in “IMPALA: Scalable Distributed Deep-RL with Importance Weighted Actor-Learner Architectures,” posted to arXiv on February 5, 2018 by Lasse Espeholt, Hubert Soyer, Remi Munos, Karen Simonyan, Volodymir Mnih, and collaborators at DeepMind. It addresses how to scale reinforcement learning across many machines without sacrificing learning stability.
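The architecture decouples the actors that generate experience from the learner that updates the policy. Many actors run in parallel, each stepping through its own copy of the environment and sending trajectories of experience (observations, actions, rewards, and the probabilities its policy assigned to those actions) to a central learner. Because each actor acts with a copy of the policy that lags a few updates behind the learner's, its data is off-policy, and learning from it naively can bias the updates. IMPALA's central contribution is V-trace, an off-policy actor-critic correction that reweights these trajectories with truncated importance-sampling ratios between the learner's policy and the actor's policy, compensating for the lag. The result was stable learning at high throughput: IMPALA trained single multi-task agents on all 57 Atari games and on the 30 tasks of DeepMind Lab (DMLab-30), with better data efficiency than prior agents.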
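To make the correction concrete, here is a minimal plain-Python sketch of how V-trace value targets could be computed for one trajectory. It assumes the trajectory contains no episode boundaries and that the log-probabilities of the taken actions under both the actor's policy μ and the learner's policy π are already available. The function and parameter names (`vtrace_targets`, `rho_bar`, `c_bar`) are hypothetical and not taken from any library. The sketch follows the backward recursion v_s = V(x_s) + δ_s + γ c_s (v_{s+1} - V(x_{s+1})), where δ_s = ρ_s (r_s + γ V(x_{s+1}) - V(x_s)), and ρ_s and c_s are the ratios π/μ clipped at ρ̄ and c̄.

```python
import math
from typing import List, Tuple


def vtrace_targets(
    behaviour_logp: List[float],  # log mu(a_t | x_t): actor's (stale) policy
    target_logp: List[float],     # log pi(a_t | x_t): learner's current policy
    rewards: List[float],         # r_t for t = 0..n-1
    values: List[float],          # V(x_t) from the learner's critic, t = 0..n-1
    bootstrap_value: float,       # V(x_n), value of the state after the last step
    gamma: float = 0.99,
    rho_bar: float = 1.0,         # clipping level for rho_t (hypothetical default)
    c_bar: float = 1.0,           # clipping level for c_t (hypothetical default)
) -> Tuple[List[float], List[float]]:
    """Sketch: return (v-trace targets, policy-gradient advantages).

    Assumes a single trajectory with no terminal states inside it.
    """
    n = len(rewards)
    # Importance ratios pi/mu, computed from log-probabilities.
    ratios = [math.exp(t - b) for t, b in zip(target_logp, behaviour_logp)]
    rhos = [min(rho_bar, r) for r in ratios]  # truncated weights for TD errors
    cs = [min(c_bar, r) for r in ratios]      # truncated "trace" coefficients
    next_values = values[1:] + [bootstrap_value]

    # Backward pass: acc holds v_{s+1} - V(x_{s+1}), which is 0 past the end.
    vs = [0.0] * n
    acc = 0.0
    for s in reversed(range(n)):
        delta = rhos[s] * (rewards[s] + gamma * next_values[s] - values[s])
        acc = delta + gamma * cs[s] * acc
        vs[s] = values[s] + acc

    # Advantage for the policy gradient: rho_s * (r_s + gamma * v_{s+1} - V(x_s)).
    next_vs = vs[1:] + [bootstrap_value]
    pg_adv = [rhos[s] * (rewards[s] + gamma * next_vs[s] - values[s]) for s in range(n)]
    return vs, pg_adv


# On-policy check: identical log-probs make every ratio 1.
vs, adv = vtrace_targets([0.0] * 3, [0.0] * 3, [1.0] * 3, [0.0] * 3, 0.0, gamma=0.5)
print(vs)  # [1.75, 1.5, 1.0]
```

When the two policies agree, every ratio is 1 and the targets reduce to ordinary bootstrapped n-step discounted returns, as the usage line shows. When the actor's policy has drifted, clipping the ratios at ρ̄ and c̄ keeps the importance weights bounded, which limits the variance that uncorrected importance sampling would introduce.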
IMPALA became a template for large-scale RL systems, and its V-trace correction was reused in later game-playing efforts such as DeepMind's AlphaStar for StarCraft II. For a business reader, it is an early answer to a practical infrastructure question: how to keep a learning system correct while spreading its work across a fleet of machines.