“Asynchronous Methods for Deep Reinforcement Learning,” posted to arXiv on February 4, 2016 and presented at ICML 2016 by Volodymyr Mnih and colleagues at DeepMind, introduced a family of parallel training methods, the best known of which is Asynchronous Advantage Actor-Critic, or A3C. Mnih had earlier led the Deep Q-Network work, and A3C tackled one of its practical weaknesses.
The Deep Q-Network relied on a large experience replay buffer to decorrelate its training data and stabilize learning. Replay is memory-hungry, and because it reuses data generated by older versions of the policy, it restricts the algorithm to off-policy methods. A3C took a different route: it runs many agents in parallel, each exploring its own copy of the environment, and has them asynchronously update a shared set of network weights. Because the parallel agents encounter different situations at any given moment, their combined updates are naturally decorrelated, which stabilizes training without a replay buffer.
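The idea of several actor-learners updating one shared set of parameters can be illustrated with a deliberately tiny sketch. This is not the paper's implementation: it assumes a hypothetical two-armed bandit instead of Atari, a table of two logits and one scalar baseline instead of a neural network, and plain gradient steps. It also leaves out parts of the real method such as multi-step returns and entropy regularization. The names `ARM_REWARD_PROB`, `worker`, and `train` are invented for illustration.

```python
import threading

import numpy as np

# Hypothetical two-armed bandit: probability that each arm pays a reward of 1.
ARM_REWARD_PROB = np.array([0.2, 0.8])


def softmax(logits):
    """Numerically stable softmax over a 1-D array."""
    z = np.exp(logits - logits.max())
    return z / z.sum()


def worker(shared, seed, steps, lr):
    """One actor-learner: acts in its own environment copy, updates shared weights."""
    rng = np.random.default_rng(seed)  # private randomness stands in for a private environment
    for _ in range(steps):
        # Read a snapshot of the shared parameters (it may already be slightly stale).
        probs = softmax(shared["logits"].copy())
        value = shared["value"][0]

        action = rng.choice(len(probs), p=probs)
        reward = float(rng.random() < ARM_REWARD_PROB[action])

        # One-step advantage: reward minus the critic's baseline estimate.
        advantage = reward - value

        # Gradient of log pi(action) with respect to the logits: one_hot(action) - probs.
        grad_log_pi = -probs
        grad_log_pi[action] += 1.0

        # Asynchronous updates applied in place to the shared arrays, without a lock.
        shared["logits"] += lr * advantage * grad_log_pi  # actor step
        shared["value"] += lr * advantage                 # critic step


def train(n_workers=4, steps=2000, lr=0.1):
    """Run the workers in parallel threads and return the final action probabilities."""
    shared = {"logits": np.zeros(2), "value": np.zeros(1)}
    threads = [
        threading.Thread(target=worker, args=(shared, seed, steps, lr))
        for seed in range(n_workers)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return softmax(shared["logits"])


if __name__ == "__main__":
    print(train())
```

Each worker sees a different stream of actions and rewards, so the updates arriving at the shared parameters come from different experiences at any given moment, and no transitions are ever stored. Calling `train()` should return a probability vector that strongly favors the second, better-paying arm, though the exact numbers vary from run to run because thread scheduling is not deterministic.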
The practical payoff was striking. A3C surpassed the previous state of the art on the Atari benchmark while training for half the time, and it did so on a single multi-core CPU instead of a GPU. The authors also showed that it works on continuous motor control problems and on navigating random 3D mazes from visual input. The method made deep reinforcement learning cheaper and more accessible, and it became a widely used baseline.