Q-learning was introduced by Christopher Watkins in his 1989 Cambridge PhD thesis and then formalized in “Q-learning,” a 1992 paper by Watkins and Peter Dayan published in Machine Learning (Vol. 8, pages 279-292). The algorithm is one of the most widely used methods in reinforcement learning because it is both simple and model-free: an agent can learn how to act optimally without ever building a model of how its environment works.
The method learns a function, written Q, that estimates the long-run value of taking a given action in a given state and then behaving optimally afterward. The agent repeatedly tries actions, observes the reward and the next state, and nudges its Q estimate toward the reward plus the best estimated value available next. Crucially, the update uses the best next action rather than the action actually taken, which is what lets the agent learn the optimal policy even while exploring with suboptimal moves.
The 1992 paper’s main contribution was a detailed convergence proof. Watkins and Dayan showed that, in a finite Markov decision process, Q-learning converges to the correct optimal action values with probability one, provided every state-action pair continues to be tried and the values are stored discretely. That guarantee made Q-learning a trusted building block, and its deep-network successor, the Deep Q-Network, brought it to fame by learning to play Atari games from raw pixels.