Learning to Predict by the Methods of Temporal Differences

Richard Sutton published “Learning to Predict by the Methods of Temporal Differences” in the journal Machine Learning in 1988 (Vol. 3, pages 9-44). The paper introduced a class of incremental learning procedures, which Sutton called temporal-difference (TD) methods, for the problem of using past experience to predict future outcomes.

The central idea is what sets TD methods apart from conventional supervised prediction. Conventional methods adjust a prediction by comparing it against the eventual actual outcome. TD methods instead assign credit by comparing each prediction to the next prediction in time, learning from the difference between temporally successive estimates rather than waiting for a final result. Sutton argued that for most real-world prediction problems this approach needs less memory and less peak computation than conventional methods, and that it produces more accurate predictions.
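The difference between the two kinds of update can be made concrete with a small sketch. The Python below is an illustration under stated assumptions, not code from the paper: it applies the simplest table-lookup form of a TD update to a short random walk with absorbing ends, where the outcome is 1 if the walk exits on the right and 0 if it exits on the left. The function name td0_random_walk, the step size alpha, the episode count, and the initial value of 0.5 are all hypothetical choices made for this example.

```python
import random


def td0_random_walk(episodes=5000, alpha=0.05, n_states=5, seed=0):
    """Illustrative tabular TD update on a bounded random walk.

    Returns one prediction per non-terminal state: the estimated
    probability that a walk starting there exits on the right.
    """
    rng = random.Random(seed)
    # One prediction per non-terminal state (assumed initial value 0.5).
    values = [0.5] * n_states

    for _ in range(episodes):
        state = n_states // 2  # each walk starts in the middle
        while True:
            next_state = state + rng.choice((-1, 1))
            if next_state < 0:
                target = 0.0  # walk ended on the left: outcome 0
                done = True
            elif next_state >= n_states:
                target = 1.0  # walk ended on the right: outcome 1
                done = True
            else:
                # The TD step: the target is the *next prediction*,
                # not the final outcome, which is still unknown here.
                target = values[next_state]
                done = False
            # Move the current prediction toward the target.
            values[state] += alpha * (target - values[state])
            if done:
                break
            state = next_state
    return values


if __name__ == "__main__":
    for i, v in enumerate(td0_random_walk(), start=1):
        print(f"state {i}: predicted {v:.2f}, exact {i / 6:.2f}")
```

Each update happens immediately after a single step and uses only the current and next predictions, so nothing about the walk has to be stored until it ends. A conventional outcome-based learner would instead record every state visited and update them all once the final result is known. For this five-state walk the exact answers are 1/6 through 5/6, which gives a simple check on the estimates.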

Temporal-difference learning became one of the load-bearing ideas of reinforcement learning. It is the mechanism by which an agent can update its estimate of how good a situation is before it sees the final reward, which is what makes learning from delayed feedback practical. The method runs through Q-learning, through Gerald Tesauro’s TD-Gammon, and through DeepMind’s Deep Q-Network, and it was a key contribution cited when Sutton and Andrew Barto received the 2024 Turing Award.
