TD3 was introduced in “Addressing Function Approximation Error in Actor-Critic Methods,” posted to arXiv on February 26, 2018 by Scott Fujimoto, Herke van Hoof, and David Meger, and presented at ICML 2018. The paper diagnoses a specific failure mode in continuous-control actor-critic methods such as DDPG: the critic systematically overestimates action values, and those errors accumulate into unstable, poorly performing policies.
TD3, short for Twin Delayed Deep Deterministic policy gradient, addresses this with three mechanisms. First, it trains two critics and uses the smaller of their two estimates when forming the learning target, which limits overestimation. Second, it updates the policy (and the target networks) less often than the critics, so that each policy update is based on a value estimate with lower error. Third, it smooths the target by adding a small amount of clipped noise to the action used to compute it, so that the policy cannot exploit narrow, erroneous peaks in the value estimate. Together these changes made the method substantially more reliable than DDPG, its predecessor.
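The three mechanisms can be sketched for a single transition in plain Python. This is an illustrative sketch, not the authors' implementation: the names `td3_target`, `should_update_actor`, `target_actor`, `target_q1`, and `target_q2` are hypothetical, the networks are stood in for by ordinary callables on scalars, and the default values (discount 0.99, noise scale 0.2, noise clip 0.5, policy delay 2) are commonly used settings treated here as assumptions.

```python
import random


def clip(x, low, high):
    """Clamp x to the closed interval [low, high]."""
    return max(low, min(high, x))


def td3_target(reward, next_state, done, target_actor, target_q1, target_q2,
               gamma=0.99, noise_std=0.2, noise_clip=0.5, max_action=1.0,
               rng=random):
    """Compute a TD3-style learning target for one transition.

    target_actor, target_q1 and target_q2 are hypothetical callables that
    stand in for the target networks (assumption for this sketch).
    """
    # Target smoothing: perturb the target action with clipped Gaussian noise.
    noise = clip(rng.gauss(0.0, noise_std), -noise_clip, noise_clip)
    next_action = clip(target_actor(next_state) + noise, -max_action, max_action)

    # Twin critics: use the smaller estimate to limit overestimation.
    q_min = min(target_q1(next_state, next_action),
                target_q2(next_state, next_action))

    # Standard bootstrapped target; no bootstrapping past a terminal state.
    return reward + gamma * (1.0 - float(done)) * q_min


def should_update_actor(step, policy_delay=2):
    """Delayed updates: the actor is updated once every policy_delay critic steps."""
    return step % policy_delay == 0
```

In a training loop built on this sketch, both critics would be regressed toward the same `td3_target` value on every step, while the actor and the target networks would be updated only on steps where `should_update_actor` returns true.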
TD3 became one of the standard baselines for continuous control alongside Soft Actor-Critic. For a general reader, it is a clean example of how progress in machine learning often comes not from a bigger model but from identifying and correcting a subtle statistical bias in how the system learns.