Conservative Q-Learning was introduced in “Conservative Q-Learning for Offline Reinforcement Learning,” posted to arXiv on June 8, 2020 by Aviral Kumar, Aurick Zhou, George Tucker, and Sergey Levine at UC Berkeley and Google. It targets offline reinforcement learning, the setting where an agent must learn purely from a fixed dataset of logged experience without any further interaction with the environment.
The central problem in that setting is distributional shift between the policy that collected the data and the policy being learned. A learned Q-function can assign erroneously high values to actions that never appear in the data, and the policy then exploits these overestimates, producing behavior that looks great according to the learned values but fails in the real environment. CQL addresses this by adding a regularizer to the standard Bellman error objective that pushes down the estimated values of out-of-distribution actions while pushing up the values of actions seen in the dataset, so that the expected value of the policy under the learned Q-function lower-bounds its true value. The authors reported that CQL often attains two to five times higher final return than prior offline methods on discrete and continuous control benchmarks.
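To make the regularizer concrete, here is a minimal sketch of one common form of the CQL loss for a discrete action space, written with NumPy. It assumes the log-sum-exp variant of the penalty: for each state, the log-sum-exp of Q over all actions minus the Q-value of the action actually logged in the dataset, added to half the squared Bellman error. The function name `cql_loss`, the argument names, and the weight `alpha` are illustrative choices, not taken from the authors' code, and the TD targets are assumed to be computed elsewhere.

```python
import numpy as np

def cql_loss(q_values, actions, td_targets, alpha=1.0):
    """Sketch of a CQL-style loss for discrete actions (illustrative, not the authors' code).

    q_values:   (batch, n_actions) Q-estimates for every action in each logged state
    actions:    (batch,) integer actions actually taken in the dataset
    td_targets: (batch,) Bellman targets, assumed precomputed (e.g. r + gamma * next value)
    alpha:      hypothetical weight on the conservative penalty
    """
    q_values = np.asarray(q_values, dtype=float)
    actions = np.asarray(actions, dtype=int)
    td_targets = np.asarray(td_targets, dtype=float)

    # Q-values of the actions that appear in the data.
    rows = np.arange(len(actions))
    q_data = q_values[rows, actions]

    # Numerically stable log-sum-exp over all actions: a soft maximum that is
    # dominated by whichever actions currently have the highest estimates.
    q_max = q_values.max(axis=1)
    lse = q_max + np.log(np.exp(q_values - q_max[:, None]).sum(axis=1))

    # Conservative penalty: push down high values everywhere, push up dataset actions.
    penalty = (lse - q_data).mean()

    # Standard squared Bellman error on the logged transitions.
    bellman = 0.5 * ((q_data - td_targets) ** 2).mean()

    return alpha * penalty + bellman, penalty, bellman


# Usage: an inflated estimate for an unseen action (index 1) raises the penalty.
total, penalty, bellman = cql_loss(
    q_values=[[0.0, 10.0, 0.0]], actions=[0], td_targets=[0.0]
)
print(total, penalty, bellman)
```

Because the log-sum-exp is never smaller than any single Q-value, the penalty is non-negative, and it grows when an action outside the data receives a much higher estimate than the logged action. Minimizing it therefore discourages exactly the overestimates described above, with `alpha` trading off conservatism against fitting the Bellman targets.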
Offline RL matters commercially because in many real domains, such as healthcare or recommender systems, online trial-and-error is too risky or expensive, and historical logs are all that is available. For a general reader, CQL is a representative answer to the question of how to learn good decisions safely from data you already have.