Deep Reinforcement Learning from Human Preferences

“Deep reinforcement learning from human preferences” was submitted to arXiv on June 12, 2017 by Paul Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and Dario Amodei, in a collaboration between OpenAI and DeepMind. It is widely regarded as the technical origin of reinforcement learning from human feedback (RLHF), the method later used to train InstructGPT and ChatGPT.

The problem the paper attacks is goal specification. For many tasks it is hard or impossible to write down a reward function that captures what we actually want, and a misspecified reward leads an agent to game the metric. Instead of writing a reward function, the authors ask non-expert humans to watch pairs of short video clips of the agent’s behavior and pick the one that looks better. A reward model is trained to predict these preferences, and the agent is optimized against that learned reward with standard reinforcement learning while more human comparisons are gathered as training continues.
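The reward-model step described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes a linear reward over hand-made per-step feature vectors (the paper uses neural networks on raw observations), plain gradient descent, and hypothetical function names. What it does share with the method is the core idea that the probability of preferring one clip is a logistic function of the difference in summed predicted reward, trained with a cross-entropy loss against the human labels.

```python
import numpy as np


def segment_return(weights, segment):
    """Sum of predicted per-step rewards over one clip.

    segment: array of shape (T, D), one feature vector per timestep.
    weights: array of shape (D,), a hypothetical linear reward model.
    """
    return float(np.sum(segment @ weights))


def preference_probability(weights, seg_a, seg_b):
    """Predicted probability that a human prefers clip A over clip B."""
    diff = segment_return(weights, seg_a) - segment_return(weights, seg_b)
    diff = np.clip(diff, -50.0, 50.0)  # keep exp() numerically safe
    return 1.0 / (1.0 + np.exp(-diff))


def preference_loss(weights, comparisons):
    """Mean cross-entropy between predicted and human-labelled preferences.

    comparisons: list of (seg_a, seg_b, label) with label 1.0 if the human
    preferred A, 0.0 if they preferred B, and 0.5 if judged equal.
    """
    eps = 1e-12
    total = 0.0
    for seg_a, seg_b, label in comparisons:
        p = preference_probability(weights, seg_a, seg_b)
        total -= label * np.log(p + eps) + (1.0 - label) * np.log(1.0 - p + eps)
    return total / len(comparisons)


def fit_reward_model(comparisons, dim, lr=0.1, steps=500):
    """Fit the linear reward weights by gradient descent on the loss above."""
    weights = np.zeros(dim)
    for _ in range(steps):
        grad = np.zeros(dim)
        for seg_a, seg_b, label in comparisons:
            p = preference_probability(weights, seg_a, seg_b)
            # Gradient of the cross-entropy w.r.t. the weights of a
            # logistic model on the difference in summed features.
            grad += (p - label) * (seg_a.sum(axis=0) - seg_b.sum(axis=0))
        weights -= lr * grad / len(comparisons)
    return weights
```

In the full method the fitted reward would then replace the environment reward inside an ordinary reinforcement learning loop, with new comparisons collected and the reward model refitted as the agent’s behavior changes.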

The striking result was efficiency. The method solved complex tasks, including Atari games and simulated robot locomotion, while asking humans to label less than one percent of the agent’s interactions. It also taught behaviors the researchers had no clean reward for, such as a simulated robot doing a backflip. Roughly an hour of human time could teach behaviors that would be very hard to script.

For business readers, this paper is the conceptual ancestor of every modern chat assistant. The core move, letting humans judge outputs and training a model to imitate that judgment, is still central to how today’s large language models are tuned to be helpful and to follow instructions.
