Direct Preference Optimization (DPO)

“Direct Preference Optimization: Your Language Model is Secretly a Reward Model” was submitted to arXiv on May 29, 2023 by Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, and Chelsea Finn of Stanford. It proposed a much simpler way to reach the same goal as the RLHF pipeline behind InstructGPT: aligning a language model with human preferences.

Classic RLHF has three moving parts: collect human preference rankings, train a separate reward model to predict them, then run reinforcement learning (typically PPO) to push the language model toward higher reward. The reinforcement-learning stage is finicky: it requires sampling from the model during training, careful hyperparameter tuning, and a KL penalty against a reference model to keep the policy from drifting too far. DPO’s insight is mathematical: under the usual KL-regularized RLHF objective, the optimal policy has a closed-form relationship to the reward, so the reward can be rewritten in terms of the policy itself and the separate reward model can be eliminated. What remains is a simple classification-style loss computed directly on pairs of preferred and dispreferred responses. DPO still keeps a frozen reference model, but only to compute log-probability ratios inside that loss.
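The derivation summarized above can be sketched in a few lines. The notation is introduced here for illustration and does not appear in the text: \pi_\theta is the policy being trained, \pi_{\mathrm{ref}} the frozen reference model, r the reward, \beta the strength of the KL penalty, Z(x) a normalizing constant, \sigma the logistic function, and (x, y_w, y_l) a prompt with its preferred and dispreferred responses. Preferences are assumed to follow a Bradley-Terry model.

```latex
% KL-regularized RLHF objective
\max_{\pi_\theta}\;
  \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)}\bigl[ r(x, y) \bigr]
  - \beta\, \mathbb{D}_{\mathrm{KL}}\bigl[ \pi_\theta(y \mid x) \,\|\, \pi_{\mathrm{ref}}(y \mid x) \bigr]

% Its optimum has a closed form
\pi_r(y \mid x) = \frac{1}{Z(x)}\, \pi_{\mathrm{ref}}(y \mid x)\,
  \exp\!\Bigl( \tfrac{1}{\beta}\, r(x, y) \Bigr)

% Solving for the reward
r(x, y) = \beta \log \frac{\pi_r(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} + \beta \log Z(x)

% Bradley-Terry preference model: only reward differences matter,
% so the \beta \log Z(x) term cancels
p(y_w \succ y_l \mid x) = \sigma\bigl( r(x, y_w) - r(x, y_l) \bigr)

% Resulting DPO loss on preference pairs
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) =
  -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}
  \left[ \log \sigma\!\left(
      \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
    - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
  \right) \right]
```

Because the intractable Z(x) depends only on the prompt, it drops out of the difference between the two responses, which is what allows the loss to be computed without ever fitting an explicit reward model.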

The practical payoff is that DPO trains a model to match human preferences without a separate reward model and without any reinforcement learning or online sampling, which makes it more stable, far easier to implement, and computationally lighter. The authors reported that it exceeded PPO-based RLHF at controlling sentiment and matched or improved on it in summarization and single-turn dialogue.
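To make the "far easier to implement" claim concrete, here is a minimal sketch of the per-pair DPO loss in plain Python. It assumes the summed token log-probabilities of each response have already been computed under both the policy and the frozen reference model. The function name `dpo_loss`, its argument names, the default `beta=0.1`, and the example numbers are all illustrative assumptions, not taken from the paper's code.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for a single preference pair (hypothetical helper).

    Each argument is the log-probability of a whole response, i.e. the
    sum of its token log-probabilities. beta=0.1 is an assumed default.
    """
    # Implicit rewards: beta times the log-ratio of policy to reference.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(margin)) equals softplus(-margin); this form avoids overflow.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Policy identical to the reference: the margin is zero, so the loss is log(2).
untrained = dpo_loss(-12.0, -15.0, -12.0, -15.0)

# Policy has raised the preferred response and lowered the dispreferred one.
improved = dpo_loss(-10.0, -18.0, -12.0, -15.0)

print(untrained, improved)
```

In an actual training loop the same expression would be applied to batched tensors in an automatic-differentiation framework and averaged over the batch. Gradients flow only through the policy log-probabilities, since the reference model stays frozen.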

DPO and its many descendants became the default preference-tuning methods for a large fraction of open-weight models, precisely because they removed the most fragile and expensive stage of the alignment pipeline. DPO is a clean example of a better derivation, rather than more compute, simplifying a whole subfield.

Sources

Last verified June 7, 2026