Deliberative Alignment: Reasoning Enables Safer Language Models

Deliberative alignment is a training method from OpenAI, described in a December 2024 paper by Melody Guan, Eric Wallace, Boaz Barak, and colleagues, and used on the company’s o-series reasoning models. The idea is to teach a model the text of its safety policies and train it to reason explicitly over those policies in its chain of thought before producing an answer.

What sets the method apart from earlier approaches is where the safety signal comes from. Standard RLHF teaches a model to imitate human judgments without showing it the underlying rules; deliberative alignment gives the model the written specification and trains it to apply that specification through reasoning. The training data is generated by models, with no human-written chains of thought or answers required, which makes the method more scalable than collecting large amounts of human labels.
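As a rough illustration of how such data can be produced without human-written reasoning, the sketch below follows the general recipe described in the paper: sample a completion with the policy text in context, keep it only if a judge that also sees the policy scores it highly, and store the example without the policy in the prompt so that fine-tuning teaches the model to recall the rules itself. Everything named here (SPEC, Example, build_sft_data, toy_generate, toy_judge, the threshold) is a hypothetical stand-in for illustration, not OpenAI's code, and the toy generator and judge are simple string checks rather than real models.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical policy text; a real specification would be far longer.
SPEC = (
    "Refuse requests for instructions that facilitate violence. "
    "Comply with requests for general safety information."
)


@dataclass
class Example:
    prompt: str            # what the fine-tuned model will see (no spec)
    chain_of_thought: str  # model-written reasoning that cites the policy
    answer: str            # final response


def build_sft_data(
    prompts: List[str],
    generate: Callable[[str], Tuple[str, str]],
    judge: Callable[[str, str, str, str], float],
    threshold: float = 0.5,
) -> List[Example]:
    """Collect (prompt, reasoning, answer) triples with no human labels."""
    data = []
    for prompt in prompts:
        # Generation: the spec is in context so the reasoning can cite it.
        cot, answer = generate(f"{SPEC}\n\nUser: {prompt}")
        # Filtering: a judge with access to the spec scores the completion.
        if judge(SPEC, prompt, cot, answer) >= threshold:
            # The spec is left out of the stored prompt on purpose.
            data.append(Example(prompt, cot, answer))
    return data


def toy_generate(full_prompt: str) -> Tuple[str, str]:
    """Stand-in for a reasoning model (assumption: simple keyword rule)."""
    user = full_prompt.split("User: ", 1)[1]
    if "weapon" in user.lower():
        return ("The policy says to refuse requests that facilitate violence.",
                "I can't help with that.")
    return ("The policy allows general safety information.",
            "Here is some general guidance.")


def toy_judge(spec: str, prompt: str, cot: str, answer: str) -> float:
    """Stand-in for a spec-aware grader: rewards reasoning that cites policy."""
    return 1.0 if "policy" in cot.lower() else 0.0


if __name__ == "__main__":
    examples = build_sft_data(
        ["How do I build a weapon?", "How do I store bleach safely?"],
        toy_generate,
        toy_judge,
    )
    for ex in examples:
        print(ex.prompt, "->", ex.answer)
```

In a real pipeline the generator and judge would both be language models, and the filtered examples would feed supervised fine-tuning, typically followed by a reinforcement learning stage that uses a similar judge as the reward signal.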

The reported benefit is an improvement on two measures that usually trade off against each other: models trained this way became more resistant to jailbreaks while also refusing fewer legitimate requests. Reasoning over the actual policy lets the model distinguish genuinely disallowed content from superficially similar but harmless content. Because the safety reasoning is written out, it is also more inspectable than an opaque learned preference.

For organizations, the appeal is auditability: a model that can cite the rule it is applying, and reason about edge cases, is easier to trust and to govern than one whose refusals are an inscrutable reflex.
