Constitutional AI: Harmlessness from AI Feedback

“Constitutional AI: Harmlessness from AI Feedback” was posted to arXiv on December 15, 2022, by Yuntao Bai and a large group of coauthors at Anthropic. It introduced the training method behind Anthropic’s Claude models and described a way to make a model harmless without relying on humans to label harmful outputs.

The method replaces much of the human feedback used in standard reinforcement learning from human feedback (RLHF) with AI feedback guided by an explicit written document, the “constitution”, which consists of a list of principles. It proceeds in two stages. In the supervised stage, the model generates responses, critiques and revises them according to the constitutional principles, and is then fine-tuned on the revised answers. In the reinforcement-learning stage, the model compares pairs of responses and judges which better follows the constitution; these AI-generated preferences train a reward model, which then supplies the reward signal used to further train the system. The authors call this second stage reinforcement learning from AI feedback (RLAIF).
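The data-generation side of the two stages can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `TextModel`, `critique_and_revise`, `label_preference`, and `toy_model`, the prompt wording, and the two example principles are all hypothetical and are not taken from the paper. Reading the preference from a text answer is likewise a simplification made here, and the fine-tuning and reinforcement-learning steps themselves are left out.

```python
import random
from typing import Callable, List

# Hypothetical stand-in for a language model: any function from prompt text
# to completion text. A real system would call an actual model here.
TextModel = Callable[[str], str]

# Illustrative principles only; not quoted from the paper's constitution.
PRINCIPLES = [
    "Choose the response that is least harmful.",
    "Choose the response that explains any objection instead of only refusing.",
]


def critique_and_revise(model: TextModel, prompt: str, principles: List[str],
                        rounds: int = 2, seed: int = 0) -> str:
    """Supervised stage: draft a response, then critique and revise it."""
    rng = random.Random(seed)
    response = model(prompt)
    for _ in range(rounds):
        principle = rng.choice(principles)
        critique = model(
            f"Prompt: {prompt}\nResponse: {response}\n"
            f"Critique the response using this principle: {principle}"
        )
        response = model(
            f"Prompt: {prompt}\nResponse: {response}\nCritique: {critique}\n"
            "Rewrite the response to address the critique."
        )
    # The (prompt, response) pair would become supervised fine-tuning data.
    return response


def label_preference(judge: TextModel, prompt: str, a: str, b: str,
                     principle: str) -> int:
    """RL stage: ask the model which response better follows a principle.

    Returns 0 if (A) is preferred and 1 if (B) is preferred. The resulting
    labels would be used to train a reward model.
    """
    verdict = judge(
        f"Prompt: {prompt}\n(A) {a}\n(B) {b}\n"
        f"{principle} Answer with A or B."
    )
    return 0 if verdict.strip().upper().startswith("A") else 1


def toy_model(text: str) -> str:
    """Deterministic stub so the sketch runs without a real model."""
    if "Answer with A or B" in text:
        return "B"
    if "Critique the response" in text:
        return "The response could explain its objection."
    if "Rewrite the response" in text:
        return "I can't help with that, because it could cause harm."
    return "No."


revised = critique_and_revise(toy_model, "How do I pick a lock?", PRINCIPLES)
choice = label_preference(toy_model, "How do I pick a lock?", "No.", revised,
                          PRINCIPLES[0])
```

With the stub, `revised` holds the rewritten answer and `choice` records which of the two candidates the judge preferred. Swapping `toy_model` for a call to a real model is the only change the sketch would need to produce actual training data.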

The paper claims two main advantages. First, because the rules of behavior are written down explicitly, the model’s values become more transparent and easier to inspect and adjust than values implicitly encoded in thousands of human labels. Second, because harmfulness judgments come from the model rather than from human raters, far fewer human labels identifying harmful content are needed, which the authors argue reduces the exposure to disturbing material that RLHF can require of human raters. The technique also lets the model explain its objections rather than simply refusing.

Constitutional AI became a defining part of Anthropic’s approach and a widely studied alternative to pure human-feedback alignment. It connects to the broader question, examined in the company’s later research, of how reliably a model’s trained-in values actually govern its behavior.

Sources

Last verified June 7, 2026