Constitutional Classifiers: Defending Against Universal Jailbreaks

Announced by Anthropic in February 2025, constitutional classifiers are a system-level defense against jailbreaks: prompts crafted to trick a model into producing content it should refuse. Rather than relying only on the model’s own trained-in caution, the approach adds separate classifier models that screen what goes into the model and what comes out.

The classifiers are trained on synthetic data generated from a “constitution,” a written set of rules describing which categories of content are allowed and which are disallowed. Because the training data is synthetic and rule-driven, the defense can be updated quickly as new threats appear and can be tuned to specific harm categories. Input classifiers catch malicious prompts; output classifiers catch harmful generations that slip through. The output classifier can also evaluate a streamed response as it is generated and stop it partway through.
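The two screening stages described above can be illustrated with a minimal Python sketch. It is an illustration under stated assumptions, not Anthropic's implementation: `toy_classifier`, `guarded_stream`, `fake_model`, the 0.5 threshold, and the keyword list are all hypothetical, and a simple keyword match stands in for classifiers that would in practice be trained models.

```python
from typing import Callable, Iterable, Iterator

# Hypothetical stand-in for a constitution. A real constitution is a set of
# natural-language rules used to generate synthetic training data, not a
# keyword list.
BLOCKED_TERMS = {"forbidden topic"}

REFUSAL = "[refused]"
HALTED = "[halted]"


def toy_classifier(text: str) -> float:
    """Return a harm score in [0, 1]; a toy stand-in for a trained classifier."""
    lowered = text.lower()
    return 1.0 if any(term in lowered for term in BLOCKED_TERMS) else 0.0


def guarded_stream(
    prompt: str,
    model: Callable[[str], Iterable[str]],
    threshold: float = 0.5,
) -> Iterator[str]:
    """Screen the prompt, then screen the growing output as tokens stream."""
    # Input classifier: refuse before the model is ever called.
    if toy_classifier(prompt) >= threshold:
        yield REFUSAL
        return

    emitted = ""
    for token in model(prompt):
        candidate = emitted + token
        # Output classifier: score the output so far, including the new token,
        # and stop the stream before the offending token is released.
        if toy_classifier(candidate) >= threshold:
            yield HALTED
            return
        emitted = candidate
        yield token


def fake_model(prompt: str) -> Iterable[str]:
    """Hypothetical model that streams a fixed response token by token."""
    return ["Sure, ", "the ", "forbidden ", "topic ", "works ", "like ..."]


blocked_at_input = list(guarded_stream("Explain the forbidden topic", fake_model))
halted_mid_stream = list(guarded_stream("Tell me something", fake_model))
```

In this sketch the first call is refused by the input check alone, while the second passes the input check and is cut off partway through the stream once the accumulated output crosses the threshold. Keeping the classifiers outside the model is what would let them be retrained or swapped without touching the model itself.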

Anthropic stress-tested the system with human red teaming, including a public competition offering prizes. In its automated evaluations, it reported that the classifiers reduced jailbreak success to 4.4 percent, meaning over 95 percent of attempts were refused, while adding only a small increase in refusals of legitimate queries and a modest extra compute cost. The full results were also published as an arXiv paper.

For organizations deploying AI, this illustrates the defense-in-depth principle: a robust safety posture does not depend on the model alone but layers external, independently updatable guardrails around it. This matters especially given how cheaply model-only safety can be bypassed.
