Refusal in Language Models Is Mediated by a Single Direction

This 2024 paper by Andy Arditi, Neel Nanda, and colleagues showed that a chat model’s decision to refuse a harmful request is governed, to a surprising degree, by a single direction in its internal activation space. The authors found this one-dimensional “refusal direction” consistently across 13 popular open-source chat models.
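The summary above does not say how such a direction is located. As a minimal sketch, and purely as an assumption for illustration, one simple estimator is the difference between the mean activation on harmful prompts and the mean activation on harmless prompts at some fixed layer. The function name `estimate_direction` and the synthetic data are hypothetical; real activations would come from a model's residual stream.

```python
import numpy as np

def estimate_direction(harmful_acts, harmless_acts):
    """Unit vector pointing from the harmless mean to the harmful mean.

    Both inputs have shape (n_prompts, d_model). This is an illustrative
    difference-of-means estimator, assumed here, not taken from the text.
    """
    diff = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
    return diff / np.linalg.norm(diff)

# Synthetic stand-in for activations: "harmful" prompts are shifted
# along one hidden axis, everything else is Gaussian noise.
rng = np.random.default_rng(0)
d_model = 16
true_dir = np.zeros(d_model)
true_dir[0] = 1.0

harmless = rng.normal(size=(200, d_model))
harmful = rng.normal(size=(200, d_model)) + 3.0 * true_dir

r_hat = estimate_direction(harmful, harmless)
cosine = float(r_hat @ true_dir)  # close to 1 when the shift is recovered
```

On this toy data the estimated vector lines up closely with the planted axis, which is the sense in which a behavior can be said to live along one direction.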

The evidence is causal, not just correlational. Removing the component of the model’s activations that lies along the refusal direction strips out refusals, so the model complies with requests it would normally reject. Adding the direction to the activations, or amplifying it, makes the model refuse even harmless instructions. Because the intervention is a simple linear edit rather than retraining, the authors used it to build a lightweight white-box jailbreak, and they showed that known adversarial suffixes work in part by suppressing this same direction.
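The two interventions described above can be sketched in a few lines of NumPy. This is a toy illustration under stated assumptions, not the authors' code: the helper names `ablate_direction` and `add_direction`, the `scale` parameter, and the random arrays standing in for model activations are all hypothetical.

```python
import numpy as np

def ablate_direction(acts, direction):
    """Remove the component of each activation along `direction`.

    acts: array of shape (n_tokens, d_model); direction: shape (d_model,).
    """
    r_hat = direction / np.linalg.norm(direction)
    coeffs = acts @ r_hat                  # projection coefficient per row
    return acts - np.outer(coeffs, r_hat)  # subtract the projection

def add_direction(acts, direction, scale=1.0):
    """Shift every activation along `direction` by `scale` (hypothetical knob)."""
    return acts + scale * direction

# Random stand-ins for residual-stream activations and a refusal direction.
rng = np.random.default_rng(0)
acts = rng.normal(size=(4, 8))
r = rng.normal(size=8)

ablated = ablate_direction(acts, r)        # "refusal removed"
steered = add_direction(acts, r, scale=2.0)  # "refusal induced"

# After ablation, nothing is left along the direction.
residual = ablated @ (r / np.linalg.norm(r))
```

After ablation the activations have zero projection onto the direction, while everything orthogonal to it is untouched. That is why an edit of this kind can remove one behavior without obviously disturbing the rest of the representation.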

The result is important on two fronts. Scientifically, it is a clean example of the “linear representation” idea: a high-level behavior encoded along a single interpretable axis. Practically, it is a warning. Any safety behavior that lives in one removable direction is fragile against anyone with access to model internals, which is the situation for every open-weights release.

For businesses deploying open models, the takeaway is that fine-tuned safety guardrails can be cheaply undone by a motivated party, so safety cannot rest on the model alone and needs system-level defenses around it.
