Persona Vectors: Monitoring and Controlling Character Traits in Language Models

In an August 2025 research release, Anthropic introduced “persona vectors”: directions in a model’s activation space that correspond to character traits such as being evil, sycophantic, or prone to making things up (hallucinating). A persona vector is the pattern of internal activity that, when present, pushes the model toward that trait.
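To make "a direction in activation space" concrete, here is a minimal sketch that assumes a difference-of-means recipe: average the hidden activations recorded while the model exhibits a trait, average those recorded while it does not, and subtract. The function names and the toy 4-dimensional activations are hypothetical, chosen only for illustration; real activations have thousands of dimensions and would come from a specific layer of the model.

```python
from math import sqrt

def mean_vector(rows):
    """Element-wise mean of a list of equal-length activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def persona_vector(trait_acts, neutral_acts):
    """Unit-length difference of means (an assumed extraction recipe)."""
    mu_trait = mean_vector(trait_acts)
    mu_neutral = mean_vector(neutral_acts)
    diff = [a - b for a, b in zip(mu_trait, mu_neutral)]
    norm = sqrt(sum(d * d for d in diff))
    if norm == 0:
        raise ValueError("the two activation sets have identical means")
    return [d / norm for d in diff]

# Toy data: activations from trait-exhibiting vs. neutral responses (hypothetical).
trait_acts = [[2.0, 0.0, 1.0, 0.0], [4.0, 0.0, 1.0, 0.0]]
neutral_acts = [[0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 1.0, 0.0]]

v = persona_vector(trait_acts, neutral_acts)
print(v)
```

In this toy case the two sets differ only along the first coordinate, so the resulting unit vector points along that axis.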

Anthropic showed three practical uses. First, monitoring: by measuring how strongly a persona vector is active, you can detect personality shifts during deployment or during training, catching cases where a model starts drifting toward an undesirable character. Second, mitigation through “preventative steering”: the model is deliberately steered along the undesirable direction during training, which relieves the pressure on its weights to shift that way, so it does not acquire the bad trait in the first place, reportedly without degrading general capability. Third, data flagging: persona vectors can identify training examples that would push the model toward negative traits before that data is used.
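The three uses share one primitive: projecting an activation onto the persona vector, or adding a multiple of the vector to it. The sketch below is a simplified illustration, not Anthropic's implementation. The function names, the threshold, and the toy 3-dimensional activations are all hypothetical. It also assumes the persona vector is unit-length, and that one activation per training example is enough to flag data.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def trait_score(activation, persona_vec):
    """Monitoring: projection of an activation onto a unit persona vector."""
    return dot(activation, persona_vec)

def steer(activation, persona_vec, alpha):
    """Steering: shift an activation by alpha along the persona vector."""
    return [a + alpha * p for a, p in zip(activation, persona_vec)]

def flag_examples(example_acts, persona_vec, threshold):
    """Data flagging: indices of examples whose projection exceeds a threshold."""
    return [i for i, act in enumerate(example_acts)
            if trait_score(act, persona_vec) > threshold]

# Hypothetical unit persona vector and per-example activations.
persona_vec = [1.0, 0.0, 0.0]
acts = [[0.2, 1.0, -0.5], [2.5, 0.1, 0.3], [-1.0, 0.4, 0.0]]

scores = [trait_score(a, persona_vec) for a in acts]
flagged = flag_examples(acts, persona_vec, threshold=1.0)
steered = steer(acts[0], persona_vec, alpha=2.0)
print(scores, flagged, steered)
```

Only the second example projects strongly onto the trait direction, so it is the only one flagged. The `steer` function shows just the vector addition; applying it inside a training loop, as preventative steering does, is outside the scope of this sketch.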

The work connects to high-profile incidents of unwanted model personalities, from Microsoft’s Bing “Sydney” outbursts to subtler problems like excessive flattery, and it builds on the same activation-space techniques as refusal-direction and emergent-misalignment research. It turns “the model became weird after an update” into something measurable and adjustable.

For businesses, persona vectors offer a route to keeping a deployed assistant on-brand and trustworthy: tone or behavior drift can be detected and corrected at the level of the model’s internals rather than only through output filtering.

