Activation Addition: Steering Language Models at Inference Time

“Activation Addition: Steering Language Models Without Optimization” (later titled “Steering Language Models With Activation Engineering”) was submitted to arXiv on August 20, 2023 by Alexander Matt Turner, Lisa Thiergart, Monte MacDiarmid, and colleagues. It demonstrates a lightweight way to change a model’s behavior at the moment it runs, without any fine-tuning.

The technique, called Activation Addition or ActAdd, derives a steering vector from a single pair of contrasting prompts. The model is run on, for example, a prompt containing “Love” and one containing “Hate”, and the difference between their internal activations at a chosen layer gives a direction in activation space corresponding to that contrast. Adding this vector, scaled by a coefficient, into the model’s activations during a later forward pass nudges its output in the desired direction (toward more positive sentiment, say, or away from toxic content) while leaving unrelated behavior largely intact.
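The arithmetic behind this can be sketched in a few lines. The example below is a toy illustration only, not the authors' implementation: the "model" is a hypothetical stack of small nonlinear functions acting on a four-number vector, and the names embed, layer, forward, steering_vector, LAYER, and COEFF are all invented for this sketch. A real transformer keeps one activation vector per token position, which this toy collapses into a single vector for brevity.

```python
import math

DIM = 4  # toy activation width (hypothetical)

def embed(prompt):
    """Toy embedding: fold character codes into a DIM-sized vector."""
    vec = [0.0] * DIM
    for i, ch in enumerate(prompt):
        vec[i % DIM] += ord(ch) / 1000.0
    return vec

def layer(h, k):
    """Toy residual layer k: h plus a small nonlinear mix of its own entries."""
    return [h[i] + 0.1 * math.tanh(h[(i + k + 1) % DIM]) for i in range(DIM)]

def forward(prompt, n_layers=3, inject_at=None, steering=None):
    """Run the toy model, optionally adding `steering` to the input of layer
    `inject_at`. Returns the final vector and the cached input of every layer."""
    h = embed(prompt)
    cache = []
    for k in range(n_layers):
        if steering is not None and k == inject_at:
            h = [a + s for a, s in zip(h, steering)]
        cache.append(list(h))
        h = layer(h, k)
    return h, cache

def steering_vector(plus_prompt, minus_prompt, at_layer, coeff):
    """coeff * (activation on plus_prompt - activation on minus_prompt)."""
    _, plus_cache = forward(plus_prompt)
    _, minus_cache = forward(minus_prompt)
    return [coeff * (p - m)
            for p, m in zip(plus_cache[at_layer], minus_cache[at_layer])]

LAYER, COEFF = 1, 5.0  # illustrative choices, not values from the paper
vec = steering_vector("Love", "Hate", at_layer=LAYER, coeff=COEFF)

# Same prompt, run once unmodified and once with the vector added at LAYER.
plain_out, plain_cache = forward("I think you are")
steered_out, steered_cache = forward("I think you are", inject_at=LAYER, steering=vec)

# The shift at the injection layer is the steering vector (up to float rounding).
shift = [s - p for s, p in zip(steered_cache[LAYER], plain_cache[LAYER])]
```

Nothing here is trained: the only inputs are the two contrasting prompts, a layer index, and a scaling coefficient. Layers before the injection point are untouched, and everything downstream sees the shifted activations.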

What makes the approach notable is its cost: it needs no gradient updates, no training data beyond the one contrasting pair, and no auxiliary model. It is closely related to the broader idea of representation engineering and to the feature steering demonstrated in interpretability work such as Scaling Monosemanticity, where amplifying or suppressing an internal feature changes behavior causally.

For a business reader, activation steering points toward cheap, fine-grained control over model behavior after deployment, as a complement to expensive retraining. It also underscores a deeper point from interpretability: a model’s behavior lives in manipulable internal directions, which is both a control opportunity and a safety consideration.
