Towards Monosemanticity: Decomposing Language Models With Dictionary Learning

“Towards Monosemanticity: Decomposing Language Models With Dictionary Learning” was published on October 5, 2023 by Anthropic’s interpretability team. It offered a practical answer to a problem the team had named the year before: individual neurons in a network are often polysemantic, firing for many unrelated concepts at once, which makes the network hard to interpret one neuron at a time.

The method is dictionary learning via a sparse autoencoder. The researchers took a small one-layer transformer and trained an autoencoder to re-express the activations of its 512 MLP neurons as a sparse combination of a much larger set of learned directions, called features. Because the autoencoder is penalized for using many features at once, each feature tends to capture a single, coherent concept rather than a tangle of them.
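To make the idea concrete, here is a minimal NumPy sketch of a sparse autoencoder's forward pass and training objective. It is an illustration under stated assumptions, not the authors' implementation: the weights are random and untrained, the dictionary size of 4,096 is an assumed 8x expansion of the 512 activations, and the names (`W_enc`, `W_dec`, `L1_COEFF`, `sae_loss`) and the penalty weight are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

N_NEURONS = 512     # activations being decomposed (from the text)
N_FEATURES = 4096   # assumed dictionary size, 8x larger than the input
L1_COEFF = 1e-3     # hypothetical weight on the sparsity penalty

# Randomly initialised parameters; a real autoencoder would learn these.
W_enc = rng.normal(0.0, 0.02, size=(N_NEURONS, N_FEATURES))
b_enc = np.zeros(N_FEATURES)
W_dec = rng.normal(0.0, 0.02, size=(N_FEATURES, N_NEURONS))
# Give every feature direction (row of W_dec) unit length.
W_dec /= np.linalg.norm(W_dec, axis=1, keepdims=True)
b_dec = np.zeros(N_NEURONS)


def encode(x):
    """Map activations to non-negative feature activations (ReLU)."""
    return np.maximum(0.0, (x - b_dec) @ W_enc + b_enc)


def decode(f):
    """Rebuild activations as a weighted sum of feature directions."""
    return f @ W_dec + b_dec


def sae_loss(x):
    """Reconstruction error plus an L1 penalty that rewards sparse features."""
    f = encode(x)
    x_hat = decode(f)
    reconstruction = np.mean(np.sum((x - x_hat) ** 2, axis=-1))
    sparsity = np.mean(np.sum(np.abs(f), axis=-1))
    return reconstruction + L1_COEFF * sparsity, f, x_hat


# Stand-in batch of 4 activation vectors; real inputs would come from the model.
x = rng.normal(size=(4, N_NEURONS))
loss, f, x_hat = sae_loss(x)
```

Minimizing this loss with gradient descent trades off two goals: reconstructing the original activations faithfully, and doing so with as few active features as possible. The L1 term is what supplies the pressure toward sparsity described above.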

From those 512 neurons they extracted more than 4,000 features, many of them strikingly interpretable: features that activate on DNA sequences, legal language, HTTP requests, Hebrew text, nutrition statements, and much more. The authors argued that these features, not raw neurons, are the right unit of analysis for understanding a network. They also showed that similar features recur across independently trained models, and that the features are causally meaningful: artificially switching one on changes the model’s output in a predictable way.

For a general reader, this work matters because it is a credible path to opening the black box. If the concepts a model uses can be pulled apart and labeled, then auditing, debugging, and steering a model’s behavior become real engineering tasks rather than guesswork. The paper set the stage for scaling the same idea to a production-grade model, Claude 3 Sonnet, the following year.

