Sparse Crosscoders for Cross-Layer Features and Model Diffing

Published on Anthropic’s Transformer Circuits thread in October 2024, this note introduces the crosscoder, a generalization of the sparse autoencoder (SAE). A standard SAE reads and writes the activations of a single layer; a crosscoder reads from and writes to several layers at once, learning features that span the network rather than being tied to one location.
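To make the "reads from and writes to several layers" idea concrete, here is a minimal NumPy sketch of a crosscoder forward pass and training loss. It is an illustration under stated assumptions, not the authors' implementation: the function names, tensor shapes, and the `sparsity_coeff` parameter are hypothetical, and the sparsity penalty is assumed to weight each feature's activation by the sum of its per-layer decoder norms.

```python
import numpy as np

def crosscoder_forward(acts, W_enc, b_enc, W_dec, b_dec):
    """Hypothetical shapes (assumptions for this sketch):
    acts:  (n_layers, d_model)              activations at one token position
    W_enc: (n_layers, n_features, d_model)  one encoder matrix per layer
    b_enc: (n_features,)
    W_dec: (n_layers, d_model, n_features)  one decoder matrix per layer
    b_dec: (n_layers, d_model)
    """
    # Encoder: every layer contributes to one shared feature pre-activation.
    pre = np.einsum("lfd,ld->f", W_enc, acts) + b_enc
    f = np.maximum(pre, 0.0)  # ReLU keeps feature activations non-negative
    # Decoder: each layer is reconstructed from the same feature vector.
    recon = np.einsum("ldf,f->ld", W_dec, f) + b_dec
    return f, recon

def crosscoder_loss(acts, f, recon, W_dec, sparsity_coeff=1.0):
    # Reconstruction error summed over all layers.
    mse = np.sum((acts - recon) ** 2)
    # Per-layer L2 norm of each feature's decoder vector: (n_layers, n_features).
    dec_norms = np.linalg.norm(W_dec, axis=1)
    # Sparsity term: activation weighted by the summed per-layer decoder norms.
    sparsity = np.sum(f * dec_norms.sum(axis=0))
    return mse + sparsity_coeff * sparsity
```

With `n_layers = 1` this reduces to an ordinary single-layer SAE, which is one way to see the crosscoder as a generalization of it.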

The motivation is that real model features are often spread across layers (“cross-layer superposition”), which makes single-layer SAEs both redundant and harder to interpret. By learning shared features across layers, crosscoders can simplify circuit analysis: the same concept does not have to be rediscovered separately at every depth. The authors describe three uses: resolving cross-layer superposition, simplifying circuits, and “model diffing,” where a crosscoder trained jointly on two models reveals which features they share and which are unique to one of them.
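One way to picture model diffing is sketched below, assuming that the signal for "shared" versus "unique" is how strongly each feature writes into each model, measured by the norm of its decoder vector for that model. The function names and the `low`/`high` thresholds are hypothetical choices for illustration, not values taken from the note.

```python
import numpy as np

def relative_decoder_norm(W_dec_a, W_dec_b, eps=1e-12):
    """W_dec_a, W_dec_b: (d_model, n_features) decoder weights of one jointly
    trained crosscoder, for model A and model B respectively (assumed layout).
    Returns a value in [0, 1] per feature: near 0 means the feature writes
    almost only to model A, near 1 almost only to model B."""
    norm_a = np.linalg.norm(W_dec_a, axis=0)
    norm_b = np.linalg.norm(W_dec_b, axis=0)
    # eps avoids division by zero for features that are dead in both models.
    return norm_b / (norm_a + norm_b + eps)

def classify_features(rel_norm, low=0.1, high=0.9):
    # Thresholds are illustrative assumptions, not values from the source.
    labels = np.full(rel_norm.shape, "shared", dtype=object)
    labels[rel_norm < low] = "A-only"
    labels[rel_norm > high] = "B-only"
    return labels
```

Features labeled "shared" have comparable decoder norms in both models; the other two labels flag candidates for concepts that one model has and the other lacks.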

Anthropic presents the work as preliminary, in the style of an internal research seminar rather than a finished paper. Even so, crosscoders became a building block for later interpretability tooling, including the cross-layer transcoders used in the 2025 circuit-tracing work.

For a general reader, the practical payoff is comparison: crosscoders give a way to ask how a fine-tuned or updated model differs from its predecessor at the level of internal concepts, not just outputs, which is useful for tracking what changes when a model is retrained.

