In this talk given at an AI alignment workshop, Chris Olah introduces mechanistic interpretability, the effort to reverse-engineer the internal computations of neural networks into human-understandable components. Rather than treating a trained model as an inscrutable black box, this research program tries to identify the features a network represents and the circuits that combine them.
Olah explains key ideas from the field, including how individual neurons can correspond to recognizable concepts, how features combine into circuits that implement specific behaviors, and the problem of superposition, where networks represent more features than they have neurons by assigning them overlapping, non-orthogonal directions in activation space. He frames recent progress as turning what once looked like a fundamental obstacle into a more tractable engineering challenge.
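As a rough illustration of superposition, here is a minimal toy sketch. It is not taken from the talk: the choice of 3 features in 2 neurons, the evenly spaced directions, the ReLU readout, and all names (`DIRECTIONS`, `encode`, `decode`, `readout`) are assumptions made for the example.

```python
import math

# Hypothetical toy setup: 3 features stored in a layer with only 2 neurons.
N_FEATURES = 3

# Assumption: each feature gets a unit direction in the 2-D neuron space,
# spaced 120 degrees apart, so no two directions are orthogonal.
DIRECTIONS = [
    (math.cos(2 * math.pi * k / N_FEATURES), math.sin(2 * math.pi * k / N_FEATURES))
    for k in range(N_FEATURES)
]

def encode(feature_values):
    """Write features into 2 neuron activations as a weighted sum of directions."""
    x = sum(v * d[0] for v, d in zip(feature_values, DIRECTIONS))
    y = sum(v * d[1] for v, d in zip(feature_values, DIRECTIONS))
    return (x, y)

def decode(activations):
    """Read each feature back by projecting the activations onto its direction."""
    return [activations[0] * d[0] + activations[1] * d[1] for d in DIRECTIONS]

def readout(feature_values):
    """Encode, decode, then apply ReLU to clip negative interference to zero."""
    return [max(0.0, r) for r in decode(encode(feature_values))]

# One active feature: the raw projection leaks about -0.5 onto the other two
# features, but the ReLU removes it, giving approximately [1.0, 0.0, 0.0].
print(decode(encode([1.0, 0.0, 0.0])))
print(readout([1.0, 0.0, 0.0]))

# Two active features interfere: each is read back at about 0.5 instead of 1.0.
print(readout([1.0, 1.0, 0.0]))
```

In this sketch the overlap costs nothing while only one feature is active at a time, and it degrades the readout once features co-occur. That trade-off is one common way to picture why sparse features can share neurons.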
Understanding what models compute internally is central to making advanced AI safe and trustworthy, since we cannot fully rely on systems whose reasoning we cannot inspect. As one of the founders of this line of work, Olah gives an authoritative and unusually clear account of where it stands. For a reader curious about whether we can ever open up the black box, this talk is an excellent introduction.