Mechanistic Interpretability

Reverse-engineer the algorithms a trained neural network has learned — islands of clarity in a large unknown.

The brief

A working neural network is legible to no one. We can train it, deploy it, and measure its outputs, but the weights inside it (hundreds of billions of them in the largest models) are an alien artifact: a frozen pattern of numbers that performs computations we cannot read. Mechanistic interpretability is the research programme that takes this seriously. Rather than treating the model as a black box to be probed only from the outside, through its behaviour, it tries to reverse-engineer the trained network into human-understandable parts: the features it represents and the circuits that combine them into computation. The goal is not just to predict what a model will do, but to understand the algorithm it has learned the way one might understand a piece of human-written software: line by line, mechanism by mechanism, until the behaviour stops being a surprise.

The early wins were in vision. Chris Olah and collaborators showed that early CNN layers learn edge detectors, middle layers learn textures and parts, and deep layers learn object detectors, and that these features are connected by interpretable circuits that compute recognizable algorithms (curve detection, dog-head detection). Crucially, a circuit can be tested: ablate or amplify its parts and, if the account is right, the behaviour shifts as predicted. That turns a plausible narrative into a causal claim rather than a just-so story.

The frontier then moved to transformers, where researchers found small reusable mechanisms such as induction heads: paired attention heads that notice a token has appeared before and copy what followed it. This simple in-context copying rule is thought to underlie much of a model's startling ability to learn from its own prompt. But neuron-by-neuron reading kept failing, because most neurons are polysemantic: a single unit fires for many unrelated concepts. The deep cause is superposition: a network packs more features than it has dimensions by storing them as overlapping linear combinations of neurons, trading a little interference for far greater representational capacity.

The response has been sparse dictionary learning: training sparse autoencoders that decompose tangled activations into a large vocabulary of cleaner, monosemantic features, each one ideally meaning a single thing. Using them, work such as Anthropic's "Mapping the Mind of a Large Language Model" extracted millions of human-interpretable features from a frontier model (features for landmarks, for code bugs, for sycophancy, for deception) and showed they could be turned up or down to steer behaviour. The toolkit (probing, ablation, activation patching, dictionary learning) is improving rapidly, but current understanding is best described as islands of clarity in a large unknown.
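
The superposition idea described above can be made concrete with a toy sketch. It assigns each of 1024 hypothetical features a random unit direction in a 512-dimensional space, switches on three of them, and checks whether a simple dot-product readout can tell which ones were active. Everything here is an assumption chosen for illustration: the sizes, the random directions, the 0.5 threshold, and the function names come from no particular paper or library, and a real network learns its directions rather than drawing them at random. Random directions are used only because, in high dimensions, they are close to orthogonal.

```python
import math
import random


def random_unit_vectors(n_features, n_dims, seed=0):
    """Return n_features random unit vectors in n_dims dimensions."""
    rng = random.Random(seed)
    vectors = []
    for _ in range(n_features):
        v = [rng.gauss(0.0, 1.0) for _ in range(n_dims)]
        norm = math.sqrt(sum(x * x for x in v))
        vectors.append([x / norm for x in v])
    return vectors


def superpose(active, directions):
    """Activation vector: the sum of the directions of the active features."""
    x = [0.0] * len(directions[0])
    for i in active:
        for k, value in enumerate(directions[i]):
            x[k] += value
    return x


def read_out(x, directions):
    """Project the activation onto every feature direction (dot products)."""
    return [sum(a * b for a, b in zip(x, d)) for d in directions]


def decode(x, directions, threshold=0.5):
    """Indices of features whose readout exceeds the (assumed) threshold."""
    return {i for i, r in enumerate(read_out(x, directions)) if r > threshold}


# Twice as many features as dimensions (illustrative sizes).
N_FEATURES, N_DIMS = 1024, 512
directions = random_unit_vectors(N_FEATURES, N_DIMS)

# A sparse input: only 3 of the 1024 features are switched on.
active = {3, 141, 592}
x = superpose(active, directions)

recovered = decode(x, directions)
readouts = read_out(x, directions)

# Interference: the largest readout on any feature that was NOT active.
interference = max(abs(r) for i, r in enumerate(readouts) if i not in active)

print("recovered the active set:", recovered == active)
print("worst-case interference:", round(interference, 3))
```

The inactive features do not read out as exactly zero, which is the "little interference" the text refers to. In this toy setup the interference grows as more features are active at once, which is why the trick depends on features being sparse.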

Why now

Interpretability matters most because it bears on AI safety. If we cannot read a model's internal reasoning, we cannot tell whether it is genuinely solving a problem or pattern-matching its way to a plausible answer; we cannot tell whether it has quietly learned a deceptive strategy; we cannot tell whether alignment training reaches the underlying behaviour or only files down the surface. The same gap erodes trust: a system we cannot inspect is one we must take on faith, and faith does not scale to high-stakes deployment in medicine, finance, or law. The promise of interpretability is a different relationship: auditing a model's mechanisms before we rely on them, and catching failures by reading the computation rather than waiting for the model to misbehave. This is a young, fast-moving field, only a few years old in its modern form and disproportionately concentrated at the frontier labs and a handful of academic groups; its methods are still rough and its findings provisional. Whether interpretability scales fast enough to keep up with capability is one of the genuinely consequential research questions of the decade.