Computer Science & AI

Mechanistic Interpretability

Reverse-engineer the algorithms a trained neural network has learned — islands of clarity in a large unknown.

A working neural network is legible to no one. We can train it, deploy it, and measure its outputs — but the hundreds of billions of weights inside it are an alien artifact, a frozen pattern of numbers that performs computations we cannot read. Mechanistic interpretability is the research programme that takes this seriously: the goal is not just to predict what a model will do, but to reverse-engineer the algorithm it has learned and understand its internal computations the way one might understand a piece of human-written software.

The early wins were in vision. Chris Olah and collaborators (first at Google, then at OpenAI, then at Anthropic) showed that early CNN layers learn edge detectors, middle layers learn textures and parts, and deep layers learn object detectors, and that these features are connected by interpretable circuits computing recognizable algorithms (curve detection, dog-head detection).

The frontier then moved to transformers. Anthropic's 2024 "Mapping the Mind of a Large Language Model" described using sparse autoencoders to extract millions of human-interpretable features from a frontier LLM: features for the Golden Gate Bridge, for code bugs, for sycophancy, for deception.

The deep difficulty is superposition: networks encode more features than they have neurons by representing each feature as a direction in activation space that overlaps with others, so a single neuron participates in many features and naïve neuron-by-neuron interpretation fails. The research toolkit (probing, ablation, activation patching, and dictionary-learning methods such as sparse autoencoders) is improving rapidly; current understanding is best described as islands of clarity in a large unknown.
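The superposition point can be made concrete with a toy sketch. Everything in it is an illustrative assumption rather than anything taken from the work cited above: the sizes (1,024 features, 256 neurons), the random feature directions, the 0.5 and 0.05 thresholds, and the helper names are all hypothetical. It packs more features than neurons into one activation vector, then compares reading along feature directions with reading a single neuron.

```python
import math
import random

N_FEATURES = 1024   # hypothetical number of features the layer must represent
N_NEURONS = 256     # hypothetical layer width, four times smaller


def random_unit_vectors(n_features, n_neurons, seed=0):
    """Give each feature a random unit-length direction in neuron space."""
    rng = random.Random(seed)
    directions = []
    for _ in range(n_features):
        v = [rng.gauss(0.0, 1.0) for _ in range(n_neurons)]
        norm = math.sqrt(sum(x * x for x in v))
        directions.append([x / norm for x in v])
    return directions


def encode(active, directions):
    """Activation vector: the sum of the directions of the active features."""
    act = [0.0] * len(directions[0])
    for f in active:
        for i, w in enumerate(directions[f]):
            act[i] += w
    return act


def readout(act, directions):
    """Project the activation vector onto every feature direction."""
    return [sum(a * w for a, w in zip(act, d)) for d in directions]


directions = random_unit_vectors(N_FEATURES, N_NEURONS)
active = {3, 700}                      # only a few features fire at once
act = encode(active, directions)
scores = readout(act, directions)

# Reading along feature directions: keep features whose projection is large.
recovered = {f for f, s in enumerate(scores) if s > 0.5}

# Reading one neuron: count how many features put sizable weight on neuron 0.
neuron0_load = sum(1 for d in directions if abs(d[0]) > 0.05)

print("recovered features:", sorted(recovered))
print("features with sizable weight on neuron 0:", neuron0_load)
```

In this toy setting, random directions in a wide space are nearly orthogonal, so projecting onto them should separate the few active features from the rest, while any single neuron carries weight from many features at once. Here the directions are known in advance; in a trained network they are not, which is the gap that dictionary-learning methods such as sparse autoencoders try to close.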

Why it matters now

Interpretability matters most because it bears on AI safety. If we cannot read a model's internal reasoning, we cannot tell whether it is genuinely solving a problem or pattern-matching its way to a plausible answer; we cannot tell whether it has learned a deceptive strategy; we cannot tell whether alignment training reaches the underlying behaviour or just the surface. The field is small, fast-moving, and disproportionately concentrated at the frontier labs (Anthropic, OpenAI, DeepMind) and a handful of academic groups. Whether interpretability scales fast enough to keep up with capability is one of the genuinely consequential research questions of the decade.

Polymathic — a curated catalogue of the ideas worth keeping across twelve disciplines. polymathic.app