Backpropagation

Blame flows backward along the chain rule.

The brief

In 1986, a paper by Rumelhart, Hinton, and Williams in Nature gave the most influential treatment of an algorithm called backpropagation: a way of efficiently computing how a neural network's weights should change to reduce its error. The core idea is older. Closely related methods had been worked out independently several times in the 1960s and 1970s, and Seppo Linnainmaa had described the underlying reverse-mode differentiation in 1970. What the 1986 paper supplied was the cultural moment at which the field recognized that training multi-layer networks was solved in principle, that error at the output could be assigned, fairly, to every weight that helped produce it. Forty years later, nearly every modern deep-learning system is trained with backpropagation, and the consequences have rearranged industries.

Backpropagation is the chain rule applied at industrial scale. A neural network is a chain of differentiable transformations that runs from an input to an output and then into a loss function, which measures how wrong the network is. Each training step runs a forward pass, in which data flows through the layers to a prediction, and then a backward pass, in which the error at the output is propagated backward, layer by layer, so that each weight is assigned its share of the mistake. The chain rule gives the gradient of the loss with respect to every parameter as a product of local Jacobians, accumulated from the output back toward the input. With the gradient in hand, gradient descent (or a stochastic variant) nudges every parameter a small step downhill, toward lower loss; repeat for billions of examples.

The genius of the technique is computational efficiency: the backward pass costs only a small constant multiple of the forward pass, so both are O(network size), and a single backward pass yields the gradient for every parameter at once. That is what made deep networks actually trainable rather than a theoretical curiosity. The early obstacles were real: vanishing and exploding gradients, where error signals shrink to nothing or blow up as they pass through many layers, long kept depth out of reach. The problem was tamed by a stack of innovations, including ReLU activations, normalization layers, and residual connections that give gradients a clean path backward. The 2012 AlexNet result on ImageNet, which crushed hand-engineered vision pipelines, was the empirical proof that the recipe worked at scale. Everything since (image generation, voice assistants, AlphaGo, GPT, Claude, AlphaFold) has been an application or extension of the same paradigm.
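The forward pass, backward pass, and gradient step described above can be made concrete with a minimal sketch. Everything in it is an illustrative assumption rather than anything from the 1986 paper: a tiny network with one tanh hidden layer, a squared-error loss, a single made-up training example, and hypothetical helper names such as forward, backward, and sgd_step. It uses only the Python standard library and checks one backpropagated gradient against a finite-difference estimate.

```python
import math
import random

def forward(params, x):
    """Forward pass: input -> tanh hidden layer -> scalar prediction.
    Also returns the hidden activations, which the backward pass reuses."""
    W1, b1, w2, b2 = params
    h = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
         for row, b in zip(W1, b1)]
    y_hat = sum(w * hj for w, hj in zip(w2, h)) + b2
    return y_hat, h

def squared_error(y_hat, y):
    """Loss: how wrong the prediction is."""
    return 0.5 * (y_hat - y) ** 2

def backward(params, x, y, y_hat, h):
    """Backward pass: chain rule from the loss back to every parameter."""
    _, _, w2, _ = params
    d_y = y_hat - y                                  # dL/dy_hat
    g_w2 = [d_y * hj for hj in h]                    # dL/dw2
    g_b2 = d_y                                       # dL/db2
    d_h = [d_y * w for w in w2]                      # error pushed back to the hidden layer
    d_z = [dh * (1.0 - hj * hj)                      # through tanh: derivative is 1 - tanh^2
           for dh, hj in zip(d_h, h)]
    g_W1 = [[dz * xi for xi in x] for dz in d_z]     # dL/dW1
    g_b1 = d_z                                       # dL/db1
    return g_W1, g_b1, g_w2, g_b2

def sgd_step(params, grads, lr):
    """Return new parameters moved a small step against the gradient."""
    W1, b1, w2, b2 = params
    g_W1, g_b1, g_w2, g_b2 = grads
    return (
        [[w - lr * g for w, g in zip(row, g_row)] for row, g_row in zip(W1, g_W1)],
        [b - lr * g for b, g in zip(b1, g_b1)],
        [w - lr * g for w, g in zip(w2, g_w2)],
        b2 - lr * g_b2,
    )

def numeric_grad_w1(params, x, y, i, j, eps=1e-5):
    """Central finite difference for dL/dW1[i][j]; used only to sanity-check backward()."""
    W1, b1, w2, b2 = params

    def loss_with(delta):
        W = [row[:] for row in W1]
        W[i][j] += delta
        return squared_error(forward((W, b1, w2, b2), x)[0], y)

    return (loss_with(eps) - loss_with(-eps)) / (2 * eps)

# Small random weights and one made-up training example (assumptions for the demo).
rng = random.Random(0)
params = (
    [[rng.uniform(-0.5, 0.5) for _ in range(2)] for _ in range(3)],  # W1: 3 hidden units, 2 inputs
    [0.0, 0.0, 0.0],                                                 # b1
    [rng.uniform(-0.5, 0.5) for _ in range(3)],                      # w2
    0.0,                                                             # b2
)
x, y = [0.5, -1.0], 1.0

y_hat, h = forward(params, x)
grads = backward(params, x, y, y_hat, h)
numeric = numeric_grad_w1(params, x, y, 0, 0)
print("backprop dL/dW1[0][0]:", grads[0][0][0], "finite difference:", numeric)

# One gradient-descent step on the same example should lower its loss.
loss_before = squared_error(y_hat, y)
new_params = sgd_step(params, grads, lr=0.05)
loss_after = squared_error(forward(new_params, x)[0], y)
print("loss before:", loss_before, "loss after:", loss_after)
```

The finite-difference check needs two extra forward passes for each parameter it tests, whereas backward() returns every gradient in one sweep. That gap is the efficiency argument in miniature.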

Why now

Backpropagation is arguably the most economically consequential algorithm of the twenty-first century. The current frontier of large language models, diffusion models, multimodal systems, and robotics policies is all backpropagation at increasing scale. The biological-plausibility critique (real neurons probably do not implement backprop) remains an active question for theoretical neuroscience, but the pragmatic AI community treats it as a non-issue: whatever the brain does, backprop works.