In 1943, Warren McCulloch and Walter Pitts modelled a neuron as a binary threshold unit — fire if the weighted sum of inputs exceeds a threshold — and showed that networks of such units could compute any logical function. In 1958 Frank Rosenblatt built the Perceptron, a single-layer network with a learning rule, and the New York Times described it as the embryo of a machine that would eventually walk, talk, and be conscious. Eleven years later, Minsky and Papert proved single-layer networks could not even learn XOR, the press coverage collapsed, and the field entered its first long winter. The fix — multi-layer networks trained with backpropagation — was rediscovered several times before Rumelhart, Hinton, and Williams gave it canonical form in 1986. A second winter followed in the 1990s. The deep-learning revolution arrived in 2012, when AlexNet won ImageNet by a margin that ended the argument.
A neural network is a stack of linear transformations (multiply by a weight matrix, add a bias) interleaved with nonlinear activations (sigmoid historically, ReLU/GELU now). The universal approximation theorem (Cybenko 1989, Hornik 1991) guarantees that with enough hidden units such a network can approximate any continuous function — but says nothing about how many, or how to find the weights. The practical answer is backpropagation: define a loss measuring how wrong the network is, use the chain rule to propagate the gradient of the loss back through every weight, then nudge each weight in the direction that reduces the loss. Iterate. Variants (momentum, Adam, learning-rate schedules) and stochastic gradient descent on mini-batches turn what is mathematically straightforward into something that scales to networks with hundreds of billions of parameters. The deep surprise of the last decade is that this very simple recipe, applied at scale, keeps producing capabilities the theory does not predict.
Backpropagation is the most economically consequential algorithm of the twenty-first century. Every modern AI system — image recognition, speech, translation, recommenders, self-driving, AlphaFold, ChatGPT, Claude, Gemini — is a feed-forward network trained by gradient descent on a loss. The biological-plausibility critique (real neurons probably don't run backprop) remains an open question for theoretical neuroscience and a non-issue for practice. The architectural story (CNNs, transformers, diffusion) and the interpretability story (what's actually inside a trained network) have grown into their own concepts; this brief is the foundation they sit on.