Convolutional & Recurrent Networks

CNNs assume locality and translation-invariance; RNNs assume sequence. The pre-transformer architectures of the 2012–2017 deep-learning boom; biases live on inside transformers.

Suggested next → Visual Perception · MIND

The brief

By the late 1980s, Yann LeCun at Bell Labs had a working convolutional neural network reading handwritten zip codes for the US Postal Service — a real product, doing real work, mostly ignored outside the lab. The architecture even had a name, LeNet, and a clear lineage back to the visual cortex, yet CNNs sat at the margins of computer vision for two decades, starved of the data and the compute they needed to shine. Then in 2012 AlexNet (Krizhevsky, Sutskever, Hinton) won the ImageNet competition by a margin so wide that hand-engineered computer-vision pipelines, the accumulated craft of a generation of researchers, were essentially abandoned overnight; the difference was not a new idea but enough labelled images and a way to train on graphics hardware. The recurrent neural network, with its LSTM and GRU refinements, ran the parallel story for sequences — translation, speech, language modelling — until 2017, when transformers replaced it in nearly every application.

Both architectures are inductive biases: assumptions about the structure of the data, baked into the network so that less of the structure has to be learned from scratch. A CNN assumes that meaningful features are local (a pixel's relationship to its neighbours matters more than to a pixel across the image) and translation-invariant (a cat is a cat anywhere in the frame). It implements those assumptions with a convolution — a small filter, or receptive field, slid across the image, reusing the same weights at every position so that one learned pattern detects its feature everywhere at once. That weight sharing is also what makes the network compact: a handful of filters stand in for what a fully connected layer would need millions of parameters to express. Pooling then summarizes each region, discarding precise location to gain robustness to small shifts. Stack such layers and a hierarchy emerges on its own: early filters fire on edges and textures, middle layers on motifs and shapes, later ones on parts and whole objects — a coarse echo of how the visual system itself seems to build perception. An RNN takes the opposite shape of data as given — it assumes the input is a sequence and processes it one step at a time, carrying a hidden state forward as a running memory of what came before, a natural fit for language and time series. LSTMs and GRUs added gating mechanisms to combat vanishing gradients — the tendency for error signals to decay exponentially as they propagate back through long sequences, leaving distant context effectively unlearnable; the gates let the network choose what to remember, what to forget, and what to let through unchanged. CNNs still dominate computer vision (medical imaging, satellite imagery, video). RNNs have largely been deprecated in favour of transformers, which process whole sequences in parallel rather than step by step.

Why nowThe deep lesson is that the right inductive bias — locality for images, sequentiality for text — is what made deep learning work on each domain before enough data and compute existed for the bias to become unnecessary. Transformers are a more general architecture that learns the structure rather than assuming it; they need more data and compute to do so, but the result generalizes further, which is why they swept first through language and then into vision. Modern vision is increasingly vision transformers (ViT, 2020) rather than CNNs — though hybrids and pure CNNs (ConvNeXt, 2022) remain fully competitive, a reminder that no single architecture has the last word, and the lines between the two families keep blurring. The conceptual descendants of convolution and recurrence live on inside transformer variants, often unrecognized: locality reappears as windowed attention, recurrence as the cached state of a model generating one token at a time.