PolymathicAll ideas →
Computer Science & AI

Convolutional & Recurrent Networks

CNNs assume locality and translation-invariance; RNNs assume sequence. The pre-transformer architectures of the 2012–2017 deep-learning boom; biases live on inside transformers.

By the late 1980s, Yann LeCun at Bell Labs had a working convolutional neural network reading handwritten zip codes for the US Postal Service — a real product, doing real work, mostly ignored outside the lab. CNNs sat at the margins of computer vision for two decades. Then in 2012 AlexNet (Krizhevsky, Sutskever, Hinton) won the ImageNet competition by a margin so wide that hand-engineered computer-vision pipelines were essentially abandoned overnight. The recurrent neural network, with its LSTM and GRU refinements, ran the parallel story for sequences — translation, speech, language modelling — until 2017, when transformers replaced it in nearly every application.

Both architectures are inductive biases: assumptions about the structure of the data, baked into the network so that less of the structure has to be learned from scratch. A CNN assumes that meaningful features are local (a pixel's relationship to its neighbours matters more than to a pixel across the image) and translation-invariant (a cat is a cat anywhere in the frame). It implements those assumptions with a convolution — a small filter slid across the image — followed by pooling to summarize regions. An RNN assumes the data is a sequence and processes it one step at a time, carrying a hidden state forward. LSTMs and GRUs added gating mechanisms to combat vanishing gradients — the tendency for error signals to decay exponentially as they propagate back through long sequences. CNNs still dominate computer vision (medical imaging, satellite imagery, video). RNNs have largely been deprecated in favour of transformers, which process sequences in parallel rather than step by step.

Why it matters now

The deep lesson is that the right inductive bias — locality for images, sequentiality for text — is what made deep learning work on each domain before enough data and compute existed for the bias to become unnecessary. Transformers are a more general architecture that learns the structure rather than assuming it; they need more data and compute to do so, but the result generalizes further. Modern vision is increasingly vision transformers (ViT, 2020) rather than CNNs — though hybrids and pure CNNs (ConvNeXt, 2022) remain competitive. The conceptual descendants of convolution and recurrence live on inside transformer variants, often unrecognized.

Read it in Polymathic →Browse the catalogue
Polymathic — a curated catalogue of the ideas worth keeping across twelve disciplines. polymathic.app