In 2017, eight Google Brain researchers published a paper with a titular claim that turned out to be approximately correct: Attention Is All You Need. The paper introduced the Transformer architecture, which dispensed with the recurrent and convolutional networks that had dominated sequence modelling for a decade and replaced them with a single primitive: attention. Within five years, every major language model was a transformer. ChatGPT, Claude, Gemini, the protein-folding AlphaFold, the image-generators Stable Diffusion and DALL·E, the music-generation systems, the code-generation systems — all transformers, all running on attention.
Self-attention lets each token in a sequence look at every other token and compute a weighted sum of their representations, where the weights are learned. The intuition is that context is a weighted average of every relevant token so far. The mechanism has three nice properties recurrent networks lacked: it is highly parallelizable (every token can attend to every other simultaneously), it has direct connections between distant tokens (no information bottleneck through hidden state), and it scales gracefully with both data and parameters. The transformer architecture stacks self-attention layers with feed-forward layers, residual connections, and layer normalization, and trains the whole thing with backpropagation on enormous text corpora. The scaling laws — Kaplan et al. 2020, Hoffmann et al. 2022 — empirically showed that loss falls predictably as model size, data size, and compute increase; this turned scaling into a programme rather than a guess. GPT-3 (2020) demonstrated that sufficiently large transformers exhibit emergent capabilities — in-context learning, few-shot reasoning, code generation — that smaller versions did not. GPT-4 (2023), Claude 3 and successors, Gemini, and the open-source Llama family all extended the same architecture with refinements (RLHF, mixture-of-experts, longer context windows).
The transformer is the dominant computational primitive of modern AI, and the question of whether it is sufficient for AGI or merely a very capable specialized architecture is the subject of the loudest current debate in the field. Mixture-of-experts, state-space models (Mamba, RWKV), and retrieval-augmented approaches are the most-watched architectural variants. The Transformer paper has been cited over 130,000 times, making it one of the most influential individual papers in the history of computer science. The economic infrastructure built on top of it — the GPU shortages, the data-centre buildout, the API economy — is genuinely unprecedented in its speed and scale.