Attention Mechanism

Context is a weighted average of every token so far.

The brief

In 2017, eight researchers at Google published a paper whose title turned out to be approximately correct: Attention Is All You Need. The paper introduced the Transformer architecture, which dispensed with the recurrent and convolutional networks that had dominated sequence modelling for a decade and replaced them with a single primitive: attention. The idea is almost embarrassingly simple. Each token in a sequence broadcasts a query (what am I looking for?), and every other token offers a key advertising what it has; the better a query and a key match, the more of the matching token's value flows in. Context becomes a search every word runs against every other word. Within five years, nearly every major language model was a transformer. ChatGPT, Claude, Gemini, the protein-folding AlphaFold, the image generators Stable Diffusion and DALL·E, the music-generation systems, the code-generation systems: all of them are built on attention, and most are transformers outright.

Self-attention lets each token in a sequence look at every other token and compute a weighted sum of their representations. The weights themselves are not stored parameters: they are computed afresh for each input, from queries and keys produced by learned projections. Concretely, the match between a token's query and another token's key sets the weight, and that weight decides how much of the second token's value flows into the first; do this for every token against every other and each word ends up rewritten as a blend of the words it found relevant. The intuition is that context is a weighted average of every relevant token so far.

The mechanism has three nice properties recurrent networks lacked. It is highly parallelizable: where a recurrent network had to pass information along the sequence one step at a time, attention compares the whole sequence at once. It has direct connections between distant tokens: a word at the end can attend straight to a word at the beginning, with no information bottleneck through a single hidden state. And it scales gracefully with both data and parameters. That combination is why attention overtook recurrence: parallel hardware could be saturated, and long-range dependencies stopped decaying with distance.

The transformer stacks self-attention layers with feed-forward layers, residual connections, and layer normalization, and trains the whole thing with backpropagation on enormous text corpora. The scaling laws (Kaplan et al. 2020, Hoffmann et al. 2022) showed empirically that loss falls predictably as model size, data size, and compute increase, which turned scaling into a programme rather than a guess. GPT-3 (2020) demonstrated that sufficiently large transformers exhibit emergent capabilities (in-context learning, few-shot reasoning, code generation) that smaller versions did not, and GPT-4, Claude, Gemini, and the open Llama family all extended the same skeleton. There is a catch: because every token attends to every other, the compute and memory spent on attention grow quadratically with sequence length, which is exactly why long-context work is hard.
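The query, key, and value story can be made concrete with a small sketch. What follows is an illustration under stated assumptions, not the code of any particular model: it uses the scaled dot-product form from the original paper (scores are query-key dot products divided by the square root of the key dimension, then passed through a softmax), a single attention head, and hand-picked toy vectors in place of the learned projections a real transformer would use. The function names and the `causal` flag are invented for this example.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def self_attention(queries, keys, values, causal=False):
    """Single-head scaled dot-product attention over one sequence.

    queries, keys, values: lists of n vectors, one per token.
    Returns (outputs, weights), where weights is an n x n table.
    """
    n = len(queries)
    d_k = len(keys[0])
    d_v = len(values[0])
    outputs, weights = [], []
    for i in range(n):
        # With causal=True, token i sees only tokens 0..i ("every token so far").
        visible = range(i + 1) if causal else range(n)
        scores = [dot(queries[i], keys[j]) / math.sqrt(d_k) for j in visible]
        row = softmax(scores)
        # Hidden positions get weight 0 so every row has n entries.
        row = row + [0.0] * (n - len(row))
        # Output i is the weighted average of the value vectors.
        out = [sum(row[j] * values[j][k] for j in range(n)) for k in range(d_v)]
        outputs.append(out)
        weights.append(row)
    return outputs, weights


# Toy vectors for three tokens (assumed for illustration only). In a real
# transformer these come from learned linear projections of token embeddings.
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

outputs, weights = self_attention(Q, K, V, causal=True)
for row in weights:
    print([round(w, 3) for w in row])
```

Each printed row is one token's attention weights and sums to 1. The table has one entry for every pair of tokens, so a sequence of n tokens needs n × n weights. Doubling the sequence length quadruples the table, which is the quadratic cost described above.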

Why now

The transformer is the dominant computational primitive of modern AI, the engine under nearly every large language model in use today, and the question of whether it is sufficient for AGI or merely a very capable specialized architecture is the subject of the loudest current debate in the field. The quadratic cost of attention is the live engineering frontier: mixture-of-experts, state-space and recurrent-style models (Mamba, RWKV), FlashAttention, and retrieval-augmented approaches are the most-watched variants, most of them aimed at making longer context affordable without paying the full quadratic bill. The Transformer paper is already among the most-cited papers in the history of computer science, a remarkable status for so recent a result. The economic infrastructure built on top of it (the GPU shortages, the data-centre buildout, the trillion-dollar valuations, the API economy) is unprecedented in its speed and scale.