PolymathicAll ideas →
Computer Science & AI

Large Language Models

Transformer networks trained on the public internet, scaled past where new capabilities emerge — the technology this cycle turns on.

In June 2017, eight researchers at Google published Attention Is All You Need — a paper introducing the transformer architecture, framed as a focused contribution to machine translation. Within five years it had displaced the recurrent neural networks that had previously dominated language; within seven, it had colonised image, video, code, audio, and protein-structure prediction as well. The lineage from one paper to ChatGPT (November 2022, 100 million users in two months) to the frontier models of 2025 is one of the most concentrated technological accelerations in human history — and one of the most contested in interpretation.

The central innovation is self-attention: each position in a sequence computes a weighted average of every other position's representation, with the weights learned dynamically from content. This replaced the sequential, one-token-at-a-time processing of RNNs with a parallel computation across the whole sequence — a massive practical advantage on modern GPUs. The original paper bundled three pieces: multi-head attention (several attention computations in parallel, attending to different aspects of context), positional encodings (since attention is order-independent, position is added explicitly), and layer normalisation with residual connections (which make very deep stacks trainable). On top of the architecture, the pre-training recipe — train a transformer to predict the next token of a large text corpus — turned out, given enough parameters and data, to produce models with surprisingly broad capabilities far beyond the training objective. That second observation, more than the architecture itself, is what made the transformer the dominant primitive of modern AI.

Why it matters now

Modern frontier models — from OpenAI, Anthropic, Google, Meta, and DeepSeek among others — are all transformer-based, with hundreds of billions to trillions of parameters. Open-weights models (Llama, Mistral, DeepSeek, Qwen) have closed much of the gap with closed frontier labs and are reshaping the political economy of AI. Inference cost per equivalent capability has fallen by roughly 100× since 2022, a curve faster than Moore's law. The transformer is the substrate; what to do with it (scaling, reasoning models, alignment, agentic tool use) has become its own set of concepts.

Read it in Polymathic →Browse the catalogue
Polymathic — a curated catalogue of the ideas worth keeping across twelve disciplines. polymathic.app