Scaling Laws

Language-model test loss falls as a power law in compute, data, and parameters.

The brief

In January 2020, Jared Kaplan and colleagues at OpenAI published Scaling Laws for Neural Language Models, a careful empirical study showing that the test loss of transformer language models decreased predictably as a power law in three quantities: the number of parameters (N), the amount of training data (D), and the amount of compute (C). The relationship held across seven orders of magnitude and showed no sign of breaking down at the largest scales tested. Two years later, Hoffmann et al. at DeepMind published Training Compute-Optimal Large Language Models (the Chinchilla paper), which corrected Kaplan's compute-allocation prescription: where Kaplan had favoured growing model size faster than data, Hoffmann et al. found that optimal N and D should grow together, in roughly equal proportion. Together these papers transformed AI research from an art whose progress was hard to predict into an industrial pipeline whose returns on compute were forecastable.
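To make the power-law claim concrete, here is a minimal sketch of how such a curve might be fitted and extrapolated. It assumes the simplest form, loss = a · N^(−alpha), with no irreducible-loss term, and it runs on synthetic data: the constants `TRUE_A` and `TRUE_ALPHA`, the noise level, and the helper name `fit_power_law` are illustrative assumptions, not values or code from either paper.

```python
import math
import random


def fit_power_law(xs, losses):
    """Least-squares fit of loss = a * x**(-alpha).

    A power law is a straight line in log-log space, so this is an
    ordinary linear regression of log(loss) on log(x).
    """
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in losses]
    n = len(lx)
    mean_x = sum(lx) / n
    mean_y = sum(ly) / n
    sxx = sum((u - mean_x) ** 2 for u in lx)
    sxy = sum((u - mean_x) * (v - mean_y) for u, v in zip(lx, ly))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope  # (a, alpha)


# Synthetic "training runs": hypothetical constants chosen for illustration only.
TRUE_A, TRUE_ALPHA = 20.0, 0.08
rng = random.Random(0)
param_counts = [10 ** (6 + 0.5 * i) for i in range(7)]  # 1e6 .. 1e9 parameters
observed = [
    TRUE_A * n ** (-TRUE_ALPHA) * math.exp(rng.gauss(0.0, 0.005))
    for n in param_counts
]

a, alpha = fit_power_law(param_counts, observed)

# Extrapolate one order of magnitude beyond the largest "run".
predicted_loss = a * (1e10) ** (-alpha)
print(f"fitted alpha = {alpha:.4f}, predicted loss at 1e10 params = {predicted_loss:.3f}")
```

The forecasting step is the last line: once the exponent is pinned down on small runs, the loss of a much larger run can be estimated before it is trained. Real fits are more delicate than this sketch suggests, since the result depends on the functional form chosen and on how well each small run was tuned.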

What Kaplan reported was an empirical regularity rather than a theoretical result: across seven orders of magnitude, the test loss of transformer language models fell as a power law in the number of parameters, the amount of training data, and the amount of compute, with smooth curves and no inflection points marking a departure from the trend. Similar exponents reappeared across model families running the same overall recipe, which made the result look less like an architectural quirk and more like a property of the loss landscape itself. For an industry that had spent the previous decade depending on chance breakthroughs, the practical consequence was enormous: capability returns on compute were now forecastable.

The Chinchilla paper from DeepMind in 2022 corrected the original recipe in a way that mattered. Kaplan's experiments had been confounded by suboptimal learning-rate schedules, and under properly tuned conditions parameters and data should grow together, in roughly equal proportion. The Chinchilla model (70 billion parameters trained on 1.4 trillion tokens) outperformed contemporary 175- to 280-billion-parameter models trained on less data, and the resulting rule of thumb (around twenty tokens per parameter) became the operating recipe of the post-2022 frontier.

Inference-time scaling, in which test-time compute (longer chains of thought, more samples, search over reasoning traces) trades against pretraining compute, opened a second scaling regime that OpenAI's o1 and o3 series exploited. Emergent capabilities have been catalogued in the hundreds, but Schaeffer and colleagues argued in 2023 that many apparent emergences are artefacts of all-or-nothing accuracy metrics; measured on continuous metrics, the underlying capability improves smoothly.

By 2024 several lines of evidence suggested raw pretraining gains were tapering: high-quality text data was approaching exhaustion, with synthetic data filling part of the gap, with mixed results. The DeepSeek releases of late 2024 and early 2025 achieved frontier-class capabilities at substantially lower training cost than US labs had been spending.
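The twenty-tokens-per-parameter rule of thumb can be turned into a back-of-the-envelope sizing calculation. The sketch below relies on one assumption not stated in the text: that training compute is roughly C ≈ 6 · N · D floating-point operations, a widely used approximation for dense transformers. The function name `compute_optimal_split` is hypothetical, and the constant 20 is the rule of thumb, not a fitted value.

```python
import math

FLOPS_PER_PARAM_TOKEN = 6.0  # assumption: C ≈ 6 * N * D for a dense transformer
TOKENS_PER_PARAM = 20.0      # rule of thumb quoted in the text


def compute_optimal_split(compute_flops, tokens_per_param=TOKENS_PER_PARAM):
    """Split a compute budget into parameters N and tokens D.

    Solves C = 6 * N * D together with D = tokens_per_param * N,
    which gives N = sqrt(C / (6 * tokens_per_param)).
    """
    n_params = math.sqrt(compute_flops / (FLOPS_PER_PARAM_TOKEN * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens


# Budget implied by the Chinchilla configuration under the 6*N*D approximation.
chinchilla_flops = FLOPS_PER_PARAM_TOKEN * 70e9 * 1.4e12
n_params, n_tokens = compute_optimal_split(chinchilla_flops)
print(f"N = {n_params:.3g} parameters, D = {n_tokens:.3g} tokens")

# With a fixed tokens-per-parameter ratio, 100x more compute means
# 10x more parameters and 10x more tokens: both grow as sqrt(C).
big_params, big_tokens = compute_optimal_split(100 * chinchilla_flops)
print(f"100x compute -> {big_params / n_params:.1f}x params, {big_tokens / n_tokens:.1f}x tokens")
```

Fed the budget implied by 70 billion parameters and 1.4 trillion tokens, the function returns that same split, which is only a consistency check on the arithmetic. The second call shows what "grow together, in roughly equal proportion" means in practice: both quantities scale with the square root of compute.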

Why now

Scaling laws are now the planning instrument of the AI industry. Capital expenditure on data centres, GPU clusters, and energy contracts; model architecture choices and data curation investments; talent allocation across labs: all are made with explicit reference to projected scaling curves. Frontier training clusters of a hundred thousand H100-class GPUs are routine, and clusters approaching a million GPUs are under construction across OpenAI, xAI, Anthropic, Meta, and Google as direct consequences of scaling-law-derived projections. Frontier training sites are now being planned around gigawatt-scale power draw, with data centres increasingly co-located with new nuclear and renewable installations. US export controls on advanced GPUs to China are themselves motivated in part by scaling-law arguments about the strategic value of compute concentration.