Reasoning Models & Inference-Time Compute

Train models to spend tokens thinking before answering. Test-time compute as a scaling axis distinct from pre-training.

The brief

Through 2023, frontier-model capability was dominated by pre-training scale: bigger models, more data, more compute up front, then an immediate answer at query time with no budget for deliberation. The seed of the shift was an old, almost embarrassingly simple trick: let a model write out its chain of thought (intermediate steps, scratch work, dead ends) before committing to an answer, and its performance on hard reasoning jumps. In 2024, OpenAI released o1, an LLM trained to spend tokens thinking before answering, generating long internal chains of reasoning that the user mostly never sees. The capability gain on hard mathematical, scientific, and coding problems was large enough that other major labs (Anthropic, Google, and DeepSeek with its open-weights R1) shipped reasoning models built on broadly similar recipes within months.

The training pattern, broadly: take a strong pre-trained LLM and reward it for producing chains of thought that lead to correct answers on hard problems. The model learns to spend more tokens, sometimes thousands, exploring, backtracking, and checking its own work before committing. This is inference-time compute as a scaling axis, distinct from the pre-training scaling laws (Kaplan et al. 2020; Chinchilla, Hoffmann et al. 2022) that governed the prior era. The broader principle is that you can buy accuracy on a hard problem by spending more computation at the moment you answer rather than only during training, and there is more than one way to spend it: longer deliberate reasoning, sampling many candidate solutions and selecting the best, or explicit search and verification over a space of partial answers. What training adds is the discipline to use that budget well rather than merely burn it.

Capability gains on the hardest benchmarks (graduate-level science, olympiad mathematics, competitive programming) have been steep. The mechanism is only partly understood: the chains of thought look like reasoning, but whether they faithfully reflect the model's actual computation or are post-hoc rationalisation is an open question, with implications for trust, evaluation, and safety. There are also limits: latency and cost rise with every extra token, and more tokens are not always better. Past some point a model can talk itself out of a correct answer.

Emergent capabilities, meaning sudden jumps in ability at scale (Wei et al. 2022; contested by Schaeffer et al. 2023, who argued that some apparent emergence is an artifact of metric choice), get a second life under this paradigm: more thinking time can produce qualitatively different behaviour from a fixed model.
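
As a concrete illustration of the sample-and-select strategy described above, here is a minimal Python sketch of majority voting over repeated samples. Everything in it is an assumption made for illustration: `noisy_solver` is a hypothetical stand-in for one sampled chain of thought (right 40% of the time, otherwise a scattered wrong answer), and `majority_vote` and `accuracy` are names invented here, not any lab's actual method.

```python
import random
from collections import Counter
from typing import Callable

TRUTH = 42  # the correct answer in this toy problem (an assumption)


def noisy_solver(rng: random.Random, p_correct: float = 0.4) -> int:
    """Hypothetical stand-in for one sampled chain of thought.

    Returns the right answer with probability p_correct; otherwise returns
    a uniformly random integer in [0, 99] (which may equal TRUTH by chance).
    """
    if rng.random() < p_correct:
        return TRUTH
    return rng.randint(0, 99)


def majority_vote(sample: Callable[[], int], n: int) -> int:
    """Draw n answers and return the most frequent one.

    Ties are broken in favour of the answer that was sampled first.
    """
    votes = Counter(sample() for _ in range(n))
    return votes.most_common(1)[0][0]


def accuracy(n_samples: int, trials: int = 2000, seed: int = 0) -> float:
    """Fraction of trials in which voting over n_samples recovers TRUTH."""
    rng = random.Random(seed)
    hits = sum(
        majority_vote(lambda: noisy_solver(rng), n_samples) == TRUTH
        for _ in range(trials)
    )
    return hits / trials


if __name__ == "__main__":
    for n in (1, 5, 15):
        print(f"samples per question = {n:2d}  accuracy = {accuracy(n):.3f}")
```

Under these toy assumptions a single sample is right less than half the time, while voting over 15 samples is right far more often, at roughly 15 times the generation cost. The gain depends on wrong answers disagreeing with each other: if the solver made the same mistake consistently, voting would not help.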

Why now

The deeper reframing is a shift from "scale the pre-training" to "also scale the thinking at answer time": two complementary axes rather than one. The most visible recent gains have come predominantly from the latter; whether further pre-training scale keeps delivering, and how the two axes interact, is still being worked out. The frontier debate (capability ceilings, the path to AGI, what reasoning models tell us about cognition) has become partly empirical, partly philosophical. The honest position is that the trajectory is uncertain and the range of intellectually defensible forecasts is wide.