Computer Science & AI

Reasoning Models & Inference-Time Compute

Train models to spend tokens thinking before answering. Test-time compute as a scaling axis distinct from pre-training.

Through 2023, frontier-model capability was dominated by pre-training scale — bigger models, more data, more compute up front, then a single forward pass at query time. In September 2024, OpenAI released o1: an LLM trained to spend tokens thinking before answering, generating long internal chains of reasoning that the user mostly does not see. The empirical capability gain on hard mathematical, scientific, and coding problems was substantial enough that within months Anthropic's Claude with extended thinking, Google's Gemini 2.5, and DeepSeek's open-weights R1 had all shipped variants of the same recipe.

The training pattern, broadly: take a strong pre-trained LLM and reward it, typically through reinforcement learning, for producing chains of thought that lead to correct answers on hard problems. The model learns to spend more tokens, sometimes thousands, exploring, backtracking, and checking its own work before committing. This is inference-time compute as a scaling axis, distinct from the pre-training scaling laws (Kaplan et al. 2020; Hoffmann et al. 2022, the Chinchilla paper) that had governed the prior era. Capability gains on benchmarks like GPQA (graduate-level science), AIME (competition mathematics), and Codeforces (competitive programming) have been steep. The mechanism is only partly understood: the chains of thought look like reasoning, but whether they reflect the model's actual computation or are post-hoc rationalisation is an open question, with implications for trust, evaluation, and safety. Emergent capabilities, meaning sudden jumps at scale (Wei et al. 2022; contested by Schaeffer et al. 2023, who argued some emergence is an artifact of metric choice), get a second life under this paradigm: more thinking time produces qualitatively different behaviour on a fixed model.
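To make the training pattern above concrete, here is a minimal sketch of an outcome-based reward: each sampled chain of thought is scored only on whether its final answer is correct, not on how it got there. Everything specific here is an assumption for illustration, not any lab's actual recipe: the `Answer:` marker, the function names, the hand-written sample chains, and the mean-baseline centring (one common variance-reduction choice) are all hypothetical.

```python
import re
from typing import List, Optional


def extract_final_answer(chain_of_thought: str) -> Optional[str]:
    """Return the text after the last 'Answer:' marker, or None if absent.

    The 'Answer:' convention is an assumption made for this sketch.
    """
    matches = re.findall(r"Answer:\s*(.+)", chain_of_thought)
    return matches[-1].strip() if matches else None


def outcome_reward(chain_of_thought: str, reference: str) -> float:
    """Score 1.0 if the final answer matches the reference, else 0.0.

    Only the outcome is graded; the reasoning tokens are not inspected.
    """
    return 1.0 if extract_final_answer(chain_of_thought) == reference else 0.0


def centered_rewards(samples: List[str], reference: str) -> List[float]:
    """Subtract the group's mean reward from each sample's reward.

    Chains that beat the group average get a positive signal, the rest a
    negative one. A policy-gradient update would then raise the probability
    of the positively scored chains (the update itself is not shown).
    """
    rewards = [outcome_reward(s, reference) for s in samples]
    baseline = sum(rewards) / len(rewards)
    return [r - baseline for r in rewards]


# Hand-written stand-ins for chains sampled from a model on "What is 6 * 7?".
samples = [
    "Try 6*7. 6*7 = 42. Check: 42/7 = 6. Answer: 42",
    "6*7 = 48? Answer: 48",
    "Guess 40. Wait, 6*7 = 42, so revise. Answer: 42",
    "I am not sure.",
]

rewards = [outcome_reward(s, "42") for s in samples]
advantages = centered_rewards(samples, "42")
print(rewards)
print(advantages)
```

Note that the third chain is rewarded even though it starts with a wrong guess: because only the outcome is scored, backtracking and self-checking are reinforced whenever they lead to a correct final answer. The same design is also why a rewarded chain need not be a faithful account of the model's computation.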

Why it matters now

Pre-training compute and inference-time compute are now understood as complementary axes. Frontier-model gains in 2024–2025 came predominantly from the latter; whether further pre-training scale continues to deliver, and how the two axes interact, is being actively worked out. The frontier debate (capability ceilings, the path to AGI, what reasoning models tell us about cognition) has become partly empirical, partly philosophical. The honest position is that the trajectory is uncertain and the range of intellectually defensible forecasts is wide.

Polymathic — a curated catalogue of the ideas worth keeping across twelve disciplines. polymathic.app