Large Language Models

Transformer networks trained on text from the public internet and scaled past the point where new capabilities emerge: the technology this cycle turns on.


The brief

A large language model is, at bottom, a system trained to do one narrow thing: predict the next token — the next fragment of text — given everything that came before. Run that simple objective over an enormous corpus of human writing, with a network of sufficient size, and something unexpected happens. The model does not merely learn to complete sentences; it absorbs, as a side effect of getting the next word right, a working command of grammar, fact, style, reasoning, and the conventions of dozens of human languages and formal systems. To predict text well enough, it turns out, a model is pressed to internalise a great deal about the world the text describes. That is the central surprise of the field — the leap from a single statistical objective to broad apparent competence — and the reason a tool built merely to autocomplete became, within a few years, a general instrument for working with language.
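The next-token objective can be seen at toy scale in a minimal sketch. It assumes a bigram counter in place of a neural network, whitespace splitting in place of a real tokenizer, and a made-up one-sentence corpus. A real model conditions on everything that came before, not just the previous token. The function names `train_bigram` and `predict_next` are hypothetical, chosen for illustration.

```python
from collections import Counter, defaultdict


def train_bigram(tokens):
    """Count, for each token, how often each possible next token follows it."""
    counts = defaultdict(Counter)
    for current, nxt in zip(tokens, tokens[1:]):
        counts[current][nxt] += 1
    return counts


def predict_next(counts, token):
    """Return the most frequent follower of `token`, or None if it was never seen."""
    followers = counts.get(token)
    if not followers:
        return None
    return followers.most_common(1)[0][0]


# Hypothetical toy corpus; whitespace splitting stands in for a real tokenizer.
corpus = "the cat sat on the mat and the cat slept".split()
model = train_bigram(corpus)

# "the" is followed by "cat" twice and "mat" once, so "cat" is predicted.
print(predict_next(model, "the"))
```

Even this crude counter picks up a regularity of its corpus purely from being asked what comes next; the large-scale version differs in the size of the context, the corpus, and the model, not in the objective.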

The architecture underneath is the transformer, whose attention mechanism lets every position in a text weigh every other (a separate concept treated in its own right). What makes the result a language model is the training. Text is first broken into tokens by a learned vocabulary, so the model reasons over sub-word fragments rather than raw characters or whole words; this keeps the vocabulary finite while still spelling out any rare term. Pre-training then optimises one loss: predict the next token across trillions of words of text, adjusting billions of parameters until the predictions sharpen. Because almost any knowledge can be posed as a fill-in-the-blank, this single objective quietly forces the model to encode syntax, fact, and inference all at once. None of that knowledge is stored as a lookup table; it is compressed into the weights themselves, which is why a model can recombine what it has read into sentences that never appeared in its corpus.

A model trained this way is fluent but unsteered: as happy to continue a question as to answer it. A second stage, alignment, makes it useful. Instruction tuning teaches it to treat text as a request to be fulfilled; reinforcement learning from human feedback (RLHF) then nudges its outputs toward responses people judge helpful and honest, trading a little raw fluency for steerability.

The most striking property appears only with scale: capabilities absent in small models, such as multi-step reasoning, translation, and rudimentary arithmetic, emerge as size and data grow, often without being trained for directly, so that quantitative growth tips over into qualitative change. Related is in-context learning: shown a few examples inside the prompt itself, a model can perform a task it was never explicitly fine-tuned to do, learning within the conversation rather than from any update to its weights.
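The single pre-training loss described above is conventionally a cross-entropy: the average negative log-probability the model assigns to each true next token. The sketch below computes it in plain Python under stated assumptions: a hypothetical three-token vocabulary, hand-written scores (logits) standing in for a network's output, and illustrative function names `softmax` and `next_token_loss`.

```python
import math


def softmax(logits):
    """Turn raw scores into a probability distribution over the vocabulary."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def next_token_loss(logits_per_position, target_ids):
    """Average negative log-probability assigned to each true next token."""
    losses = []
    for logits, target in zip(logits_per_position, target_ids):
        probs = softmax(logits)
        losses.append(-math.log(probs[target]))
    return sum(losses) / len(losses)


# Hypothetical setup: a 3-token vocabulary and two positions to predict.
targets = [0, 1]                                # index of the true next token
uniform = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]    # a model that knows nothing
confident = [[5.0, 0.0, 0.0], [0.0, 5.0, 0.0]]  # a model favouring the truth

loss_uniform = next_token_loss(uniform, targets)
loss_confident = next_token_loss(confident, targets)

print(loss_uniform)    # equals log(3), the loss of guessing uniformly
print(loss_confident)  # smaller, because probability mass sits on the true tokens
```

Training adjusts the parameters to push this number down; "predictions sharpen" corresponds to the move from the uniform case toward the confident one.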

Why now

The honest limits matter as much as the capabilities. A language model has no inherent grounding in truth: it is optimised to produce plausible continuations, and plausibility and accuracy only partly overlap. When the two diverge, the model will state falsehoods with the same confident fluency it brings to facts, a failure mode called hallucination or confabulation. Nothing in the next-token objective rewards saying "I don't know", so the model rarely volunteers it. Output is also acutely sensitive to the prompt: small changes in wording can swing the answer, so phrasing the request well has itself become a skill. None of this is a passing defect to be patched away; it follows from what the system is: a model of how text tends to continue, not a model of the world. Used with that understanding, large language models are a genuinely new kind of tool. Mistaken for oracles, they mislead in exactly the fluent, confident voice that makes them persuasive.
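The gap between plausibility and accuracy can be made concrete with a small sketch. The probabilities below are invented for illustration and are not measured from any real model; the prompt and the helper `greedy_pick` are likewise hypothetical. The point is only that a decoder choosing the most probable continuation has no separate check on whether that continuation is true.

```python
def greedy_pick(next_token_probs):
    """Return the continuation with the highest probability."""
    return max(next_token_probs, key=next_token_probs.get)


# Invented probabilities for illustration only; not taken from any real model.
# Hypothetical prompt: "The capital of Australia is"
next_token_probs = {
    "Sydney": 0.55,        # plausible continuation, but wrong
    "Canberra": 0.40,      # the correct answer
    "I don't know": 0.05,  # rarely the most likely continuation
}

choice = greedy_pick(next_token_probs)
print(choice)  # the most probable continuation, selected without any truth check
```

Under these assumed numbers the selected answer is the wrong one, delivered by exactly the same mechanism that would deliver a right one.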