Data Structures

Hash tables, trees, graphs, heaps: a handful of shapes, and the choice between them decides what's fast.

The brief

In 1953, an IBM researcher named Hans Peter Luhn had a deceptively simple idea. Instead of hunting through a list to find something, why not compute where it should be from the thing itself? A function turns the key straight into an address, so a lookup that would have meant scanning a million entries takes, on average, about one step. Luhn's hash table is arguably the invention that most clearly separates fluent programming from clumsy programming. It belongs to a small family of organizing shapes (lists, trees, graphs, heaps, stacks, queues) that a programmer chooses among before writing a line of logic. As Niklaus Wirth put it in a book title that became a slogan, Algorithms + Data Structures = Programs: the algorithm is what to do, the structure is what to do it on.
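To make the idea of computing an address from the key concrete, here is a minimal sketch of a hash table that resolves collisions by chaining. It is an illustration only, not Luhn's original design: the class name `ChainedHashTable`, the fixed bucket count, and the reliance on Python's built-in `hash()` are all assumptions made for the example.

```python
class ChainedHashTable:
    """Toy hash table: the key itself decides which bucket to look in."""

    def __init__(self, bucket_count=8):
        # Each bucket is a small list of (key, value) pairs.
        self.buckets = [[] for _ in range(bucket_count)]

    def _index(self, key):
        # Turn the key into an "address": a bucket number.
        return hash(key) % len(self.buckets)

    def put(self, key, value):
        bucket = self.buckets[self._index(key)]
        for i, (existing_key, _) in enumerate(bucket):
            if existing_key == key:
                bucket[i] = (key, value)  # overwrite an existing entry
                return
        bucket.append((key, value))

    def get(self, key, default=None):
        # Only one bucket is scanned, never the whole table.
        for existing_key, value in self.buckets[self._index(key)]:
            if existing_key == key:
                return value
        return default


table = ChainedHashTable()
table.put("luhn", 1953)
table.put("wirth", 1976)
print(table.get("luhn"), table.get("knuth", "not found"))
```

The sketch never resizes, so lookups stay close to a single step only while the buckets stay short. Production hash tables grow the bucket array as entries accumulate.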

The shapes are few because each answers one question well and pays for it somewhere else. An array gives instant access to any element by its position but is slow to grow in the middle; a linked list is the reverse, cheap to splice into but slow to index. A hash table finds things in a single average step but scatters them across memory and keeps them in no useful order. A balanced search tree keeps its data sorted, so that a search discards at least half of what remains at every step, which is why so many database indexes and filesystems are trees underneath. A graph models anything defined by its connections: a social network, the web, a tangle of software dependencies. The lesson beneath all of them is that how the data is laid out matters more than almost any other decision a programmer makes. The same task can run a thousand times faster or slower depending only on the structure holding the data, and when a program crawls, the cure is usually not a cleverer algorithm but a better-shaped container. Two forces govern the choice. One is the old trade of speed against memory: a hash table is fast but hungry, while a leaner structure such as a Bloom filter buys back space by tolerating the occasional wrong answer. The other is the hardware itself: a modern processor reads memory in a straight line on the order of a hundred times faster than it jumps around, so a plain array will often beat a theoretically superior structure riddled with scattered pointers, whatever the abstract analysis promises.
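The claim that sorted data lets a search halve the problem at every step can be checked by counting. The sketch below uses binary search over a sorted Python list as a stand-in for a balanced search tree and compares it with a front-to-back scan. The function names and the one-million-element list are assumptions chosen for illustration.

```python
def linear_search_steps(items, target):
    """Scan from the front; return how many elements were examined."""
    steps = 0
    for item in items:
        steps += 1
        if item == target:
            break
    return steps


def binary_search_steps(sorted_items, target):
    """Halve the search range each time; return how many elements were examined."""
    low, high = 0, len(sorted_items) - 1
    steps = 0
    while low <= high:
        mid = (low + high) // 2
        steps += 1
        if sorted_items[mid] == target:
            break
        if sorted_items[mid] < target:
            low = mid + 1   # target can only be in the upper half
        else:
            high = mid - 1  # target can only be in the lower half
    return steps


numbers = list(range(1_000_000))  # already sorted
target = 999_999                  # worst case for the front-to-back scan

linear_steps = linear_search_steps(numbers, target)
binary_steps = binary_search_steps(numbers, target)
# The scan examines all 1,000,000 elements; halving needs at most 20.
print(linear_steps, binary_steps)
```

The gap comes entirely from the ordering of the data, since both functions are handed the same list. Step counts ignore the memory-layout effects described above, so measured wall-clock ratios will differ.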

Why now

These shapes are everywhere in working code: the hash table behind the built-in dictionary of most languages, the balanced tree behind most database indexes, the content-addressed store behind Git. The live frontier is search in very high dimensions, where the meaning of an image or a sentence is encoded as a long list of numbers and ordinary structures break down because everything ends up roughly equidistant from everything else. The specialized indexes that make this tractable, the machinery inside vector databases, now sit at the heart of many systems that let a language model look up what it needs to know. But the field's most durable gift is humbler than any single structure: it is the vocabulary itself, and the habit of asking, before anything else, what shape the data wants to take.
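The remark that in high dimensions everything ends up roughly equidistant can be seen in a small experiment. This sketch draws uniformly random points (an assumption; real embeddings are not uniform) and measures how much farther the farthest point is than the nearest one, relative to the nearest. The function name `relative_contrast`, the point count, and the seed are illustrative choices.

```python
import math
import random


def relative_contrast(dimensions, point_count=200, seed=0):
    """Return (farthest - nearest) / nearest distance from a random query point."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dimensions)]
    distances = []
    for _ in range(point_count):
        point = [rng.random() for _ in range(dimensions)]
        distances.append(math.dist(query, point))
    nearest, farthest = min(distances), max(distances)
    return (farthest - nearest) / nearest


low_dim_contrast = relative_contrast(2)
high_dim_contrast = relative_contrast(1000)
# In 2 dimensions the farthest point is many times farther than the nearest;
# in 1000 dimensions the two distances differ by only a small fraction.
print(low_dim_contrast, high_dim_contrast)
```

When the contrast shrinks toward zero, "nearest neighbor" stops being a sharp notion, which is why structures that prune by distance lose their advantage and approximate indexes take over.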