PolymathicAll ideas →
Computer Science & AI

Data Structures

Hash tables, trees, graphs, heaps: a handful of shapes, and the choice between them decides what's fast.

In 1953, Hans Peter Luhn, a researcher at IBM, invented the hash table — the data structure that, more than any other, distinguishes modern programming from naive programming. The trick was simple to describe and had been hinted at by mathematicians for decades: instead of searching through a list to find a key, compute the location from the key itself using a function (the hash function) that maps keys to array indices, making lookup, insertion, and deletion all average-case O(1) instead of O(n). Hash tables, lists, trees, graphs, heaps, queues, stacks — the small set of organizing shapes — are the primary architectural decision a programmer makes when sitting down to solve a problem, and Niklaus Wirth's 1976 textbook title — Algorithms + Data Structures = Programs — captured the relationship: the algorithm is what to do, the data structure is what to do it on, and the combination is the program.

The major families are surprisingly few. Linear structures — arrays (O(1) random access, O(n) middle insert) and linked lists (O(1) head insert, O(n) random access) — are the simplest, with the choice governed by access pattern, and hash tables (Luhn's contribution) give average O(1) on lookup, insert, and delete but degrade to O(n) when adversarial input causes collisions (the 2011 hash-flooding attacks exploited this; the fix was randomized hash functions). Trees span the universe of comparison-organized data: binary search trees (O(log n) balanced, O(n) not), self-balancing variants (red-black, AVL, B-tree, B+ tree) underlie nearly every relational database (Postgres, MySQL, SQLite) and many file systems, tries power autocomplete, and heaps implement priority queues and Dijkstra's algorithm. Graphs (adjacency-list for sparse, adjacency-matrix for dense) are the substrate of social networks, the internet, and dependency graphs, while specialized structures fill specific niches — Bloom filters probabilistically test set membership with a tunable false-positive rate, skip lists are the probabilistic alternative to balanced trees (used in Redis), and persistent data structures underlie functional programming and Git's content-addressed storage. The deeper insight is that data-structure choice determines performance more than any other single decision: a bad choice can make a system 1000× slower than necessary, and the fix is rarely to optimize the algorithm but to change the structure. The recurring tradeoffs are time-versus-space (hash tables are fast but memory-heavy; Bloom filters save memory but admit false positives), read-optimized versus write-optimized (B-trees favour reads, while log-structured merge trees in LevelDB, RocksDB, Cassandra, and BigTable favour high write throughput), and cache locality — modern CPUs are roughly 100× faster at sequential memory access than random, so arrays and vectors outperform pointer-chasing structures by enormous margins even when the Big-O analysis says otherwise.

Why it matters now

Modern production code uses these structures everywhere: hash tables in nearly every language's standard library (`std::unordered_map`, `HashMap`, `dict`, `Map`), B-trees and LSM-trees in every major database, persistent data structures in functional languages and Git, tries in autocomplete and IP routing tables, Bloom filters in network protocols and distributed databases. Cache-friendly variants (Google's flat hash maps and Robin Hood hashing) appear in performance-critical code. The current frontier is vector databases (Pinecone, Weaviate, FAISS, Milvus, pgvector) for high-dimensional nearest-neighbour search where traditional structures fail because of the curse of dimensionality; the data structures behind them — HNSW, LSH, product-quantization indices — are an active research area at the heart of every retrieval-augmented LLM system. The working programmer's main intuition is exactly the fit between data layout and access pattern, and the discipline's most durable contribution to engineering practice is the vocabulary itself.

Read it in Polymathic →Browse the catalogue
Polymathic — a curated catalogue of the ideas worth keeping across twelve disciplines. polymathic.app