PolymathicAll ideas →
Computer Science & AI

Sorting & Searching

log₂ of a trillion is forty — the reason sorted data structures are everywhere.

Tony Hoare, a 26-year-old British graduate student, invented Quicksort in 1960 while waiting for a Russian-to-English translation system to compile. The original problem was sorting Russian dictionary words. Hoare's algorithm — pick a pivot, partition the array around it, recurse on each side — is O(n log n) on average and remains, sixty-five years later, the most-used general-purpose sorting algorithm in production code. The same period produced Mergesort (von Neumann, 1945), Heapsort (Williams, 1964), and Binary Search (correctly implemented in published code only in 1962, after several off-by-one errors). Donald Knuth devoted an entire volume of The Art of Computer Programming (Volume 3, 1973) to these algorithms; a 1960s reference book on sorting would still serve a working programmer today.

Sorting is the canonical place where Big-O acquires a concrete handle. Naïve sorting — bubble, selection, insertion — is O(n²). For a million items, that's 10^12 operations, hours of work. Comparison-based sorting has a proved lower bound of O(n log n): you cannot do better than n log n comparisons in the worst case, because each comparison conveys at most one bit of information about the permutation, and there are n! permutations to distinguish. Mergesort and Heapsort achieve this bound in the worst case; Quicksort achieves it on average (O(n²) worst case if the pivot choice is bad — production implementations randomize the pivot). For a million items, n log n is ~20 million operations — a fraction of a second. Non-comparison sorts (counting sort, radix sort) can do better than O(n log n) by exploiting structure in the data. Searching: linear search is O(n); binary search on a sorted array is O(log n). The key insight is that log n grows extraordinarily slowly — log₂ of a trillion is only 40. This is why sorted data structures are everywhere: the cost of maintaining sorted order is far cheaper than repeatedly linear-searching unsorted data. Hash tables achieve O(1) average-case search by giving up sorted-order properties. The choice between sorted-array-and-binary-search vs. hash-table is one of the most-recurring performance decisions in programming. Standard variations: external sorting (data too large to fit in RAM), parallel sorting (across thousands of machines), stable vs. unstable sorts (whether equal-key items preserve order), and adaptive sorts (faster on partially-sorted input — Timsort is the Python and Java default).

Why it matters now

Modern production-language sort defaults are usually Timsort (Python's list.sort() and Java's Arrays.sort()), introduced by Tim Peters in 2002. Timsort is a hybrid of mergesort and insertion sort that exploits naturally-occurring runs; it is faster than pure quicksort on real-world inputs and stable (Quicksort is not). The right production sort is the result of forty years of practical refinement, not the algorithm that comes first in textbooks. Database query planners decide whether to use sort-then-merge, hash-join, or index-lookup. Search engines are at heart enormous inverted-index problems; the data structures behind them are descendants of Knuth's TAOCP Volume 3. Vector search — approximate nearest-neighbour search in high-dimensional embedding spaces (HNSW, LSH, IVF indices) — is the modern frontier, the data-structure foundation of every retrieval-augmented LLM system.

Read it in Polymathic →Browse the catalogue
Polymathic — a curated catalogue of the ideas worth keeping across twelve disciplines. polymathic.app