Databases & The Relational Model

Codd's move: data as tables, queries as algebra; the rest is engineering.

The brief

Edgar F. Codd, a British computer scientist at IBM Research in San Jose, had by 1969 spent years watching application programmers fight pointer navigation through hierarchical and network databases (IBM's IMS and the CODASYL standard), where a change in physical storage could silently break working code. His response was an eleven-page paper in Communications of the ACM in June 1970, A Relational Model of Data for Large Shared Data Banks. It proposed storing data as relations (sets of tuples, presented as tables of rows and columns), writing queries declaratively, and leaving the choice of access path to the engine, a job later taken on by the query optimizer. Logical structure was decoupled from physical storage. IBM was initially unenthusiastic; Codd pushed the work through against institutional resistance and received the 1981 Turing Award.

The relational model rests on a small set of abstractions. A relation is a set of tuples conforming to a schema; a primary key uniquely identifies each row, and a foreign key references a key (usually the primary key) of another table. The relational algebra (selection, projection, join, union, difference, intersection, Cartesian product, rename) is the formal core. Codd's normal forms (first through Boyce-Codd) reduce data redundancy and the update anomalies it causes.

SQL, developed at IBM by Chamberlin and Boyce in the mid-1970s and standardized by ANSI in 1986, is the dominant declarative query language across relational databases. The query optimizer, which converts SQL into an execution plan using table statistics, cardinality estimation, and join-order enumeration, is one of the deepest pieces of practical computer science.

ACID transactions (Atomicity, Consistency, Isolation, Durability) are the correctness guarantees: a transaction commits or rolls back in full, leaves the database satisfying its declared constraints, behaves as if it ran serially with respect to concurrent transactions, and once committed survives crashes. Jim Gray's 1981 paper The Transaction Concept laid out the framework (the ACID acronym itself was coined by Härder and Reuter in 1983); Gray won the 1998 Turing Award.

The CAP theorem (Eric Brewer's 2000 conjecture, Gilbert and Lynch's 2002 proof) showed that a distributed system can guarantee at most two of Consistency, Availability, and Partition tolerance; since network partitions cannot be ruled out, real systems choose between C and A while a partition lasts. The NoSQL movement of the late 2000s and early 2010s (MongoDB, Redis, Cassandra, DynamoDB) arose for web-scale workloads where full ACID guarantees were judged too costly; the field has since substantially returned to SQL through NewSQL systems (Spanner, CockroachDB) that combine horizontal scalability with ACID transactions.
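
The abstractions above can be made concrete with a minimal sketch, here using Python's standard sqlite3 module and an in-memory database. The schema (account, transfer), the owners, and the helper names (build_demo_db, transfer, transfers_by_owner, balances) are hypothetical, chosen only for illustration. The sketch shows a primary key, two foreign keys, a declarative join, and atomicity: a transfer that would violate a constraint is rolled back in full, including the credit that had already been applied.

```python
import sqlite3


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with two related tables (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    # SQLite only enforces foreign keys when this pragma is switched on.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript("""
        CREATE TABLE account (
            id      INTEGER PRIMARY KEY,            -- primary key: identifies each row
            owner   TEXT    NOT NULL,
            balance INTEGER NOT NULL CHECK (balance >= 0)
        );
        CREATE TABLE transfer (
            id     INTEGER PRIMARY KEY,
            src    INTEGER NOT NULL REFERENCES account(id),  -- foreign key
            dst    INTEGER NOT NULL REFERENCES account(id),  -- foreign key
            amount INTEGER NOT NULL
        );
        INSERT INTO account (id, owner, balance) VALUES (1, 'ada', 100), (2, 'bob', 20);
    """)
    conn.commit()
    return conn


def transfer(conn: sqlite3.Connection, src: int, dst: int, amount: int) -> bool:
    """Move `amount` from src to dst as one transaction; return True if it committed."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes the block
            # Credit first, so a failed debit has something visible to undo.
            conn.execute("UPDATE account SET balance = balance + ? WHERE id = ?", (amount, dst))
            conn.execute("UPDATE account SET balance = balance - ? WHERE id = ?", (amount, src))
            conn.execute(
                "INSERT INTO transfer (src, dst, amount) VALUES (?, ?, ?)",
                (src, dst, amount),
            )
        return True
    except sqlite3.IntegrityError:
        # A CHECK or FOREIGN KEY violation: every change in the block was rolled back.
        return False


def transfers_by_owner(conn: sqlite3.Connection) -> list:
    """Declarative join: state which rows are wanted, not how to locate them."""
    return conn.execute("""
        SELECT s.owner, d.owner, t.amount
        FROM transfer AS t
        JOIN account AS s ON s.id = t.src
        JOIN account AS d ON d.id = t.dst
        ORDER BY t.id
    """).fetchall()


def balances(conn: sqlite3.Connection) -> dict:
    """Return {owner: balance} for every account."""
    return dict(conn.execute("SELECT owner, balance FROM account ORDER BY id"))


conn = build_demo_db()
print(transfer(conn, 1, 2, 30))    # within ada's balance, so it commits
print(transfer(conn, 2, 1, 500))   # bob's debit would go negative, so it rolls back
print(balances(conn))              # the rolled-back credit to ada is not visible
print(transfers_by_owner(conn))    # only the committed transfer is recorded
```

The join never says which table to scan first or which index to use; that choice is left to SQLite's planner. The sketch covers atomicity and constraint checking only; isolation between concurrent connections and durability on disk are not exercised by a single in-memory connection.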

Why now

Most application data globally is stored in relational databases; the SQL market is estimated at roughly $60 billion annually by 2024, with PostgreSQL the default open-source choice and Oracle, SQL Server, and Snowflake dominating the commercial high end. Analytical workloads run on columnar OLAP systems (Snowflake, BigQuery, ClickHouse, DuckDB) queried via SQL, most of them on a distributed engine, with DuckDB running in-process. Vector stores (Pinecone, Milvus, and the pgvector extension for PostgreSQL) came to prominence after 2021 to hold the high-dimensional embeddings on which retrieval-augmented generation runs. SQLite, Richard Hipp's embedded library first released in 2000, is by some measures the most-deployed software in history and ships with every iOS and Android device. Natural-language-to-SQL is starting to raise the question of whether SQL syntax will remain the lingua franca, but the relational model itself remains where the engineering depth lives.

Further reading

A Relational Model of Data for Large Shared Data Banks (Codd, CACM 1970). Designing Data-Intensive Applications (Kleppmann, 2017). Database System Concepts (Silberschatz et al., 7th ed.).