PolymathicAll ideas →
Computer Science & AI

Compilers & Interpreters

How text becomes a running program — lex, parse, walk. Every programmer relies on it; few look inside.

In 1957, an IBM team led by John Backus released FORTRAN — the first high-level programming language with a serious compiler. The team had spent three years on it, and the output was competitive with hand-written assembly code, which most contemporary programmers had assumed was impossible. The breakthrough redefined what programming could be: instead of writing machine instructions you wrote mathematical formulas, and the compiler translated. Within a decade dozens of languages followed and the theoretical foundation — formal grammars (after Chomsky 1956), parsing algorithms, type systems, optimization theory — was being built in parallel; Aho and Ullman's Dragon Book of 1986 fixed the canonical pedagogy. Sixty-eight years later, the same pipeline — lex, parse, type-check, optimize, generate code — is still what every compiler does.

Every modern compiler is built around essentially the same six-stage pipeline. Lexical analysis (driven by regular expressions and a finite automaton) reads the source text and produces a stream of tokens, parsing turns that stream into an Abstract Syntax Tree using a context-free grammar — typically a top-down recursive-descent parser (Clang and rustc's approach) or a bottom-up LR parser (yacc, bison) — and semantic analysis walks the tree to build a symbol table and check consistency. Modern compilers extend semantic analysis with type inference: the Hindley-Milner algorithm of 1978 was the foundation, still visible in ML, Haskell, and OCaml, and in modulated form in Rust, Scala, and TypeScript. Intermediate representation — typically three-address code or static single-assignment (SSA) form — makes the program easier to optimize, and LLVM IR has become the dominant modern target, so languages compile to LLVM IR and inherit its optimization passes for free. Optimization runs dozens of passes (constant folding, dead-code elimination, common-subexpression elimination, loop unrolling, function inlining, register allocation) before code generation emits machine code for x86-64, ARM, or RISC-V — or a bytecode (JVM, .NET CLR, WebAssembly) that a virtual machine later interprets or JITs. The compiler-versus-interpreter distinction has blurred: most so-called interpreted languages (Python, Ruby, JavaScript) actually compile to bytecode and run it on a VM, while just-in-time compilation (Java's HotSpot, V8, the new CPython JIT) interprets first and then compiles hot paths to machine code at runtime, often within a factor of two of ahead-of-time builds. The type system is the lever that decides what bugs the compiler can catch before the program runs: static systems (Java, C++, Rust, TypeScript, Haskell) catch them at compile time; dynamic systems (untyped Python, Ruby, JavaScript) catch them at runtime; dependent types (Coq, Agda, Lean) let types depend on values and prove properties like this list is non-empty; affine systems (Rust's borrow checker) prevent resource-handling bugs by construction.

Why it matters now

LLVM, started by Chris Lattner in 2000 as a graduate-school project, is now the dominant open-source compiler infrastructure — Clang, rustc, Swift, Julia, and dozens of others all target it — while GCC still anchors Linux-kernel development and V8 drives the browser and Node ecosystem. WebAssembly (2017) lets compiled C, C++, Rust, and Go run at near-native speed inside browsers and server runtimes, one of the more consequential systems-software developments of the last decade. The compiler stack has also become the substrate of the AI revolution: XLA, MLIR, TVM, TensorRT, and NVIDIA's CUDA turn high-level Python ML code into optimized GPU kernels, and frontier-LLM performance depends critically on how well those compilers do their job. AI-assisted programming (Copilot, Cursor, Claude Code) sits on top of all this, generating source that the traditional compiler then turns into machine code — and whether the LLM eventually displaces the high-level-language step is an open research question.

Read it in Polymathic →Browse the catalogue
Polymathic — a curated catalogue of the ideas worth keeping across twelve disciplines. polymathic.app