Markov Chains

Memoryless processes — the future depends on the present and not on how you got there.

Suggested next → Statistical Mechanics · PHYS · T5

The brief

In 1906, the Russian mathematician Andrei Markov — already famous for his quarrels with the Russian Orthodox Church and the Tsarist government — published a paper that picked a fight of a different kind. Most of probability theory at that point assumed independence: that successive trials were unconnected, like repeated coin flips. Markov wanted to prove the law of large numbers without the independence assumption, and he proposed the right relaxation: let each trial depend on the immediately preceding one but on nothing further back. The application he is most remembered for came a few years later: in 1913 he tallied the alternation of vowels and consonants in Pushkin's Eugene Onegin, and the dependent-trial machinery worked. The structure he uncovered, now called a Markov chain, turned out to describe a startling number of natural and computational processes.

A Markov chain is a sequence of random variables X₁, X₂, X₃, … taking values in a state space S, with the Markov property: the conditional distribution of the next state depends only on the current state, not on the history. Formally, P(X_{n+1} = j | X_n = i, X_{n−1}, X_{n−2}, …) = P(X_{n+1} = j | X_n = i). The full description of a finite Markov chain is its transition matrix P, where P_ij is the probability of moving from state i to state j. Each row of P sums to 1 — it is a stochastic matrix. The probability distribution after n steps is given by repeated multiplication: π_n = π₀·Pⁿ, which means the long-run behavior of a Markov chain is governed by the powers of P, and therefore by its eigenvalues. Under mild conditions — irreducibility (every state reachable from every other) and aperiodicity — there is a unique stationary distribution π satisfying πP = π, the left eigenvector of P with eigenvalue 1, and every starting distribution converges to π as n → ∞. The convergence rate is set by the second-largest eigenvalue of P. Generalizations multiply: continuous-time Markov processes, Markov decision processes (Markov chains plus actions and rewards, the foundation of reinforcement learning), hidden Markov models (chains where the states are hidden and only probabilistically-related outputs are observed), Markov random fields (chains generalized to graph structures).

Why nowPageRank — the algorithm that founded Google — models web surfing as a Markov chain on the link graph and ranks pages by the stationary distribution. MCMC (Markov Chain Monte Carlo) — Metropolis-Hastings, Gibbs sampling, the Hamiltonian and No-U-Turn samplers — is the workhorse computational technique of modern Bayesian inference, used everywhere from cosmology to drug development. Speech recognition and part-of-speech tagging historically used hidden Markov models. Queuing theory in operations research is built on Markov chains. Reinforcement learning — every Atari-playing AI, AlphaGo, AlphaStar — operates over Markov decision processes. Asset pricing models in finance use Markov-structured noise. The little structural assumption Markov made in 1906 is now one of the most reused models in applied science.

Further readingFor a clean introduction with applications, Norris's Markov Chains (1997) is concise and excellent. The classical reference is Kemeny and Snell's Finite Markov Chains (1960). For the modern MCMC use case, Robert and Casella's Monte Carlo Statistical Methods (2004) is the standard. Diaconis's papers on card shuffling — and his book chapter in Group Representations in Probability and Statistics — are where the deepest intuition lives.