In 1906, the Russian mathematician Andrei Markov — already famous for his quarrels with the Russian Orthodox Church and the Tsarist government — published a paper that picked a fight of a different kind. Most of probability theory at that point assumed independence: that successive trials were unconnected, like repeated coin flips. Markov wanted to prove the law of large numbers without the independence assumption, and he proposed the right relaxation: let each trial depend on the immediately preceding one but on nothing further back. The first application he chose was almost mischievous — the distribution of vowels and consonants in Pushkin's Eugene Onegin — and it worked. The structure he uncovered, now called a Markov chain, turned out to describe a startling number of natural and computational processes.
A Markov chain is a sequence of random variables X₁, X₂, X₃, … taking values in a state space S, with the Markov property: the conditional distribution of the next state depends only on the current state, not on the history. Formally, P(X_{n+1} = j | X_n = i, X_{n−1}, X_{n−2}, …) = P(X_{n+1} = j | X_n = i). The full description of a finite Markov chain is its transition matrix P, where P_ij is the probability of moving from state i to state j. Each row of P sums to 1 — it is a stochastic matrix. The probability distribution after n steps is given by repeated multiplication: π_n = π₀·Pⁿ, which means the long-run behavior of a Markov chain is governed by the powers of P, and therefore by its eigenvalues. Under mild conditions — irreducibility (every state reachable from every other) and aperiodicity — there is a unique stationary distribution π satisfying πP = π, the left eigenvector of P with eigenvalue 1, and every starting distribution converges to π as n → ∞. The convergence rate is set by the second-largest eigenvalue of P. Generalizations multiply: continuous-time Markov processes, Markov decision processes (Markov chains plus actions and rewards, the foundation of reinforcement learning), hidden Markov models (chains where the states are hidden and only probabilistically-related outputs are observed), Markov random fields (chains generalized to graph structures).
PageRank — the algorithm that founded Google — models web surfing as a Markov chain on the link graph and ranks pages by the stationary distribution. MCMC (Markov Chain Monte Carlo) — Metropolis-Hastings, Gibbs sampling, the Hamiltonian and No-U-Turn samplers — is the workhorse computational technique of modern Bayesian inference, used everywhere from cosmology to drug development. Speech recognition and part-of-speech tagging historically used hidden Markov models. Queuing theory in operations research is built on Markov chains. Reinforcement learning — every Atari-playing AI, AlphaGo, AlphaStar — operates over Markov decision processes. Asset pricing models in finance use Markov-structured noise. The little structural assumption Markov made in 1906 is now one of the most reused models in applied science.