Shannon Entropy

How much surprise a distribution carries — the bits-per-symbol that no compression scheme can beat.

Suggested next → Public-Key Cryptography · CS·AI

The brief

In July 1948, the Bell System Technical Journal published a long paper titled A Mathematical Theory of Communication by a thirty-two-year-old engineer named Claude Shannon. The paper was technical, dense, and instantly recognized as one of the most important publications in twentieth-century science. Shannon was working on a question that, until he asked it precisely, nobody had known how to answer: what is information, mathematically? How few bits do you need to send a message reliably across a noisy channel? His answer founded the entire discipline of information theory, gave the world the modern definition of a bit, and unified — to the surprise of everyone working at the boundary — information with thermodynamic entropy.

Take a discrete random variable X with possible values x₁, x₂, …, xₙ and probabilities p₁, p₂, …, pₙ. Shannon defined the entropy of X as H(X) = -Σ pᵢ · log₂ pᵢ. The number, measured in bits, captures the average uncertainty in X — equivalently, the average number of yes/no questions needed to determine X's value optimally. The properties match intuition: H is zero if X is deterministic (no uncertainty); H is maximized by the uniform distribution (most uncertainty for a given alphabet); H is additive for independent variables (the uncertainty in a pair is the sum of uncertainties). Shannon's source coding theorem — the central theoretical result of compression — says that any lossless code for X requires at least H(X) bits per symbol on average, and codes approaching the limit exist (Huffman coding, arithmetic coding). His noisy-channel coding theorem says that every channel has a maximum reliable transmission rate, the channel capacity C, and that for any rate below C there exist codes (block codes, eventually turbo codes and LDPC codes) that achieve arbitrarily small error. The unification with thermodynamics is exact: Boltzmann's S = k_B · ln W and Shannon's H = -Σ p log p are the same quantity, differing only by units. Maxwell's demon — the nineteenth-century thought experiment about a being that violates the Second Law by sorting molecules — was finally resolved through Shannon's framework: the demon's information processing has an entropic cost that exactly compensates the entropy decrease it achieves. Landauer's principle (1961) made this concrete: erasing one bit of information costs at least k_B·T·ln 2 of energy, a thermodynamic floor on computation that modern chip designers are starting to bump against.

Why nowEvery compression algorithm — gzip, JPEG, MP3, H.265, the LLM tokenization that runs every modern AI — is engineered to approach Shannon's source-coding limit. Every communication system — Wi-Fi, 5G, GPS, deep-space probes, fiber optics — is designed against Shannon's channel-capacity bound, and modern codes (LDPC, polar codes) achieve it within fractions of a decibel. Cross-entropy loss is the standard objective function for training classifiers and language models — Shannon's H, dressed up. Genetic information is measured in Shannon bits. Neuroscience uses entropy to characterize the information capacity of neural codes. The 1948 paper is, by some measures, the most-cited mathematical paper of the twentieth century, and the conceptual move — that information has a precise quantitative meaning — is one of the foundations on which all of digital civilization rests.

Further readingCover and Thomas's Elements of Information Theory (2nd ed., 2006) is the standard reference, well-written enough to read straight through. For the original story, Shannon's 1948 paper A Mathematical Theory of Communication is short, lucid, and foundational. James Gleick's The Information (2011) is the popular history. MacKay's Information Theory, Inference, and Learning Algorithms (2003) is the cross-disciplinary treatment that bridges to machine learning and is freely available online.