Hypothesis Testing

A null world, a p-value, and the bargain that you'll reject a true null 5% of the time on purpose.

The brief

Around 1922, at the Rothamsted Experimental Station in England, a colleague of Ronald Fisher's claimed she could tell, by taste, whether milk had been added to a teacup before or after the tea. Fisher — who would shortly become the most influential statistician of the twentieth century — designed the experiment to settle it. Eight cups: four with milk first, four with tea first, presented in random order. He asked: if she had no real ability and were guessing, how often would she correctly identify all eight? The answer is once in seventy times — a p-value of about 0.014. She got all eight right. The episode became the lady tasting tea, and the framework Fisher built around it — null-hypothesis significance testing — became the standard methodology of empirical science.
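
The "once in seventy" figure can be checked with a few lines of Python. This is a minimal sketch, assuming the taster knows that exactly four of the eight cups are milk-first, so that a pure guess amounts to choosing one of the equally likely sets of four cups; the function name tea_tasting_p_value is a hypothetical label, not something from Fisher's account.

```python
from math import comb


def tea_tasting_p_value(cups: int = 8, milk_first: int = 4) -> float:
    """Chance of labelling every cup correctly by guessing alone.

    Assumes the guesser picks one of the comb(cups, milk_first) equally
    likely sets of cups to call "milk first"; exactly one set is right.
    """
    return 1 / comb(cups, milk_first)


ways = comb(8, 4)          # number of ways to choose 4 cups out of 8
p = tea_tasting_p_value()  # 1 / 70
print(f"{ways} possible selections, p = {p:.4f}")
# prints: 70 possible selections, p = 0.0143
```

Because a perfect score is the most extreme possible outcome, the probability of that single outcome is already the full p-value; nothing more extreme needs to be added.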

A hypothesis test is a procedure for deciding, from data, whether to reject a stated null hypothesis H₀ (typically: "no effect," "no difference," "the coin is fair"). The recipe: choose a test statistic whose distribution under H₀ is known; collect data and compute the statistic; compute the p-value, the probability, if H₀ were true, of observing a statistic at least as extreme as the one you saw; if p < α (a significance level chosen in advance, conventionally 0.05), reject H₀. Otherwise, fail to reject, which is not the same as accepting H₀. The framework was systematized by Fisher in the 1920s and refined by Jerzy Neyman and Egon Pearson into the formal decision-theoretic version that introduces alternative hypotheses and power.

Type I error: rejecting a true H₀ (false positive); the rate is α. Type II error: failing to reject a false H₀ (false negative); the rate is β. Power = 1 − β. Multiple testing (running many tests at once) inflates the chance of at least one false positive; remedies include the Bonferroni correction and false discovery rate control.

The p-value, the most-used and most-misunderstood number in science, is not the probability that H₀ is true; it is the probability, given H₀, of data at least as extreme as what was observed, which is a different and frequently confused thing.

Criticism of the framework is long-standing and has intensified since the 2010s. p-hacking (running many analyses and reporting only the significant ones) is widely blamed for much of the replication crisis in psychology and biomedicine; estimated reproducibility rates in some subfields are below 50%. Responses include pre-registration of analyses, larger samples, abandoning the strict 0.05 threshold, and Bayesian alternatives that report posterior probabilities directly.
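
The recipe and the multiple-testing arithmetic can both be sketched in Python, using the "coin is fair" null as the running example. This is an illustration under stated assumptions, not a reference implementation: the function names are hypothetical, "as extreme or more so" is taken to mean every outcome no more probable under H₀ than the observed one (one common convention for exact two-sided tests), and the family-wise error formula assumes independent tests whose nulls are all true.

```python
from math import comb


def binomial_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def exact_two_sided_p(heads: int, flips: int, p_null: float = 0.5) -> float:
    """Exact binomial p-value under H0: P(heads) = p_null.

    Sums the null probability of every outcome that is no more likely
    than the observed count (an assumed convention for "as extreme").
    """
    observed = binomial_pmf(heads, flips, p_null)
    total = 0.0
    for k in range(flips + 1):
        prob = binomial_pmf(k, flips, p_null)
        if prob <= observed * (1 + 1e-9):  # small tolerance for float ties
            total += prob
    return min(total, 1.0)


def decide(p_value: float, alpha: float = 0.05) -> str:
    """Apply the pre-chosen significance level."""
    return "reject H0" if p_value < alpha else "fail to reject H0"


def familywise_error_rate(alpha: float, n_tests: int) -> float:
    """Chance of at least one false positive across n_tests independent
    tests when every null is true."""
    return 1 - (1 - alpha) ** n_tests


def bonferroni_alpha(alpha: float, n_tests: int) -> float:
    """Per-test threshold that keeps the family-wise rate at or below alpha."""
    return alpha / n_tests


for heads in (9, 8):
    p = exact_two_sided_p(heads, flips=10)
    print(f"{heads} heads in 10 flips: p = {p:.4f} -> {decide(p)}")
# 9 heads: p = 22/1024, about 0.0215, so H0 is rejected at alpha = 0.05
# 8 heads: p = 112/1024, about 0.1094, so H0 is not rejected

print(f"20 tests at alpha 0.05: FWER = {familywise_error_rate(0.05, 20):.3f}")
print(f"Bonferroni per-test alpha = {bonferroni_alpha(0.05, 20):.4f}")
```

Twenty independent tests at α = 0.05 give roughly a 64% chance of at least one false positive even when nothing is going on, which is why the Bonferroni threshold drops to 0.0025 per test.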

Why now

Pharmaceutical clinical trials are formally hypothesis tests, often against placebo, with regulatory frameworks structured around α and β. A/B testing in technology companies, the basis of every "this button is now blue" decision at scale, is hypothesis testing on user metrics, with millions of micro-experiments running daily. Particle physics uses extreme thresholds (the 5-sigma standard, a one-sided p ≈ 3 × 10⁻⁷) to claim a discovery; the Higgs boson was announced when a peak in CERN data crossed this line in 2012. The framework's flaws are now widely acknowledged in the scientific community, but its replacement remains contested: Bayesian methods, effect-size reporting, and pre-registration are all gaining ground without yet displacing the p-value's central role.
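
The link between a sigma threshold and a p-value is a one-line calculation. The sketch below assumes the test statistic is standard normal under H₀ and uses the one-sided (upper-tail) convention, which is the one that gives about 3 × 10⁻⁷ at 5 sigma; the function name is hypothetical.

```python
from math import erfc, sqrt


def one_sided_p_from_sigma(z: float) -> float:
    """Upper-tail probability of a standard normal beyond z standard deviations."""
    return 0.5 * erfc(z / sqrt(2))


for z in (2.0, 3.0, 5.0):
    print(f"{z:.0f} sigma -> one-sided p = {one_sided_p_from_sigma(z):.3g}")
# 5 sigma comes out at roughly 2.9e-07, the discovery threshold quoted above
```

A two-sided version would simply double each value, so the choice of convention matters by a factor of two but does not change the order of magnitude.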

Further reading

For working scientists, Wasserman's All of Statistics covers the apparatus efficiently; Casella and Berger's Statistical Inference goes deeper. The replication-crisis context is best read in Andrew Gelman and Eric Loken's 2013 essay The Garden of Forking Paths; Richard McElreath's Statistical Rethinking (2nd ed., 2020) builds the Bayesian alternative from scratch. Deborah Mayo's Statistical Inference as Severe Testing (2018) is the most thoughtful modern defence of the frequentist approach.