Computer Science & AI

Diffusion Models

Generate by reversing the slow random walk that turned signal into noise.

In 2015, Sohl-Dickstein and colleagues at Stanford published a curious paper, Deep Unsupervised Learning Using Nonequilibrium Thermodynamics, proposing that you could train a generative model by reversing a slow random-noise corruption process: take an image, gradually add Gaussian noise until it becomes pure static, then train a neural network to undo the corruption one step at a time. To generate a new image, start from pure noise and run the learned reverse process step by step. The paper was largely ignored for about five years, until Song and Ermon (2019) connected the idea to score matching and Ho et al. (2020) cleaned up the formulation as Denoising Diffusion Probabilistic Models. By 2022 Stable Diffusion, DALL·E 2, and Midjourney had launched and the image-generation landscape was unrecognisable.

Diffusion models are generative models that learn to reverse a fixed noise-injection process. The forward process takes a clean data sample x₀ and progressively adds Gaussian noise over T steps (typically T = 1000), producing x_T, which is approximately pure noise; this process is fixed and has no learnable parameters. The reverse process, which is what the model has to learn, predicts x_{t−1} from x_t, with a neural network ε_θ(x_t, t) trained to estimate the noise that was added to reach step t. Training is remarkably simple: sample a clean image, sample a timestep, add the corresponding noise, ask the network to predict that noise, and minimize the mean squared error. No adversarial training, no mode-collapse pathologies, no balancing two networks against each other as in GANs. At sampling time you start from pure noise and repeatedly apply the learned reverse step, ending with a sample from (an approximation of) the data distribution.

The score-based interpretation shows that, up to a known timestep-dependent scaling, the network is learning the score ∇_{x_t} log p(x_t): the gradient of the log-density of the noised data with respect to x_t, not a gradient of the likelihood with respect to model parameters. Sampling then becomes a form of Langevin dynamics driven by the learned score. Conditioning on text adds a text encoder (CLIP or T5) feeding cross-attention layers; classifier-free guidance trades sample diversity for fidelity to the prompt; latent diffusion operates in a compressed VAE latent space rather than in pixel space, reducing compute by roughly 10×. The denoising network is typically a U-Net in earlier image models or a transformer (DiT) in newer image and video models such as Sora.

The kinship to non-equilibrium statistical physics is not metaphorical: the forward noising process is a diffusion whose probability density evolves according to a Fokker-Planck equation, and the score function plays the role of a thermodynamic force. The technique now drives audio (Stable Audio), video (Sora, Veo), protein-structure prediction (AlphaFold 3), and molecular design.
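The training and sampling recipe described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not a reference implementation: the linear schedule endpoints (1e-4 to 0.02), the choice of reverse-step variance β_t, and every function name here are assumptions made for the example. The "network" is a hypothetical analytic stand-in that is exactly optimal only when the data are standard normal, which keeps the sketch runnable without a deep-learning framework. The guidance function uses the convention ε_uncond + w·(ε_cond − ε_uncond); other texts parameterize the weight differently.

```python
import numpy as np

def make_schedule(T=1000, beta_start=1e-4, beta_end=0.02):
    # Linear variance schedule. The endpoint values are an assumption
    # (commonly used DDPM defaults), not something stated in the text.
    betas = np.linspace(beta_start, beta_end, T)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)  # abar_t = product of alpha_s for s <= t
    return betas, alphas, alpha_bars

def q_sample(x0, t, alpha_bars, rng):
    # Forward process in closed form (t is a 0-based step index):
    # x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps, with eps ~ N(0, I).
    eps = rng.standard_normal(x0.shape)
    xt = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps
    return xt, eps

def training_loss(eps_model, x0, alpha_bars, rng):
    # One stochastic estimate of the noise-prediction MSE:
    # pick a random timestep, noise the sample, compare predicted and true noise.
    t = int(rng.integers(0, len(alpha_bars)))
    xt, eps = q_sample(x0, t, alpha_bars, rng)
    return float(np.mean((eps - eps_model(xt, t)) ** 2))

def p_sample_step(eps_model, xt, t, betas, alphas, alpha_bars, rng):
    # One ancestral reverse step x_t -> x_{t-1}, with variance beta_t (assumed).
    eps_hat = eps_model(xt, t)
    mean = (xt - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps_hat) / np.sqrt(alphas[t])
    if t == 0:
        return mean  # no fresh noise is added on the final step
    return mean + np.sqrt(betas[t]) * rng.standard_normal(xt.shape)

def sample(eps_model, shape, betas, alphas, alpha_bars, rng):
    # Start from pure noise and apply the reverse step T times.
    x = rng.standard_normal(shape)
    for t in range(len(betas) - 1, -1, -1):
        x = p_sample_step(eps_model, x, t, betas, alphas, alpha_bars, rng)
    return x

def guided_eps(eps_cond, eps_uncond, w):
    # Classifier-free guidance: w = 0 is unconditional, w = 1 is plain
    # conditional, w > 1 pushes harder toward the condition.
    return eps_uncond + w * (eps_cond - eps_uncond)

def make_toy_eps_model(alpha_bars):
    # Hypothetical stand-in for a trained network. If the data are N(0, I),
    # the optimal noise prediction is E[eps | x_t] = sqrt(1 - abar_t) * x_t.
    return lambda xt, t: np.sqrt(1.0 - alpha_bars[t]) * xt

rng = np.random.default_rng(0)
betas, alphas, alpha_bars = make_schedule()
toy_model = make_toy_eps_model(alpha_bars)
samples = sample(toy_model, (5000,), betas, alphas, alpha_bars, rng)
print(samples.mean(), samples.std())
```

With the toy model, the printed mean and standard deviation should land close to 0 and 1, since the reverse chain is recovering standard-normal "data". In a real system `toy_model` would be replaced by a trained U-Net or transformer, and `training_loss` would be averaged over minibatches and minimized by gradient descent.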

Why it matters now

The image-generation industry runs almost entirely on diffusion and its close relatives (Midjourney, Stable Diffusion, DALL·E, Imagen, Firefly, Flux). Video generation (Sora, Veo, Kling, Runway) is diffusion-based, and music generators such as Suno and Udio are thought to be at least partly diffusion-based, though their architectures are not fully public. AlphaFold 3 (2024) adopted a diffusion module for structure generation, which enabled prediction of protein-ligand complexes. Open research frontiers include flow matching and rectified flow, simplified successors that achieve comparable results with fewer sampling steps, and consistency models, which generate samples in 1–4 steps instead of 50–1000. Copyright and labour litigation around image-generation models is active (Getty Images v. Stability AI, Andersen v. Stability AI), and deepfake concerns have prompted content-authentication standards (C2PA, watermarking). A 2015 paper that borrowed its central idea from nonequilibrium thermodynamics is now the generative substrate of nearly every commercial creative-AI product.

Further reading

Song & Ermon, Generative Modeling by Estimating Gradients of the Data Distribution (2019). Ho, Jain, & Abbeel, Denoising Diffusion Probabilistic Models (2020). Sohl-Dickstein et al., Deep Unsupervised Learning Using Nonequilibrium Thermodynamics (2015). Lilian Weng's blog post What Are Diffusion Models? (2021).
Polymathic — a curated catalogue of the ideas worth keeping across twelve disciplines. polymathic.app