Diffusion Models

Generate by reversing the slow random walk that turned signal into noise.

The brief

In 2015 a Stanford team published a paper almost no one read, with an unlikely proposal borrowed from physics: to build a model that creates images, first destroy one. Take a photograph and add a little random noise, then a little more, and again, up to thousands of times, until nothing is left but static. Then train a neural network to run the process backward, removing a touch of noise at each step. Do it well enough, and you can start from pure static and watch a coherent image resolve out of the fog, as if developing in reverse. The idea sat largely ignored for five years. By 2022, refined and scaled up, it powered Stable Diffusion, DALL·E 2, and Midjourney, and the world of image-making was suddenly unrecognizable.

The elegance is in how little the model is ever asked to learn. The forward half, adding noise step by step until a picture dissolves into static, is fixed in advance, with nothing to train; it is just a controlled slide into randomness. All the network learns is how to undo a small step: given a noisy image and its noise level, estimate the noise that has been mixed in, so that a little of it can be subtracted. Training could hardly be plainer: corrupt an image by a random amount, ask the network to name the noise, nudge it toward the right answer, and repeat across millions of pictures. There is no adversary to balance, none of the instability that made the previous generation of image models, generative adversarial networks (GANs), so temperamental; the task is just patient denoising. To generate something new, you chain those small reverse steps from pure noise all the way down to a clean image. Steering it is a matter of feeding in a text description that tilts each denoising step toward pictures matching the words, which is how a typed prompt becomes a scene.

What is striking is that none of this is loose analogy with physics; it is the same mathematics. The forward corruption is literally a diffusion process, governed by the same kind of equations that describe a drop of ink spreading through water or heat leaking through metal, and the network is learning to run that diffusion backward. A technique lifted straight from nineteenth-century thermodynamics has turned out to be the most powerful way yet found to conjure images, and increasingly video, sound, and even the folded shapes of proteins, out of nothing but noise.
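The fixed forward noising, the noise-guessing training objective, and the chained reverse steps can be sketched in a few lines of NumPy. This is a minimal sketch under stated assumptions, not any particular system's implementation. The 1,000-step linear noise schedule uses the values reported by Ho et al. (2020), and all function names are hypothetical. The trained network is replaced by `oracle_noise`, a closed-form stand-in that is only available because the toy "dataset" here is a single known image, `TARGET`. Text conditioning is left out.

```python
import numpy as np

T = 1000                              # number of noising steps (assumed)
betas = np.linspace(1e-4, 0.02, T)    # noise variance added at each step (assumed linear schedule)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)       # share of signal variance left after t + 1 steps


def add_noise(x0, t, eps):
    """Forward process: jump straight to noise level t in closed form."""
    return np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps


def noise_prediction_loss(predict_noise, x0, rng):
    """One training example: corrupt x0 by a random amount, then score the noise guess."""
    t = int(rng.integers(T))                  # random noise level
    eps = rng.standard_normal(x0.shape)       # the noise the network must name
    x_t = add_noise(x0, t, eps)
    return float(np.mean((predict_noise(x_t, t) - eps) ** 2))


def sample(predict_noise, shape, rng):
    """Reverse process: start from pure noise and chain T small denoising steps."""
    x = rng.standard_normal(shape)
    for t in range(T - 1, -1, -1):
        eps_hat = predict_noise(x, t)
        # Subtract the estimated noise contribution for this one step.
        mean = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps_hat) / np.sqrt(alphas[t])
        # Re-inject a little fresh noise on every step except the last.
        noise = rng.standard_normal(shape) if t > 0 else 0.0
        x = mean + np.sqrt(betas[t]) * noise
    return x


# Hypothetical toy data: the whole "dataset" is this one 2x2 image.
TARGET = np.array([[0.5, -1.0], [2.0, 0.0]])


def oracle_noise(x_t, t):
    """Stand-in for a trained network: exact noise, known only because the data is TARGET."""
    return (x_t - np.sqrt(alpha_bars[t]) * TARGET) / np.sqrt(1.0 - alpha_bars[t])


rng = np.random.default_rng(0)
loss = noise_prediction_loss(oracle_noise, TARGET, rng)
generated = sample(oracle_noise, TARGET.shape, rng)
```

Because the oracle knows the only image in the dataset, `loss` is zero up to rounding error and `generated` lands on `TARGET`. In a real system a neural network would take the place of `oracle_noise` and be trained by gradient descent on this same loss over many images.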

Why now

Almost the entire generative-media industry now runs on this one idea: most of the leading image, video, and music generators are diffusion underneath, and AlphaFold 3 even bolted a diffusion step onto protein prediction to place atoms in space. Research is racing to make it faster, since the original method needs hundreds of denoising steps per image, sometimes a thousand, and newer variants cut that to a handful. Around it swirl the defining fights of generative AI: lawsuits over training on copyrighted images, and content-authentication standards meant to mark what is synthetic. A quiet 2015 physics paper has become the engine under nearly every creative-AI product in use today.
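One way the step count can shrink is to visit only a few of the noise levels and take a larger deterministic jump between them. The sketch below illustrates that idea with a DDIM-style update (Song, Meng & Ermon). That technique is chosen here purely as an example and is not named in the text above. The schedule, the function names, and the `oracle_noise` stand-in for a trained network are all assumptions carried by this toy setup.

```python
import numpy as np

T = 1000                                # noise levels the model was defined over (assumed)
betas = np.linspace(1e-4, 0.02, T)      # assumed linear schedule
alpha_bars = np.cumprod(1.0 - betas)


def sample_few_steps(predict_noise, shape, num_steps, rng):
    """Deterministic DDIM-style sampler that visits only num_steps of the T noise levels."""
    # Evenly spaced noise levels, from noisiest to cleanest.
    timesteps = np.linspace(T - 1, 0, num_steps).round().astype(int)
    x = rng.standard_normal(shape)
    for i, t in enumerate(timesteps):
        eps_hat = predict_noise(x, t)
        # Current best guess at the clean image implied by the noise estimate.
        x0_hat = (x - np.sqrt(1.0 - alpha_bars[t]) * eps_hat) / np.sqrt(alpha_bars[t])
        # Re-noise that guess to the next, lower level (no noise at all after the last step).
        ab_prev = alpha_bars[timesteps[i + 1]] if i + 1 < num_steps else 1.0
        x = np.sqrt(ab_prev) * x0_hat + np.sqrt(1.0 - ab_prev) * eps_hat
    return x


# Hypothetical toy data and oracle, standing in for a dataset and a trained network.
TARGET = np.array([[0.5, -1.0], [2.0, 0.0]])


def oracle_noise(x_t, t):
    return (x_t - np.sqrt(alpha_bars[t]) * TARGET) / np.sqrt(1.0 - alpha_bars[t])


rng = np.random.default_rng(0)
fast = sample_few_steps(oracle_noise, TARGET.shape, num_steps=10, rng=rng)
```

With a perfect oracle any step count recovers `TARGET`, so this shows only the mechanics of skipping levels. With a learned network, fewer steps generally trade some image quality for speed.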

Further reading

Song & Ermon, Generative Modeling by Estimating Gradients of the Data Distribution (2019).
Ho, Jain, & Abbeel, Denoising Diffusion Probabilistic Models (2020).
Sohl-Dickstein et al., Deep Unsupervised Learning Using Nonequilibrium Thermodynamics (2015).
Lilian Weng's blog post What Are Diffusion Models? (2021).