Beyond LLMs
Diffusion Models
Generative models that learn to reverse a noise-adding process: the math of score matching, denoising, SDEs, and why diffusion dominates image generation.
Why This Matters
Diffusion models are the dominant generative architecture for images, video, audio, and molecular design as of 2024-2025. Stable Diffusion, DALL-E 3, Imagen, and Sora all use diffusion or its close relative, flow matching.
Unlike GANs, diffusion models have stable training (no minimax game). Unlike VAEs, they produce high-fidelity samples without blurry compromises. The mathematical framework connects deep learning to stochastic differential equations and gives a principled approach to density estimation.
Mental Model
Start with a clean image. Gradually add Gaussian noise over many steps until the image becomes pure noise. Now learn to reverse each step: given a noisy image, predict the slightly-less-noisy version. At generation time, start from pure noise and iteratively denoise to produce a clean image.
The key insight: you do not need to learn the reverse process from scratch. You can train on pairs (noisy image at step $t$, noise that was added) because you control the forward process completely.
Formal Setup
Forward Process (Noise Schedule)
The forward process adds Gaussian noise over $T$ steps. Starting from data $x_0 \sim q(x_0)$:

$$q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t I\right)$$

where $\beta_1, \ldots, \beta_T$ is a variance schedule. After $T$ steps, $x_T$ is approximately $\mathcal{N}(0, I)$.
Closed-Form Forward Sampling
Define $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$. Then you can sample any $x_t$ directly:

$$q(x_t \mid x_0) = \mathcal{N}\!\left(x_t;\ \sqrt{\bar{\alpha}_t}\,x_0,\ (1-\bar{\alpha}_t) I\right)$$

This is the reparameterization: $x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon$, where $\epsilon \sim \mathcal{N}(0, I)$.
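As a quick numerical check, here is a minimal NumPy sketch (assuming the standard DDPM linear schedule from $\beta_1 = 10^{-4}$ to $\beta_T = 0.02$; all values illustrative) showing that iterating the one-step transition reproduces the closed-form marginal:

```python
import numpy as np

rng = np.random.default_rng(0)

T = 1000
betas = np.linspace(1e-4, 0.02, T)   # linear variance schedule (DDPM default)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)      # cumulative products alpha_bar_t

def q_sample(x0, t, eps):
    """Closed-form sample of x_t given x_0 and noise eps."""
    return np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps

# Iterate the one-step transition for 300 steps from x0 = 1 on many chains...
n, t_target = 20000, 300
x = np.full(n, 1.0)
for t in range(t_target):
    x = np.sqrt(1.0 - betas[t]) * x + np.sqrt(betas[t]) * rng.standard_normal(n)

# ...and compare against the closed-form marginal N(sqrt(abar)*x0, 1 - abar).
mean_err = abs(x.mean() - np.sqrt(alpha_bars[t_target - 1]))
var_err = abs(x.var() - (1.0 - alpha_bars[t_target - 1]))
```

After the full $T = 1000$ steps, `alpha_bars[-1]` is on the order of $10^{-5}$: essentially no signal survives, which is exactly the "pure noise" endpoint.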
Reverse Process
The reverse process is a learned Markov chain:

$$p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\!\left(x_{t-1};\ \mu_\theta(x_t, t),\ \Sigma_\theta(x_t, t)\right)$$

The neural network $\mu_\theta$ (or equivalently a noise predictor $\epsilon_\theta$) learns to undo each noise step.
Score Function
The score function of a distribution $p(x)$ is the gradient of the log-density:

$$s(x) = \nabla_x \log p(x)$$

Score matching trains a model $s_\theta(x) \approx \nabla_x \log p(x)$ without knowing $p(x)$ explicitly.
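To make the definition concrete, a tiny sketch: for a Gaussian $\mathcal{N}(\mu, \sigma^2)$ the score is $-(x-\mu)/\sigma^2$, which we can confirm against a finite-difference gradient of the log-density (parameter values here are illustrative):

```python
import numpy as np

mu, sigma = 2.0, 0.5   # illustrative Gaussian parameters

def log_p(x):
    """Log-density of N(mu, sigma^2)."""
    return -0.5 * ((x - mu) / sigma) ** 2 - np.log(sigma * np.sqrt(2 * np.pi))

def score(x):
    """Analytic score: gradient of the log-density."""
    return -(x - mu) / sigma**2

x, h = 1.3, 1e-5
numeric = (log_p(x + h) - log_p(x - h)) / (2 * h)   # central difference
```

Note that the score points toward regions of higher density; following it is the deterministic core of every sampler below.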
Main Theorems
DDPM Training Objective
Statement
The negative log-likelihood is upper bounded by the variational bound:

$$\mathbb{E}\left[-\log p_\theta(x_0)\right] \le \mathbb{E}_q\!\left[-\log \frac{p_\theta(x_{0:T})}{q(x_{1:T} \mid x_0)}\right]$$

When the noise predictor is parameterized as $\epsilon_\theta(x_t, t)$, the simplified training loss reduces to:

$$L_{\text{simple}} = \mathbb{E}_{t,\,x_0,\,\epsilon}\!\left[\left\| \epsilon - \epsilon_\theta\!\left(\sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon,\ t\right) \right\|^2\right]$$
Intuition
Training a diffusion model reduces to: sample a random timestep $t$, add the corresponding amount of noise to a clean image, and train the network to predict the noise $\epsilon$. The simplified loss drops the KL weighting terms but works better in practice.
Proof Sketch
Start from the standard variational bound on $-\log p_\theta(x_0)$. Because the forward process is Gaussian and its posteriors are tractable Gaussians, each KL term becomes a squared error between predicted and true denoising means. Reparameterize the mean prediction as noise prediction to get the simplified loss.
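In code, the simplified objective is only a few lines. A hedged NumPy sketch, where `eps_model` is a stand-in for a real network; with a trivial zero predictor the expected loss is $\mathbb{E}\|\epsilon\|^2 = 1$ per dimension, a useful sanity baseline:

```python
import numpy as np

rng = np.random.default_rng(1)
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bars = np.cumprod(1.0 - betas)

def ddpm_loss(eps_model, x0):
    """One Monte Carlo estimate of the simplified DDPM loss L_simple."""
    t = rng.integers(0, T)                          # uniform random timestep
    eps = rng.standard_normal(x0.shape)             # the noise we add
    xt = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps
    return np.mean((eps - eps_model(xt, t)) ** 2)   # ||eps - eps_theta(x_t, t)||^2

x0 = rng.standard_normal(16)                        # toy "image"
zero_model = lambda xt, t: np.zeros_like(xt)        # predicts no noise at all
avg_loss = float(np.mean([ddpm_loss(zero_model, x0) for _ in range(2000)]))
```

A trained network would replace `zero_model` and drive the loss below this baseline by exploiting structure in `x0`.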
Why It Matters
This is why diffusion training is so simple in practice: just predict the noise. The variational framework guarantees you are optimizing a valid bound on log-likelihood, but the simplified loss ignores weighting factors and still produces excellent samples.
Failure Mode
The simplified loss underweights certain timesteps (particularly high-noise steps), which matters for likelihood evaluation. If you care about exact log-likelihoods rather than sample quality, use the weighted objective.
Denoising Score Matching
Statement
Training a neural network to denoise Gaussian-corrupted data is equivalent to learning the score function of the noise-perturbed distribution. Formally, for perturbation kernel $q_\sigma(\tilde{x} \mid x) = \mathcal{N}(\tilde{x};\ x,\ \sigma^2 I)$:

$$\mathbb{E}_{x \sim p_{\text{data}},\ \tilde{x} \sim q_\sigma(\cdot \mid x)}\!\left[\left\| s_\theta(\tilde{x}) - \nabla_{\tilde{x}} \log q_\sigma(\tilde{x} \mid x) \right\|^2\right] = \mathbb{E}_{\tilde{x} \sim q_\sigma}\!\left[\left\| s_\theta(\tilde{x}) - \nabla_{\tilde{x}} \log q_\sigma(\tilde{x}) \right\|^2\right] + C$$

where $C$ is a constant independent of $\theta$.
Intuition
You cannot compute $\nabla_x \log p_{\text{data}}(x)$ because you do not know $p_{\text{data}}$. But if you add known noise, the denoising direction is the score of the noisy distribution. Predicting the noise that was added is the same as learning the score.
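One way to see the denoising-score connection concretely is Tweedie's formula: the posterior mean of the clean data given the noisy observation is $\mathbb{E}[x \mid \tilde{x}] = \tilde{x} + \sigma^2 \nabla_{\tilde{x}} \log q_\sigma(\tilde{x})$, so the ideal denoiser is recoverable from the score and vice versa. A sketch for the fully tractable Gaussian case (prior and noise level are illustrative):

```python
import numpy as np

sigma = 0.5
x_tilde = np.array([-1.0, 0.2, 1.5])   # some corrupted observations

# With prior x ~ N(0,1) and x_tilde = x + sigma*eps, the noisy marginal is
# q(x_tilde) = N(0, 1 + sigma^2), so its score is available in closed form.
marginal_score = -x_tilde / (1.0 + sigma**2)

# Tweedie's formula: ideal denoiser = noisy point + sigma^2 * score
denoised_tweedie = x_tilde + sigma**2 * marginal_score

# Exact Gaussian posterior mean, for comparison
denoised_exact = x_tilde / (1.0 + sigma**2)
```

The two expressions agree exactly, which is the theorem in miniature: knowing the score of the noisy marginal is the same as knowing the optimal denoiser.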
Why It Matters
This theorem unifies the denoising perspective (DDPM) with the score-matching perspective (score-based generative models). They are the same thing. The noise predictor and the score model are related by a simple rescaling.
The SDE Perspective
The discrete steps of DDPM are a discretization of a continuous stochastic differential equation. The forward SDE adds noise continuously:

$$dx = f(x, t)\,dt + g(t)\,dw$$

The reverse SDE (Anderson, 1982) runs time backward:

$$dx = \left[f(x, t) - g(t)^2\,\nabla_x \log p_t(x)\right] dt + g(t)\,d\bar{w}$$

The score $\nabla_x \log p_t(x)$ is the only unknown, and it is exactly what the denoising network learns. This SDE framework allows adaptive step sizes, probability flow ODEs for deterministic sampling, and exact likelihood computation.
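The reverse SDE can be simulated directly when the score is known in closed form. A hedged Euler-Maruyama sketch: toy 1D data $x_0 \sim \mathcal{N}(2, 0.25)$ under the VP-SDE $dx = -\tfrac{1}{2}\beta(t)\,x\,dt + \sqrt{\beta(t)}\,dw$, where every marginal stays Gaussian so the true score is analytic (the schedule values are the common ones but are assumptions here, not from the text):

```python
import numpy as np

rng = np.random.default_rng(2)

beta_min, beta_max = 0.1, 20.0
beta = lambda t: beta_min + t * (beta_max - beta_min)
B = lambda t: beta_min * t + 0.5 * (beta_max - beta_min) * t**2   # integral of beta

def score(x, t):
    """Exact score of p_t when x_0 ~ N(2, 0.25): p_t stays Gaussian."""
    m = 2.0 * np.exp(-0.5 * B(t))                    # mean of p_t
    v = 0.25 * np.exp(-B(t)) + 1.0 - np.exp(-B(t))   # variance of p_t
    return -(x - m) / v

# Euler-Maruyama integration of the reverse SDE from t = 1 down to t = 0.
n, steps = 5000, 1000
dt = 1.0 / steps
x = rng.standard_normal(n)                           # start from the N(0,1) prior
for i in range(steps, 0, -1):
    t = i * dt
    drift = -0.5 * beta(t) * x - beta(t) * score(x, t)   # f - g^2 * score
    x = x - drift * dt + np.sqrt(beta(t) * dt) * rng.standard_normal(n)
```

The integrated samples land back on the data distribution: sample mean near 2 and variance near 0.25, even though we started from pure noise.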
Classifier-Free Guidance
Conditional generation (e.g., text-to-image) uses classifier-free guidance: train the model with and without conditioning $c$, then at sampling time extrapolate:

$$\tilde{\epsilon}_\theta(x_t, c) = \epsilon_\theta(x_t, \varnothing) + w\left(\epsilon_\theta(x_t, c) - \epsilon_\theta(x_t, \varnothing)\right)$$

where $w \ge 1$ is the guidance scale. This amplifies the influence of the conditioning signal. Higher $w$ gives more faithful but less diverse samples.
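The guidance step itself is one line; a minimal sketch (the array values are purely illustrative):

```python
import numpy as np

def cfg_combine(eps_uncond, eps_cond, w):
    """Classifier-free guidance: extrapolate past the conditional prediction."""
    return eps_uncond + w * (eps_cond - eps_uncond)

eps_u = np.array([0.0, 0.0])    # unconditional noise prediction
eps_c = np.array([1.0, -1.0])   # conditional noise prediction

guided = cfg_combine(eps_u, eps_c, w=3.0)   # 3x along the conditioning direction
```

With $w = 1$ this recovers the plain conditional prediction; $w = 0$ ignores the conditioning entirely; $w > 1$ extrapolates beyond it, which is where the fidelity/diversity trade-off comes from.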
Flow Matching
Flow Matching with Optimal Transport Paths
Statement
Instead of learning to reverse a stochastic noising process, flow matching learns a velocity field $v_\theta(x, t)$ that transports samples from noise to data along straight (optimal transport) paths. The training objective is:

$$\mathbb{E}_{t,\,x_0,\,x_1}\!\left[\left\| v_\theta(x_t, t) - (x_1 - x_0) \right\|^2\right]$$

where $x_t = (1-t)\,x_0 + t\,x_1$ is the linear interpolation between noise $x_0$ and data $x_1$.
Intuition
Diffusion takes curved, noisy paths from data to noise and back. Flow matching takes straight lines. Straighter paths mean fewer integration steps at generation time, which means faster sampling.
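The straight-path property is easy to verify numerically: along the linear interpolation the conditional velocity is the constant $x_1 - x_0$, so a single Euler step with the true velocity finishes the transport exactly (a sketch with toy vectors):

```python
import numpy as np

rng = np.random.default_rng(3)

x0 = rng.standard_normal(4)          # noise sample
x1 = rng.standard_normal(4) + 3.0    # "data" sample

t = 0.35
xt = (1 - t) * x0 + t * x1           # point on the linear interpolation path
target_v = x1 - x0                   # constant velocity along the straight path

# One Euler step of size (1 - t) with the true velocity completes the transport.
x_end = xt + (1 - t) * target_v
```

A learned $v_\theta$ only approximates this velocity, so real samplers still take several steps, but far fewer than a curved diffusion trajectory needs.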
Why It Matters
Flow matching is emerging as a simpler, faster alternative to diffusion. It avoids the noise schedule design problem entirely. Stable Diffusion 3 and many recent models use flow matching instead of DDPM.
Applications
- Image generation: Stable Diffusion, DALL-E 3, Imagen, Midjourney
- Video generation: Sora (OpenAI), video world models
- Audio synthesis: AudioLDM, diffusion-based music generation
- Molecular design: diffusion for protein structure, drug molecules
- 3D generation: DreamFusion, point cloud diffusion, and Gaussian splatting representations
- Robotics: diffusion policies for action generation
Common Confusions
Diffusion is not just 'reverse noise'
The naive description, "add noise, then learn to remove it," obscures the actual mathematics. The model learns the score function (gradient of log-density) of the noisy distribution at each noise level. Denoising is a consequence, not the definition. The SDE framework makes this precise: sampling is solving a reverse-time SDE using the learned score.
More steps is not always better
DDPM originally used 1000 denoising steps. Modern samplers (DDIM, DPM-Solver, consistency models) reduce this to 4-50 steps with minimal quality loss. The number of steps is a sampler design choice, not a fundamental property of the model.
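As an illustration of that claim, here is a hedged DDIM sketch: the update is deterministic, and with an oracle noise predictor (we feed in the true $\epsilon$, which a real model only approximates) three jumps suffice to recover $x_0$ exactly:

```python
import numpy as np

rng = np.random.default_rng(4)
T = 1000
betas = np.linspace(1e-4, 0.02, T)
abar = np.cumprod(1.0 - betas)

x0 = rng.standard_normal(8)
eps = rng.standard_normal(8)
xt = np.sqrt(abar[-1]) * x0 + np.sqrt(1.0 - abar[-1]) * eps   # fully noised

def ddim_step(x, eps_hat, t_from, t_to):
    """Deterministic DDIM update: jump from timestep t_from to t_to."""
    x0_pred = (x - np.sqrt(1.0 - abar[t_from]) * eps_hat) / np.sqrt(abar[t_from])
    if t_to < 0:
        return x0_pred                 # final step returns the x0 estimate
    return np.sqrt(abar[t_to]) * x0_pred + np.sqrt(1.0 - abar[t_to]) * eps_hat

# With an oracle noise predictor, three jumps cover all 1000 timesteps.
x = xt
for t_from, t_to in [(999, 499), (499, 99), (99, -1)]:
    x = ddim_step(x, eps, t_from, t_to)
```

The step count is a property of the sampler's jump schedule, not of the trained model; the same network weights serve 1000-step and 3-step sampling.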
Guidance scale is not a quality dial
Increasing the classifier-free guidance scale $w$ makes outputs more aligned with the prompt but reduces diversity and introduces artifacts at high values. There is a sweet spot (e.g., around 7.5 is the common default for Stable Diffusion) beyond which quality degrades.
Summary
- Forward process: gradually add Gaussian noise until data becomes pure noise
- Reverse process: learn to denoise step by step (or equivalently, learn the score function)
- Training reduces to predicting the noise added at each step
- The SDE perspective unifies discrete (DDPM) and continuous formulations
- Classifier-free guidance enables conditional generation by interpolating between conditional and unconditional predictions
- Flow matching generalizes diffusion with straighter transport paths and simpler training
Exercises
Problem
Given $\bar{\alpha}_t = 0.01$ (i.e., 99% noise), write the expression for $x_t$ in terms of $x_0$ and $\epsilon$. How much of the original signal remains?
Problem
Show that the denoising score matching objective (predicting noise $\epsilon$) and the score matching objective (learning $\nabla_x \log p(x)$) are equivalent up to a constant and a rescaling. What is the exact relationship between $\epsilon_\theta$ and $s_\theta$?
Problem
Flow matching uses straight interpolation paths $x_t = (1-t)\,x_0 + t\,x_1$. What happens if you use curved paths instead (e.g., geodesics on a data manifold)? When might this matter?
References
Canonical:
- Ho, Jain, Abbeel, "Denoising Diffusion Probabilistic Models" (NeurIPS 2020)
- Song et al., "Score-Based Generative Modeling through Stochastic Differential Equations" (ICLR 2021)
Current:
- Lipman et al., "Flow Matching for Generative Modeling" (ICLR 2023)
- Ho & Salimans, "Classifier-Free Diffusion Guidance" (NeurIPS Workshop 2022)
- Goodfellow, Bengio, Courville, Deep Learning (2016), Chapters 14-20
- Zhang et al., Dive into Deep Learning (2023), Chapters 14-17
Next Topics
The natural next steps from diffusion models:
- Mamba and state-space models: alternative architectures for sequence generation beyond the transformer/diffusion paradigm
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Variational Autoencoders (Layer 3)
- Autoencoders (Layer 2)
- Feedforward Networks and Backpropagation (Layer 2)
- Differentiation in Rn (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
- Matrix Calculus (Layer 1)
- The Jacobian Matrix (Layer 0A)
- The Hessian Matrix (Layer 0A)
- Activation Functions (Layer 1)
- Convex Optimization Basics (Layer 1)
- Matrix Operations and Properties (Layer 0A)
- Vectors, Matrices, and Linear Maps (Layer 0A)
- Maximum Likelihood Estimation (Layer 0B)
- Common Probability Distributions (Layer 0A)
Builds on This
- Flow Matching (Layer 4)
- Video World Models (Layer 5)