Beta: content is under active construction and has not been peer-reviewed. Report errors on GitHub.


Diffusion Models

Generative models that learn to reverse a noise-adding process: the math of score matching, denoising, SDEs, and why diffusion dominates image generation.

Advanced · Tier 2 · Current · ~70 min

Why This Matters

Diffusion models are the dominant generative architecture for images, video, audio, and molecular design as of 2024-2025. Stable Diffusion, DALL-E 3, Imagen, and Sora all use diffusion or its close relative, flow matching.

Unlike GANs, diffusion models have stable training (no minimax game). Unlike VAEs, they produce high-fidelity samples without blurry compromises. The mathematical framework connects deep learning to stochastic differential equations and gives a principled approach to density estimation.

Mental Model

Start with a clean image. Gradually add Gaussian noise over many steps until the image becomes pure noise. Now learn to reverse each step: given a noisy image, predict the slightly-less-noisy version. At generation time, start from pure noise and iteratively denoise to produce a clean image.

The key insight: you do not need to learn the reverse process from scratch. You can train on pairs (noisy image at step $t$, noise that was added) because you control the forward process completely.

Formal Setup

Definition

Forward Process (Noise Schedule)

The forward process adds Gaussian noise over $T$ steps. Starting from data $\mathbf{x}_0 \sim q(\mathbf{x}_0)$:

$$q(\mathbf{x}_t \mid \mathbf{x}_{t-1}) = \mathcal{N}(\mathbf{x}_t; \sqrt{1 - \beta_t}\,\mathbf{x}_{t-1},\; \beta_t \mathbf{I})$$

where $\beta_1, \ldots, \beta_T$ is a variance schedule. After $T$ steps, $\mathbf{x}_T \approx \mathcal{N}(\mathbf{0}, \mathbf{I})$.

Definition

Closed-Form Forward Sampling

Define $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$. Then you can sample any $\mathbf{x}_t$ directly:

$$q(\mathbf{x}_t \mid \mathbf{x}_0) = \mathcal{N}(\mathbf{x}_t; \sqrt{\bar{\alpha}_t}\,\mathbf{x}_0,\; (1 - \bar{\alpha}_t)\mathbf{I})$$

This is the reparameterization: $\mathbf{x}_t = \sqrt{\bar{\alpha}_t}\,\mathbf{x}_0 + \sqrt{1 - \bar{\alpha}_t}\,\boldsymbol{\epsilon}$, where $\boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$.
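The closed-form sampling formula is easy to exercise directly. A minimal NumPy sketch, assuming a linear $\beta$ schedule (the constants below are the values commonly quoted for DDPM, used here purely for illustration):

```python
import numpy as np

def make_alpha_bar(T=1000, beta_start=1e-4, beta_end=0.02):
    """Linear variance schedule; returns the cumulative products alpha_bar_t."""
    betas = np.linspace(beta_start, beta_end, T)
    return np.cumprod(1.0 - betas)

def forward_sample(x0, t, alpha_bar, rng):
    """Sample x_t ~ q(x_t | x_0) in one shot via the reparameterization."""
    eps = rng.standard_normal(x0.shape)
    ab = alpha_bar[t]
    xt = np.sqrt(ab) * x0 + np.sqrt(1.0 - ab) * eps
    return xt, eps  # eps is the regression target during training

rng = np.random.default_rng(0)
alpha_bar = make_alpha_bar()
x0 = rng.standard_normal((4, 8))  # a batch of toy "images"
xt, eps = forward_sample(x0, t=500, alpha_bar=alpha_bar, rng=rng)
```

Note that no loop over timesteps is needed: any noise level is reachable in one step, which is what makes the training procedure below cheap.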

Definition

Reverse Process

The reverse process is a learned Markov chain:

$$p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t) = \mathcal{N}(\mathbf{x}_{t-1}; \boldsymbol{\mu}_\theta(\mathbf{x}_t, t),\; \sigma_t^2 \mathbf{I})$$

The neural network $\boldsymbol{\mu}_\theta$ (or equivalently a noise predictor $\boldsymbol{\epsilon}_\theta$) learns to undo each noise step.

Definition

Score Function

The score function of a distribution $p(\mathbf{x})$ is the gradient of the log-density:

$$\mathbf{s}(\mathbf{x}) = \nabla_{\mathbf{x}} \log p(\mathbf{x})$$

Score matching trains a model $\mathbf{s}_\theta(\mathbf{x}) \approx \nabla_{\mathbf{x}} \log p(\mathbf{x})$ without knowing $p(\mathbf{x})$ explicitly.

Main Theorems

Theorem

DDPM Training Objective

Statement

The negative log-likelihood is upper bounded by:

$$-\log p_\theta(\mathbf{x}_0) \leq \mathbb{E}_q\!\Big[\underbrace{D_{\mathrm{KL}}\big(q(\mathbf{x}_T \mid \mathbf{x}_0) \,\|\, p(\mathbf{x}_T)\big)}_{\text{prior matching}} + \sum_{t=2}^{T} \underbrace{D_{\mathrm{KL}}\big(q(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0) \,\|\, p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t)\big)}_{\text{denoising matching}} - \underbrace{\log p_\theta(\mathbf{x}_0 \mid \mathbf{x}_1)}_{\text{reconstruction}}\Big]$$

When the noise predictor is parameterized as $\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)$, the simplified training loss reduces to:

$$\mathcal{L}_{\text{simple}} = \mathbb{E}_{t, \mathbf{x}_0, \boldsymbol{\epsilon}}\!\big[\|\boldsymbol{\epsilon} - \boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)\|^2\big]$$

Intuition

Training a diffusion model reduces to: sample a random timestep $t$, add the corresponding amount of noise $\boldsymbol{\epsilon}$ to a clean image, and train the network to predict $\boldsymbol{\epsilon}$. The simplified loss drops KL weighting terms but works better in practice.
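That recipe fits in a few lines. A minimal NumPy sketch of one loss evaluation, with a placeholder zero predictor standing in for the real U-Net (the schedule and "network" here are illustrative assumptions):

```python
import numpy as np

def ddpm_loss(x0, alpha_bar, predict_eps, rng):
    """One Monte Carlo estimate of L_simple = E || eps - eps_theta(x_t, t) ||^2."""
    t = rng.integers(len(alpha_bar))     # uniform random timestep
    eps = rng.standard_normal(x0.shape)  # the noise to be predicted
    ab = alpha_bar[t]
    xt = np.sqrt(ab) * x0 + np.sqrt(1.0 - ab) * eps
    return np.mean((eps - predict_eps(xt, t)) ** 2)

rng = np.random.default_rng(1)
alpha_bar = np.cumprod(1.0 - np.linspace(1e-4, 0.02, 1000))
x0 = rng.standard_normal((4, 8))
# placeholder "network": always predicts zero noise
loss = ddpm_loss(x0, alpha_bar, predict_eps=lambda xt, t: np.zeros_like(xt), rng=rng)
```

In a real implementation `predict_eps` is the neural network and this scalar is backpropagated through; everything else is exactly as above.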

Proof Sketch

Start from the standard variational bound on $-\log p_\theta(\mathbf{x}_0)$. Because the forward process is Gaussian and its posteriors $q(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0)$ are tractable Gaussians, each KL term becomes a squared error between predicted and true denoising means. Reparameterize the mean prediction as noise prediction to get the simplified loss.

Why It Matters

This is why diffusion training is so simple in practice: just predict the noise. The variational framework guarantees you are optimizing a valid bound on log-likelihood, but the simplified loss ignores weighting factors and still produces excellent samples.

Failure Mode

The simplified loss underweights certain timesteps (particularly high-noise steps) which matters for likelihood evaluation. If you care about exact log-likelihoods rather than sample quality, use the weighted objective.

Theorem

Denoising Score Matching

Statement

Training a neural network to denoise Gaussian-corrupted data is equivalent to learning the score function of the noise-perturbed distribution. Formally, for perturbation kernel $q_\sigma(\tilde{\mathbf{x}} \mid \mathbf{x}) = \mathcal{N}(\tilde{\mathbf{x}}; \mathbf{x}, \sigma^2 \mathbf{I})$:

$$\mathbb{E}_{q_\sigma(\tilde{\mathbf{x}}, \mathbf{x})}\!\big[\|\mathbf{s}_\theta(\tilde{\mathbf{x}}) - \nabla_{\tilde{\mathbf{x}}} \log q_\sigma(\tilde{\mathbf{x}} \mid \mathbf{x})\|^2\big] = \mathbb{E}_{q_\sigma(\tilde{\mathbf{x}})}\!\big[\|\mathbf{s}_\theta(\tilde{\mathbf{x}}) - \nabla_{\tilde{\mathbf{x}}} \log q_\sigma(\tilde{\mathbf{x}})\|^2\big] + C$$

where $C$ is independent of $\theta$.

Intuition

You cannot compute $\nabla_{\mathbf{x}} \log p(\mathbf{x})$ because you do not know $p$. But if you add known noise, the denoising direction is the score of the noisy distribution. Predicting the noise that was added is the same as learning the score.

Why It Matters

This theorem unifies the denoising perspective (DDPM) with the score-matching perspective (score-based generative models). They are the same thing. The noise predictor $\boldsymbol{\epsilon}_\theta$ and the score model $\mathbf{s}_\theta$ are related by a simple rescaling.

The SDE Perspective

The discrete steps of DDPM are a discretization of a continuous stochastic differential equation. The forward SDE adds noise continuously:

$$d\mathbf{x} = \mathbf{f}(\mathbf{x}, t)\,dt + g(t)\,d\mathbf{w}$$

The reverse SDE (Anderson, 1982) runs time backward:

$$d\mathbf{x} = \big[\mathbf{f}(\mathbf{x}, t) - g(t)^2 \nabla_{\mathbf{x}} \log p_t(\mathbf{x})\big]\,dt + g(t)\,d\bar{\mathbf{w}}$$

The score $\nabla_{\mathbf{x}} \log p_t(\mathbf{x})$ is the only unknown, and it is exactly what the denoising network learns. This SDE framework allows adaptive step sizes, probability flow ODEs for deterministic sampling, and exact likelihood computation.
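To make the reverse SDE concrete, here is a toy Euler-Maruyama sampler for the variance-preserving SDE in one dimension. The data distribution is a single Gaussian, so the score of $p_t$ is available in closed form and no network is needed; the schedule constants are illustrative choices:

```python
import numpy as np

# Variance-preserving SDE: dx = -0.5 * beta(t) * x dt + sqrt(beta(t)) dw,
# with a linear beta(t). Constants below are illustrative.
BETA0, BETA1 = 0.1, 20.0

def beta(t):
    return BETA0 + (BETA1 - BETA0) * t

def alpha_bar(t):
    # exp(-integral_0^t beta(s) ds) for the linear schedule above
    return np.exp(-(BETA0 * t + 0.5 * (BETA1 - BETA0) * t ** 2))

MU, SIGMA0 = 3.0, 0.5  # toy data distribution: N(MU, SIGMA0^2)

def score(x, t):
    """Exact score of p_t when p_0 = N(MU, SIGMA0^2); p_t stays Gaussian."""
    ab = alpha_bar(t)
    mean = np.sqrt(ab) * MU
    var = ab * SIGMA0 ** 2 + (1.0 - ab)
    return -(x - mean) / var

def reverse_sde_sample(n, steps=1000, seed=0):
    """Euler-Maruyama integration of the reverse SDE from t = 1 down to t = 0."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(n)  # x_1 is approximately N(0, 1)
    dt = 1.0 / steps
    for i in range(steps, 0, -1):
        t = i * dt
        g2 = beta(t)  # g(t)^2 = beta(t) for the VP-SDE
        drift = -0.5 * g2 * x - g2 * score(x, t)  # f(x, t) - g(t)^2 * score
        x = x - drift * dt + np.sqrt(g2 * dt) * rng.standard_normal(n)
    return x

samples = reverse_sde_sample(5000)  # should concentrate near N(MU, SIGMA0^2)
```

Swapping the analytic `score` for a trained network turns this toy into a real diffusion sampler; that substitution is the entire content of score-based generative modeling.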

Classifier-Free Guidance

Conditional generation (text-to-image) uses classifier-free guidance: train the model with and without conditioning $c$, then at sampling time extrapolate:

$$\tilde{\boldsymbol{\epsilon}}_\theta(\mathbf{x}_t, c) = (1 + w)\,\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, c) - w\,\boldsymbol{\epsilon}_\theta(\mathbf{x}_t)$$

where $w > 0$ is the guidance scale. This amplifies the influence of the conditioning signal. Higher $w$ gives more faithful but less diverse samples.
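At sampling time the extrapolation is a one-liner. A sketch, assuming `eps_cond` and `eps_uncond` come from the same network called with and without the conditioning:

```python
import numpy as np

def cfg_combine(eps_cond, eps_uncond, w):
    """Classifier-free guidance: (1 + w) * eps_cond - w * eps_uncond.
    w = 0 recovers the purely conditional prediction."""
    return (1.0 + w) * eps_cond - w * eps_uncond

# Toy check: guidance pushes the combined prediction past eps_cond,
# away from the unconditional prediction.
eps_c = np.array([1.0, 2.0])
eps_u = np.array([0.0, 0.0])
guided = cfg_combine(eps_c, eps_u, w=7.5)  # -> array([ 8.5, 17. ])
```

In practice both predictions are obtained in one batched forward pass (conditional and null-conditioned inputs stacked together), so guidance roughly doubles the compute per sampling step.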

Flow Matching

Proposition

Flow Matching with Optimal Transport Paths

Statement

Instead of learning to reverse a stochastic noising process, flow matching learns a velocity field $\mathbf{v}_\theta(\mathbf{x}, t)$ that transports samples from noise to data along straight (optimal transport) paths. The training objective is:

$$\mathcal{L}_{\text{FM}} = \mathbb{E}_{t, \mathbf{x}_0, \boldsymbol{\epsilon}}\!\big[\|\mathbf{v}_\theta(\mathbf{x}_t, t) - (\mathbf{x}_0 - \boldsymbol{\epsilon})\|^2\big]$$

where $\mathbf{x}_t = (1 - t)\,\boldsymbol{\epsilon} + t\,\mathbf{x}_0$ is the linear interpolation.

Intuition

Diffusion takes curved, noisy paths from data to noise and back. Flow matching takes straight lines. Straighter paths mean fewer integration steps at generation time, which means faster sampling.

Why It Matters

Flow matching is emerging as a simpler, faster alternative to diffusion. It avoids the noise schedule design problem entirely. Stable Diffusion 3 and many recent models use flow matching instead of DDPM.
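The flow matching objective is as simple to implement as the DDPM loss. A minimal NumPy sketch of one loss evaluation, with a placeholder zero velocity model standing in for a real network (an illustrative assumption):

```python
import numpy as np

def flow_matching_loss(x0, v_model, rng):
    """One estimate of L_FM = E || v_theta(x_t, t) - (x0 - eps) ||^2,
    with straight paths x_t = (1 - t) * eps + t * x0."""
    eps = rng.standard_normal(x0.shape)
    t = rng.uniform()
    xt = (1.0 - t) * eps + t * x0
    target = x0 - eps  # the constant velocity along the straight path
    return np.mean((v_model(xt, t) - target) ** 2)

rng = np.random.default_rng(2)
x0 = rng.standard_normal((16, 8))
# placeholder "network": always predicts zero velocity
loss = flow_matching_loss(x0, v_model=lambda xt, t: np.zeros_like(xt), rng=rng)
```

Sampling is then plain ODE integration of the learned velocity from $t = 0$ (noise) to $t = 1$ (data), with no noise schedule to tune.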

Where This Shows Up in Current Papers
  • Image generation: Stable Diffusion, DALL-E 3, Imagen, Midjourney
  • Video generation: Sora (OpenAI), video world models
  • Audio synthesis: AudioLDM, diffusion-based music generation
  • Molecular design: diffusion for protein structure, drug molecules
  • 3D generation: DreamFusion, point cloud diffusion, and Gaussian splatting representations
  • Robotics: diffusion policies for action generation

Common Confusions

Watch Out

Diffusion is not just 'reverse noise'

The naive description, "add noise, then learn to remove it," obscures the actual mathematics. The model learns the score function (gradient of log-density) of the noisy distribution at each noise level. Denoising is a consequence, not the definition. The SDE framework makes this precise: sampling is solving a reverse-time SDE using the learned score.

Watch Out

More steps is not always better

DDPM originally used 1000 denoising steps. Modern samplers (DDIM, DPM-Solver, consistency models) reduce this to 4-50 steps with minimal quality loss. The number of steps is a sampler design choice, not a fundamental property of the model.
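As one example of how step count is a sampler choice, the deterministic DDIM update (the $\eta = 0$ case) jumps between arbitrary noise levels using only the trained noise predictor. A hedged sketch of a single update:

```python
import numpy as np

def ddim_step(xt, eps_hat, ab_t, ab_prev):
    """Deterministic DDIM (eta = 0) update from noise level ab_t to ab_prev.
    Reconstruct x0_hat from the noise prediction, then re-noise it to the
    target level; no fresh randomness is injected."""
    x0_hat = (xt - np.sqrt(1.0 - ab_t) * eps_hat) / np.sqrt(ab_t)
    return np.sqrt(ab_prev) * x0_hat + np.sqrt(1.0 - ab_prev) * eps_hat

# Consistency check: if eps_hat equals the exact noise used to build x_t,
# stepping all the way to ab_prev = 1 recovers x_0 exactly.
rng = np.random.default_rng(3)
x0 = rng.standard_normal(8)
eps = rng.standard_normal(8)
ab_t = 0.3
xt = np.sqrt(ab_t) * x0 + np.sqrt(1.0 - ab_t) * eps
x0_rec = ddim_step(xt, eps, ab_t, ab_prev=1.0)
```

Because each update is deterministic given the model output, DDIM can skip most of the 1000 training timesteps at sampling time, which is why 20-50 steps often suffice.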

Watch Out

Guidance scale is not a quality dial

Increasing classifier-free guidance scale $w$ makes outputs more aligned with the prompt but reduces diversity and introduces artifacts at high values. There is a sweet spot (typically $w \in [5, 15]$ for image models) beyond which quality degrades.

Summary

  • Forward process: gradually add Gaussian noise until data becomes pure noise
  • Reverse process: learn to denoise step by step (or equivalently, learn the score function)
  • Training reduces to predicting the noise added at each step
  • The SDE perspective unifies discrete (DDPM) and continuous formulations
  • Classifier-free guidance enables conditional generation by interpolating between conditional and unconditional predictions
  • Flow matching generalizes diffusion with straighter transport paths and simpler training

Exercises

ExerciseCore

Problem

Given $\bar{\alpha}_t = 0.01$ (i.e., 99% noise), write the expression for $\mathbf{x}_t$ in terms of $\mathbf{x}_0$ and $\boldsymbol{\epsilon}$. How much of the original signal remains?

ExerciseAdvanced

Problem

Show that the denoising score matching objective (predicting noise $\boldsymbol{\epsilon}$) and the score matching objective (learning $\nabla_{\mathbf{x}} \log p_t(\mathbf{x})$) are equivalent up to a constant and a rescaling. What is the exact relationship between $\boldsymbol{\epsilon}_\theta$ and $\mathbf{s}_\theta$?

ExerciseResearch

Problem

Flow matching uses straight interpolation paths $\mathbf{x}_t = (1 - t)\,\boldsymbol{\epsilon} + t\,\mathbf{x}_0$. What happens if you use curved paths instead (e.g., geodesics on a data manifold)? When might this matter?


References

Canonical:

  • Ho, Jain, Abbeel, "Denoising Diffusion Probabilistic Models" (NeurIPS 2020)
  • Song et al., "Score-Based Generative Modeling through SDEs" (ICLR 2021)

Current:

  • Lipman et al., "Flow Matching for Generative Modeling" (ICLR 2023)
  • Ho & Salimans, "Classifier-Free Diffusion Guidance" (NeurIPS Workshop 2022)
  • Goodfellow, Bengio, Courville, Deep Learning (2016), Chapters 14-20
  • Zhang et al., Dive into Deep Learning (2023), Chapters 14-17


Last reviewed: April 2026
