
ML Methods

Score Matching

Hyvärinen 2005: train a model to estimate ∇ log p(x) without ever computing the normalization constant. Integration by parts converts the intractable density-matching loss into a tractable gradient-based objective. The training objective that powers every score-based diffusion model and every modern energy-based model.

Advanced · Tier 2 · Stable · ~50 min

Why This Matters

Most probabilistic models cannot be trained by maximum likelihood because the normalization constant $Z = \int e^{-E_\theta(x)}\,dx$ is intractable. Energy-based models, score-based diffusion, and most generative samplers all face this obstacle. Score matching (Hyvärinen 2005) is the workaround: instead of fitting the density $p_\theta(x)$ directly, fit the score $\nabla_x \log p(x)$. The score is invariant to the partition function, since $\nabla_x \log Z = 0$, so matching scores never requires knowing $Z$.

Hyvärinen's 2005 paper turned this idea into a tractable loss via integration by parts: the squared-distance objective between model and data scores reduces, up to a constant, to a quantity computable from samples and a single Hessian trace. Vincent (2011) replaced the Hessian trace with a denoising target, giving the denoising score matching loss that scales to high dimensions and underlies every modern diffusion model. Song and Ermon (2019) added multiple noise scales and turned the loss into the noise-conditional score network (NCSN); Song et al. (2021) wrapped the entire setup inside a forward noising SDE and used Anderson's time-reversal theorem to convert the learned score into a generative sampler.

The thread from Hyvärinen 2005 to Stable Diffusion is straight: every score-based generative model is doing denoising score matching at multiple noise levels, then plugging the learned score into a reverse SDE. If you understand score matching, you understand the training half of diffusion.

Mental Model

The score $s(x) := \nabla_x \log p(x)$ is a vector field on $\mathbb{R}^d$ that points from low-density regions toward high-density regions. It is the direction of steepest log-density increase and the gradient field that Langevin dynamics follows to sample from $p$.

Score matching trains a parametric vector field $s_\theta(x) \in \mathbb{R}^d$ to approximate this score field. The natural loss is the squared $L^2$ error between data and model scores, weighted by the data density:

$$J_{\text{ESM}}(\theta) = \tfrac{1}{2}\, \mathbb{E}_{x \sim p_{\text{data}}} \big[ \|s_\theta(x) - \nabla_x \log p_{\text{data}}(x)\|^2 \big].$$

This is the explicit score matching loss. It is intractable because we do not know $\nabla_x \log p_{\text{data}}$. Hyvärinen's trick is integration by parts: under mild boundary conditions, the explicit loss equals (up to a constant in $\theta$) a tractable implicit loss that only requires samples and the Jacobian of $s_\theta$.

Hyvärinen's Implicit Score Matching

Definition

Implicit Score Matching Loss

The implicit score matching loss for a score model $s_\theta : \mathbb{R}^d \to \mathbb{R}^d$ is

$$J_{\text{ISM}}(\theta) = \mathbb{E}_{x \sim p_{\text{data}}} \Big[ \tfrac{1}{2}\, \|s_\theta(x)\|^2 + \mathrm{tr}\!\left(\nabla_x s_\theta(x)\right) \Big],$$

where $\nabla_x s_\theta(x) \in \mathbb{R}^{d \times d}$ is the Jacobian of $s_\theta$ at $x$ and $\mathrm{tr}(\nabla_x s_\theta) = \sum_{i=1}^d \partial_{x_i} (s_\theta)_i$ is its trace (the divergence of the score field).

The implicit loss is computable from data samples alone. There is no reference to $\nabla_x \log p_{\text{data}}$ and no normalization constant.
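As a sanity check that the implicit loss is usable from samples alone, here is a small illustration (my own, not from the paper): a one-parameter linear score $s_a(x) = ax$ fit to draws from $\mathcal{N}(0,1)$, whose true score is $-x$. Minimizing the empirical $J_{\text{ISM}}$ over $a$ should recover $a \approx -1$.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(200_000)  # samples from p_data = N(0, 1); true score is -x

def ism_loss(a):
    """Empirical implicit score matching loss for the linear model s_a(x) = a*x.

    J_ISM = E[0.5 * ||s(x)||^2 + tr(grad s(x))]; here grad s = a, so the trace is a.
    Note: no access to log p_data is needed anywhere.
    """
    return np.mean(0.5 * (a * x) ** 2 + a)

# Scan a grid of slopes; the minimizer should sit near a = -1, the true score slope.
grid = np.linspace(-3.0, 1.0, 401)
best = grid[np.argmin([ism_loss(a) for a in grid])]
print(best)  # close to -1.0
```

In closed form $J_{\text{ISM}}(a) = \tfrac{1}{2}a^2\,\mathbb{E}[x^2] + a$, minimized at $a = -1/\mathbb{E}[x^2] \approx -1$, which is what the grid search finds.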

Theorem

Hyvärinen's Implicit Score Matching Theorem

Statement

Assume $p_{\text{data}}$ is smooth and positive, $s_\theta$ is differentiable, and $p_{\text{data}}(x)\, s_\theta(x) \to 0$ as $\|x\| \to \infty$. Then

$$\tfrac{1}{2}\, \mathbb{E}_{p_{\text{data}}}\big[\|s_\theta(x) - \nabla_x \log p_{\text{data}}(x)\|^2\big] = \mathbb{E}_{p_{\text{data}}}\Big[\tfrac{1}{2} \|s_\theta(x)\|^2 + \mathrm{tr}(\nabla_x s_\theta(x))\Big] + C(p_{\text{data}}),$$

where $C(p_{\text{data}}) = \tfrac{1}{2}\, \mathbb{E}_{p_{\text{data}}}\big[\|\nabla_x \log p_{\text{data}}\|^2\big]$ is half the data Fisher information, independent of $\theta$.

Intuition

The cross term $\mathbb{E}_{p_{\text{data}}}[s_\theta(x) \cdot \nabla_x \log p_{\text{data}}(x)]$ in the explicit loss is what we cannot evaluate. But $s_\theta \cdot \nabla \log p = s_\theta \cdot (\nabla p / p)$, so $\int s_\theta \cdot (\nabla p / p)\, p\, dx = \int s_\theta \cdot \nabla p\, dx$. Integration by parts converts $\int s_\theta \cdot \nabla p\, dx$ into $-\int (\nabla \cdot s_\theta)\, p\, dx = -\mathbb{E}_{p_{\text{data}}}[\mathrm{tr}(\nabla s_\theta)]$. The intractable density gradient becomes a tractable divergence of the model itself.

Proof Sketch

Expand the squared-distance loss: $\|s_\theta - \nabla \log p\|^2 = \|s_\theta\|^2 - 2\, s_\theta \cdot \nabla \log p + \|\nabla \log p\|^2$. The third term integrates to $C(p_{\text{data}})$, independent of $\theta$. For the cross term, use $\nabla \log p = \nabla p / p$ to write $\mathbb{E}_p[s_\theta \cdot \nabla \log p] = \int s_\theta \cdot \nabla p\, dx$. Apply integration by parts component-wise: $\int (s_\theta)_i\, \partial_i p\, dx = -\int p\, \partial_i (s_\theta)_i\, dx$ (boundary terms vanish by assumption). Summing over $i$ gives $-\mathbb{E}_p[\mathrm{tr}(\nabla s_\theta)]$. Substitute back into the expansion to obtain $J_{\text{ESM}} = J_{\text{ISM}} + C$.

Why It Matters

This single integration-by-parts identity lets you train a model of the score without ever knowing the data density, without normalization constants, and without samples from the model. You compute $\|s_\theta\|^2$ and the trace of the Jacobian on data samples and minimize their sum. For density estimation, sampling, and energy-based modeling, this is the foundational equivalence; every later method (denoising, sliced, multi-noise) is a way of estimating the same loss more efficiently. $J_{\text{ESM}}(\theta) = J_{\text{ISM}}(\theta) + C$, where $C$ does not depend on $\theta$: minimizing $J_{\text{ISM}}$ is equivalent to minimizing $J_{\text{ESM}}$.
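For a 1-D standard normal, both sides of the identity have closed forms, so the equivalence can be checked directly. A sketch of my own, assuming an affine model $s_{a,b}(x) = ax + b$:

```python
import numpy as np

# p_data = N(0, 1): true score is -x, and C = 0.5 * E[x^2] = 0.5.
# For the affine model s_{a,b}(x) = a*x + b, both losses are closed-form:
#   J_ESM = 0.5 * E[(a*x + b - (-x))^2] = 0.5 * ((a + 1)^2 + b^2)
#   J_ISM = E[0.5 * (a*x + b)^2 + a]    = 0.5 * (a^2 + b^2) + a
def j_esm(a, b):
    return 0.5 * ((a + 1.0) ** 2 + b ** 2)

def j_ism(a, b):
    return 0.5 * (a ** 2 + b ** 2) + a

# The gap J_ESM - J_ISM should be the same constant 0.5 for every (a, b).
rng = np.random.default_rng(1)
gaps = [j_esm(a, b) - j_ism(a, b) for a, b in rng.normal(size=(5, 2))]
print(gaps)
```

The gap is $\tfrac{1}{2}\big((a+1)^2 - a^2\big) - a = \tfrac{1}{2}$ for all $(a, b)$, exactly the $\theta$-independent constant $C$ the theorem predicts.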

Failure Mode

The Jacobian trace $\mathrm{tr}(\nabla_x s_\theta)$ requires $d$ backward passes to compute exactly (one per coordinate of $s_\theta$). For $d \sim 10^4$ to $10^6$ (images, video, audio), this is prohibitive. The implicit loss is theoretically clean but computationally infeasible in high dimensions. This is the gap that denoising score matching (Vincent 2011) and sliced score matching (Song et al. 2020) were invented to close.

Denoising Score Matching

Theorem

Vincent's Denoising Score Matching Identity

Statement

Define the denoising score matching loss

$$J_{\text{DSM}}(\theta) = \tfrac{1}{2}\, \mathbb{E}_{x \sim p_{\text{data}},\, \tilde{x} \sim q_\sigma(\cdot \mid x)} \big[ \|s_\theta(\tilde{x}) - \nabla_{\tilde{x}} \log q_\sigma(\tilde{x} \mid x)\|^2 \big].$$

Then $J_{\text{DSM}}(\theta) = J_{\text{ESM}}^{q_\sigma}(\theta) + C(q_\sigma)$, where $J_{\text{ESM}}^{q_\sigma}$ is the explicit score matching loss against the noisy marginal $q_\sigma(\tilde{x})$, and $C$ is independent of $\theta$. Minimizing $J_{\text{DSM}}$ trains $s_\theta$ to match the score of the noisy data distribution $q_\sigma$.

Intuition

For Gaussian noise $\tilde{x} = x + \sigma \epsilon$, the conditional score is $\nabla_{\tilde{x}} \log q_\sigma(\tilde{x} \mid x) = -(\tilde{x} - x)/\sigma^2$, which is just the negative noise direction scaled by $1/\sigma^2$. The DSM loss reduces to $\mathbb{E}\,\|s_\theta(\tilde{x}) + \epsilon/\sigma\|^2$ (up to constants), which is plain least-squares regression of the model output onto the noise that was added. The Jacobian trace disappears entirely; you only need a forward pass and a regression target.
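Both claims of the theorem can be seen numerically in a toy setting of my own devising: with a linear model, DSM is literal least squares, and its minimizer is the score of the noisy marginal $q_\sigma$, not of $p_{\text{data}}$.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 0.5
x = rng.standard_normal(500_000)   # x ~ p_data = N(0, 1)
eps = rng.standard_normal(500_000)
x_t = x + sigma * eps              # noisy sample; marginal q_sigma = N(0, 1 + sigma^2)

# DSM with the linear model s_a(x~) = a * x~ is ordinary least squares:
#   minimize E||a * x_t + eps / sigma||^2  ->  a* = -E[x_t * eps / sigma] / E[x_t^2]
a_star = -np.mean(x_t * eps / sigma) / np.mean(x_t ** 2)

# The score of the noisy marginal N(0, 1 + sigma^2) is -x~ / (1 + sigma^2), so DSM
# recovers slope -1 / (1 + sigma^2), not the clean-data score's slope -1.
print(a_star, -1.0 / (1.0 + sigma ** 2))
```

With $\sigma = 0.5$ the recovered slope is approximately $-0.8 = -1/1.25$, confirming that DSM targets $\nabla \log q_\sigma$ rather than $\nabla \log p_{\text{data}}$.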

Proof Sketch

Expand the explicit score matching loss against $q_\sigma(\tilde{x})$ and substitute $q_\sigma(\tilde{x}) = \mathbb{E}_{x \sim p_{\text{data}}}[q_\sigma(\tilde{x} \mid x)]$. The cross term $\mathbb{E}_{q_\sigma}[s_\theta \cdot \nabla \log q_\sigma]$ becomes $\mathbb{E}_{x, \tilde{x}}[s_\theta(\tilde{x}) \cdot \nabla_{\tilde{x}} \log q_\sigma(\tilde{x} \mid x)]$ by Bayes' rule and the identity $\nabla_{\tilde{x}} q_\sigma(\tilde{x}) = \mathbb{E}_x[\nabla_{\tilde{x}} q_\sigma(\tilde{x} \mid x)]$. The remainder is algebraic completion of the square.

Why It Matters

Denoising score matching scales to image, audio, and video models because its loss is a single regression target per sample: no Jacobians, no divergences, no second-order quantities. This is why diffusion training is architecturally identical to supervised regression: input is a noisy image, target is the noise vector, loss is MSE. Every pixel-space and latent-space diffusion model in production trains exactly this objective across many noise scales. Without Vincent's identity, score-based generative modeling at scale would not exist. The score-matching loss against the noisy marginal $q_\sigma$ equals (up to a $\theta$-independent constant) the regression loss between $s_\theta(\tilde{x})$ and $\nabla_{\tilde{x}} \log q_\sigma(\tilde{x} \mid x)$.

Failure Mode

DSM trains the score of the noisy density $q_\sigma$, not the original $p_{\text{data}}$. For small $\sigma$, the noisy score approximates the data score near the data, but the learned estimate is unreliable in low-density regions, and when the data concentrate on a low-dimensional manifold the data score is not even well-defined off it; the noisy density smooths this manifold structure away. Single-noise-level DSM therefore cannot generate clean samples; you need a schedule of noise levels (NCSN, Song and Ermon 2019; or a continuous SDE noising process, Song et al. 2021) and a score model conditioned on $\sigma$. The single-scale version performs poorly as a sampler in high dimensions.

Sliced Score Matching and Tweedie's Formula

Sliced score matching (Song et al. 2020) replaces the Jacobian trace $\mathrm{tr}(\nabla s_\theta)$ with the stochastic estimator $v^\top \nabla s_\theta(x)\, v$ for $v$ sampled from a fixed distribution (e.g., Rademacher). This costs one Jacobian-vector product per sample instead of $d$, restoring tractability without requiring noise. It is the natural choice when noise injection is undesirable (e.g., discrete data after relaxation).
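The estimator's unbiasedness rests on $\mathbb{E}[v v^\top] = I$, so $\mathbb{E}_v[v^\top J v] = \mathrm{tr}(J)$ for any fixed Jacobian $J$. A minimal sketch, using a random matrix as a stand-in for the score Jacobian at one input:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 50
J = rng.standard_normal((d, d))  # stand-in for the score Jacobian at one input x

# Sliced / Hutchinson estimator: E_v[v^T J v] = tr(J) when E[v v^T] = I.
# In a real model, each v^T (J v) term costs one Jacobian-vector product,
# versus d backward passes for the exact trace.
v = rng.choice([-1.0, 1.0], size=(100_000, d))   # Rademacher projection vectors
estimates = np.einsum("nd,de,ne->n", v, J, v)    # v^T (J v), one value per sample
print(estimates.mean(), np.trace(J))             # estimator mean ~ exact trace
```

The mean over many projections converges to the exact trace; in training, one or a few projections per data point suffice because the noise averages out across the batch.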

Tweedie's formula gives the posterior-mean interpretation of the score: for $\tilde{x} = x + \sigma \epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$,

$$\mathbb{E}[x \mid \tilde{x}] = \tilde{x} + \sigma^2\, \nabla_{\tilde{x}} \log q_\sigma(\tilde{x}).$$

A trained denoising score network $s_\theta(\tilde{x}, \sigma)$ is therefore equivalent to a posterior-mean denoiser via $\hat{x}(\tilde{x}, \sigma) = \tilde{x} + \sigma^2\, s_\theta(\tilde{x}, \sigma)$. This identity is why "diffusion" and "denoising" are interchangeable framings: both train the same network, parameterized differently. Karras et al. (2022) make this explicit in EDM: the network outputs a denoised estimate $D_\theta$, and the score is recovered as $s_\theta = (D_\theta - \tilde{x})/\sigma^2$.
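In the conjugate Gaussian case every quantity in Tweedie's formula is closed-form, so the identity can be verified exactly. A sketch under the assumption $x \sim \mathcal{N}(\mu_0, \tau^2)$ (my choice of prior, purely for illustration):

```python
import numpy as np

# Conjugate Gaussian case:
#   x ~ N(mu0, tau^2),  x~ = x + sigma*eps  =>  marginal q_sigma = N(mu0, tau^2 + sigma^2)
mu0, tau, sigma = 1.5, 2.0, 0.7
x_t = np.linspace(-4.0, 6.0, 201)  # grid of noisy observations

# Tweedie: E[x | x~] = x~ + sigma^2 * (score of the marginal at x~).
marginal_score = -(x_t - mu0) / (tau**2 + sigma**2)
tweedie = x_t + sigma**2 * marginal_score

# Classical Bayes posterior mean for conjugate Gaussians, derived independently.
posterior_mean = (tau**2 * x_t + sigma**2 * mu0) / (tau**2 + sigma**2)

print(np.max(np.abs(tweedie - posterior_mean)))  # agree to machine precision
```

The two expressions coincide algebraically: $\tilde{x} - \sigma^2(\tilde{x} - \mu_0)/(\tau^2 + \sigma^2) = (\tau^2 \tilde{x} + \sigma^2 \mu_0)/(\tau^2 + \sigma^2)$, the textbook shrinkage estimator.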

Worked Example: Gaussian Score Matching

Take $p_{\text{data}} = \mathcal{N}(\mu, \Sigma)$ and a linear score model $s_\theta(x) = A(x - b)$ for $A \in \mathbb{R}^{d \times d}$, $b \in \mathbb{R}^d$. The true score is $\nabla \log p(x) = -\Sigma^{-1}(x - \mu)$, so the optimum is $A^* = -\Sigma^{-1}$, $b^* = \mu$.

Compute the implicit score matching loss: $\tfrac{1}{2}\|A(x - b)\|^2 + \mathrm{tr}(A) = \tfrac{1}{2}(x - b)^\top A^\top A (x - b) + \mathrm{tr}(A)$. Taking the expectation under $p_{\text{data}}$ gives $\mathbb{E}\big[\tfrac{1}{2}(x - b)^\top A^\top A (x - b)\big] = \tfrac{1}{2}\mathrm{tr}(A^\top A \Sigma) + \tfrac{1}{2}(\mu - b)^\top A^\top A (\mu - b)$.

Setting the gradient in $b$ to zero gives $b^* = \mu$. Setting the gradient in $A$ to zero gives $A^* \Sigma + I = 0$, i.e., $A^* = -\Sigma^{-1}$. The implicit loss recovers the exact maximum-likelihood Gaussian fit with no direct access to the density. This is the cleanest demonstration of the Hyvärinen identity in closed form.
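The closed-form loss above can be coded directly and checked: the claimed optimum $(A^*, b^*) = (-\Sigma^{-1}, \mu)$ should beat any perturbation of it. A sketch with an arbitrary well-conditioned covariance of my choosing:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
M = rng.standard_normal((d, d))
Sigma = M @ M.T + d * np.eye(d)   # a well-conditioned covariance
mu = rng.standard_normal(d)

def ism_closed_form(A, b):
    """Closed-form ISM loss under N(mu, Sigma) for the linear model s(x) = A(x - b)."""
    quad = 0.5 * np.trace(A.T @ A @ Sigma) + 0.5 * (mu - b) @ A.T @ A @ (mu - b)
    return quad + np.trace(A)

A_star, b_star = -np.linalg.inv(Sigma), mu
base = ism_closed_form(A_star, b_star)

# Random perturbations of (A*, b*) should all increase the loss.
worse = [ism_closed_form(A_star + 0.1 * rng.standard_normal((d, d)),
                         b_star + 0.1 * rng.standard_normal(d)) for _ in range(5)]
print(base, min(worse))  # base is strictly smaller
```

Because the loss is quadratic in $A$ (with $\Sigma$ positive definite) and in $b$, the stationary point found by the gradient conditions is the unique global minimum, which is what the perturbation check confirms.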

Common Confusions

Watch Out

Score matching is NOT minimizing KL to the data distribution

Score matching minimizes the squared $L^2$ distance between score fields, not the KL divergence between distributions. The two objectives have different gradients and different optima in general. They agree at the global minimum (both reach zero only when $p_\theta = p_{\text{data}}$), but the geometry of the optimization landscape differs. In particular, score matching is consistent when the model is well-specified, but its finite-sample bias-variance tradeoff differs from MLE's.

Watch Out

The Jacobian trace is unrelated to the model output's norm

Beginners sometimes confuse $\mathrm{tr}(\nabla s_\theta)$ with $\|s_\theta\|^2$ or with the Jacobian's Frobenius norm. The trace is the divergence of the score field, $\sum_i \partial_{x_i} (s_\theta)_i$, which can be negative, zero, or positive independently of the score's magnitude. The implicit score matching loss has both terms, and the trace term is what makes the loss work; without it, the optimum would be $s_\theta \equiv 0$.

Watch Out

Denoising score matching does not train the data score directly

DSM trains $s_\theta(\tilde{x}) \approx \nabla_{\tilde{x}} \log q_\sigma(\tilde{x})$, the score of the noisy data density $q_\sigma$. As $\sigma \to 0$, $q_\sigma \to p_{\text{data}}$ and the DSM target approaches the data score, but for any fixed $\sigma > 0$ they are different. To learn the data score itself, you need a schedule of $\sigma$ values approaching zero, or a continuous-time SDE formulation (Song et al. 2021) where $\sigma(t)$ is part of the model.

Exercises

ExerciseCore

Problem

Derive the denoising score matching target for Gaussian noise: $\tilde{x} = x + \sigma \epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$ and $x \sim p_{\text{data}}$. Show that the DSM loss reduces, up to multiplicative and additive constants in $\theta$, to $\mathbb{E}_{x, \epsilon}\,\|\sigma\, s_\theta(\tilde{x}) + \epsilon\|^2$.

ExerciseAdvanced

Problem

Prove Tweedie's formula: for $\tilde{x} = x + \sigma \epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$ independent of $x \sim p_{\text{data}}$ and $q_\sigma(\tilde{x}) = \int q_\sigma(\tilde{x} \mid x)\, p_{\text{data}}(x)\, dx$,

$$\mathbb{E}[x \mid \tilde{x}] = \tilde{x} + \sigma^2\, \nabla_{\tilde{x}} \log q_\sigma(\tilde{x}).$$



Last reviewed: April 18, 2026
