ML Methods
Score Matching
Hyvärinen 2005: train a model to estimate ∇ log p(x) without ever computing the normalization constant. Integration by parts converts the intractable density-matching loss into a tractable gradient-based objective. The training objective that powers every score-based diffusion model and every modern energy-based model.
Why This Matters
Most probabilistic models cannot be trained by maximum likelihood because the normalization constant is intractable. Energy-based models, score-based diffusion, and most generative samplers all face this obstacle. Score matching (Hyvärinen 2005) is the workaround: instead of fitting the density directly, fit the score $s(x) = \nabla_x \log p(x)$. The score is invariant to the partition function, since $\nabla_x \log\big(p(x)/Z\big) = \nabla_x \log p(x)$, so matching scores never requires knowing $Z$.
Hyvärinen's 2005 paper turned this idea into a tractable loss via integration by parts: the squared-distance objective between model and data scores reduces, up to a constant, to a quantity computable from samples and a single Hessian trace. Vincent (2011) replaced the Hessian trace with a denoising target, giving the denoising score matching loss that scales to high dimensions and underlies every modern diffusion model. Song and Ermon (2019) added multiple noise scales and turned the loss into the noise-conditional score network (NCSN); Song et al. (2021) wrapped the entire setup inside a forward noising SDE and used Anderson's time-reversal theorem to convert the learned score into a generative sampler.
The thread from Hyvärinen 2005 to Stable Diffusion is straight: every score-based generative model is doing denoising score matching at multiple noise levels, then plugging the learned score into a reverse SDE. If you understand score matching, you understand the training half of diffusion.
Mental Model
The score $\nabla_x \log p(x)$ is a vector field on $\mathbb{R}^d$ that points from low-density regions toward high-density regions. It is the direction of steepest log-density increase and the gradient field that Langevin dynamics follows to sample from $p$.
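To make the Langevin connection concrete, here is a minimal sketch (my own illustration, with an assumed step size and chain count): unadjusted Langevin dynamics driven by the known score of a standard 1-D Gaussian, which relaxes samples toward $p = \mathcal{N}(0, 1)$.

```python
import numpy as np

rng = np.random.default_rng(0)

def score(x):
    # Exact score of the standard Gaussian N(0, 1): d/dx log p(x) = -x.
    return -x

# Unadjusted Langevin dynamics: x <- x + (eps/2) * score(x) + sqrt(eps) * z.
eps, n_steps, n_chains = 0.01, 5_000, 2_000
x = 3.0 * rng.standard_normal(n_chains)   # start far from the target density

for _ in range(n_steps):
    x = x + 0.5 * eps * score(x) + np.sqrt(eps) * rng.standard_normal(n_chains)

# After many steps the chains are approximately N(0, 1) samples.
print(x.mean(), x.var())
```

Following the score field uphill in log-density, plus injected noise, is exactly how a learned score is turned into a sampler.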
Score matching trains a parametric vector field $s_\theta(x)$ to approximate this score field. The natural loss is the squared error between data and model scores, weighted by the data density:

$$J_{\mathrm{ESM}}(\theta) = \tfrac{1}{2}\,\mathbb{E}_{p(x)}\big[\lVert s_\theta(x) - \nabla_x \log p(x)\rVert^2\big].$$

This is the explicit score matching loss. It is intractable because we do not know $\nabla_x \log p(x)$. Hyvärinen's trick is integration by parts: under mild boundary conditions, the explicit loss equals (up to a constant in $\theta$) a tractable implicit loss that only requires samples and the Jacobian of $s_\theta$.
Hyvärinen's Implicit Score Matching
Implicit Score Matching Loss
The implicit score matching loss for a score model $s_\theta : \mathbb{R}^d \to \mathbb{R}^d$ is

$$J_{\mathrm{ISM}}(\theta) = \mathbb{E}_{p(x)}\Big[\tfrac{1}{2}\lVert s_\theta(x)\rVert^2 + \operatorname{tr}\big(\nabla_x s_\theta(x)\big)\Big],$$

where $\nabla_x s_\theta(x)$ is the Jacobian of $s_\theta$ at $x$ and $\operatorname{tr}(\nabla_x s_\theta(x))$ is its trace (the divergence of the score field).
The implicit loss is computable from data samples alone. There is no reference to $p(x)$ and no normalization constant.
Hyvärinen's Implicit Score Matching Theorem
Statement
Under the assumptions above,

$$J_{\mathrm{ESM}}(\theta) = J_{\mathrm{ISM}}(\theta) + C,$$

where $C = \tfrac{1}{2}\,\mathbb{E}_{p(x)}\big[\lVert \nabla_x \log p(x)\rVert^2\big]$, half the data Fisher information, is independent of $\theta$.
Intuition
The cross term $\mathbb{E}_{p(x)}\big[s_\theta(x)^\top \nabla_x \log p(x)\big]$ in the explicit loss is what we cannot evaluate. But $\nabla_x \log p(x) = \nabla_x p(x)/p(x)$, so $p(x)\,\nabla_x \log p(x) = \nabla_x p(x)$. Integration by parts converts $\int s_\theta(x)^\top \nabla_x p(x)\,dx$ into $-\int p(x)\,\operatorname{tr}\big(\nabla_x s_\theta(x)\big)\,dx$. The intractable density gradient becomes a tractable divergence of the model itself.
Proof Sketch
Expand the squared-distance loss: $J_{\mathrm{ESM}}(\theta) = \tfrac{1}{2}\mathbb{E}_p[\lVert s_\theta\rVert^2] - \mathbb{E}_p[s_\theta^\top \nabla_x \log p] + \tfrac{1}{2}\mathbb{E}_p[\lVert \nabla_x \log p\rVert^2]$. The third term integrates to $C$, independent of $\theta$. For the cross term, use $p\,\nabla_x \log p = \nabla_x p$ to write $\mathbb{E}_p[s_\theta^\top \nabla_x \log p] = \int s_\theta(x)^\top \nabla_x p(x)\,dx$. Apply integration by parts component-wise: $\int s_{\theta,i}(x)\,\partial_{x_i} p(x)\,dx = -\int p(x)\,\partial_{x_i} s_{\theta,i}(x)\,dx$ (boundary terms vanish by assumption). Summing over $i$ gives $\mathbb{E}_p[s_\theta^\top \nabla_x \log p] = -\mathbb{E}_p\big[\operatorname{tr}(\nabla_x s_\theta)\big]$. Substitute back into the expansion to obtain $J_{\mathrm{ESM}}(\theta) = J_{\mathrm{ISM}}(\theta) + C$.
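The identity can be checked numerically in a case where everything is available in closed form. The sketch below is my own illustration (the Gaussian data distribution and the linear model are arbitrary choices): it estimates both sides from the same samples, the explicit loss against the known score versus the implicit loss plus the constant $C$.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 1.5, 0.8
x = rng.normal(mu, sigma, size=1_000_000)   # samples from p = N(mu, sigma^2)

a, b = -0.7, 0.3                   # arbitrary linear score model s(x) = a*x + b
s = a * x + b
true_score = -(x - mu) / sigma**2  # known score of N(mu, sigma^2)

# Explicit loss (uses the true score, normally unavailable).
J_esm = 0.5 * np.mean((s - true_score) ** 2)
# Implicit loss (samples only: 0.5*s^2 plus the divergence, here ds/dx = a).
J_ism = np.mean(0.5 * s**2 + a)
# Theta-independent constant: half the data Fisher information.
C = 0.5 * np.mean(true_score**2)

print(J_esm, J_ism + C)            # agree up to Monte Carlo error
```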
Why It Matters
This single integration-by-parts identity lets you train a model of the score without ever knowing the data density, without normalization constants, and without samples from the model. You compute $\tfrac{1}{2}\lVert s_\theta(x)\rVert^2$ and the trace of the Jacobian on data samples and minimize their sum. For density estimation, sampling, and energy-based modeling, this is the foundational equivalence; every later method (denoising, sliced, multi-noise) is a way of estimating the same loss more efficiently. In symbols: $J_{\mathrm{ESM}}(\theta) = J_{\mathrm{ISM}}(\theta) + C$, where $C$ does not depend on $\theta$, so minimizing $J_{\mathrm{ISM}}$ is equivalent to minimizing $J_{\mathrm{ESM}}$.
Failure Mode
The Jacobian trace requires $d$ backward passes to compute exactly (one per coordinate of $x$). For high-dimensional data such as images, video, and audio, where $d$ runs from thousands to millions, this is prohibitive. The implicit loss is theoretically clean but computationally infeasible in high dimensions. This is the gap that denoising score matching (Vincent 2011) and sliced score matching (Song et al. 2020) were invented to close.
Denoising Score Matching
Vincent's Denoising Score Matching Identity
Statement
Define the denoising score matching loss

$$J_{\mathrm{DSM}}(\theta) = \mathbb{E}_{p(x)}\,\mathbb{E}_{q_\sigma(\tilde{x}\mid x)}\Big[\tfrac{1}{2}\big\lVert s_\theta(\tilde{x}) - \nabla_{\tilde{x}} \log q_\sigma(\tilde{x}\mid x)\big\rVert^2\Big].$$

Then $J_{\mathrm{DSM}}(\theta) = J_{\mathrm{ESM}}(\theta; q_\sigma) + C$, where $J_{\mathrm{ESM}}(\theta; q_\sigma)$ is the explicit score matching loss against the noisy marginal $q_\sigma(\tilde{x}) = \int q_\sigma(\tilde{x}\mid x)\,p(x)\,dx$, and $C$ is independent of $\theta$. Minimizing $J_{\mathrm{DSM}}$ trains $s_\theta$ to match the score of the noisy data distribution $q_\sigma$.
Intuition
For Gaussian noise $\tilde{x} = x + \sigma\varepsilon$ with $\varepsilon \sim \mathcal{N}(0, I)$, the conditional score is $\nabla_{\tilde{x}} \log q_\sigma(\tilde{x}\mid x) = -(\tilde{x} - x)/\sigma^2 = -\varepsilon/\sigma$, which is just the negative noise direction scaled by $1/\sigma$. The DSM loss reduces to $\mathbb{E}\big[\tfrac{1}{2}\lVert s_\theta(\tilde{x}) + \varepsilon/\sigma\rVert^2\big]$ (up to constants), which is plain least-squares regression of the model output onto the noise that was added. The Jacobian trace disappears entirely; you only need a forward pass and a regression target.
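A minimal sketch of this regression view, under an assumed toy setup (1-D Gaussian data and a linear score model, chosen so the answer is known in closed form): DSM becomes ordinary least squares onto $-\varepsilon/\sigma$, and the fit recovers the score of the noisy marginal $\mathcal{N}(\mu, \sigma_{\mathrm{data}}^2 + \sigma^2)$.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, s_data, sigma = 0.5, 1.0, 0.3
n = 500_000

x = rng.normal(mu, s_data, size=n)     # clean data
eps = rng.standard_normal(n)
x_tilde = x + sigma * eps              # noisy data

# DSM for s_theta(x~) = a*x~ + b is least-squares regression onto -eps/sigma.
A = np.stack([x_tilde, np.ones(n)], axis=1)
coef, *_ = np.linalg.lstsq(A, -eps / sigma, rcond=None)
a_hat, b_hat = coef

# The optimum is the score of the noisy marginal N(mu, s_data^2 + sigma^2).
var_noisy = s_data**2 + sigma**2
print(a_hat, -1.0 / var_noisy)         # slope  -> -1/(s_data^2 + sigma^2)
print(b_hat, mu / var_noisy)           # offset -> mu/(s_data^2 + sigma^2)
```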
Proof Sketch
Expand the explicit score matching loss against $q_\sigma$ and substitute $q_\sigma(\tilde{x}) = \int q_\sigma(\tilde{x}\mid x)\,p(x)\,dx$. The cross term becomes $\mathbb{E}_{p(x)\,q_\sigma(\tilde{x}\mid x)}\big[s_\theta(\tilde{x})^\top \nabla_{\tilde{x}} \log q_\sigma(\tilde{x}\mid x)\big]$ by Bayes' rule and the identity $q_\sigma(\tilde{x})\,\nabla_{\tilde{x}} \log q_\sigma(\tilde{x}) = \int p(x)\,\nabla_{\tilde{x}} q_\sigma(\tilde{x}\mid x)\,dx$. The remainder is algebraic completion of the square.
Why It Matters
Denoising score matching scales to image, audio, and video models because its loss is a single regression target per sample: no Jacobians, no divergences, no second-order quantities. This is why diffusion training is architecturally identical to supervised regression: input is a noisy image, target is the noise vector, loss is MSE. Every pixel-space and latent-space diffusion model in production trains exactly this objective across many noise scales. Without Vincent's identity, score-based generative modeling at scale would not exist. In symbols: the score-matching loss against the noisy marginal equals (up to a $\theta$-independent constant) the regression loss between $s_\theta(\tilde{x})$ and $\nabla_{\tilde{x}} \log q_\sigma(\tilde{x}\mid x)$.
Failure Mode
DSM trains the score of the noisy density $q_\sigma$, not the original $p$. For small $\sigma$, the noisy score approximates the data score, but only away from the data support, since the noisy density smooths out the manifold structure of the data. Single-noise-level DSM therefore cannot generate clean samples; you need a schedule of noise levels (NCSN, Song and Ermon 2019; or a continuous SDE noising process, Song et al. 2021) and a score model conditioned on $\sigma$. The single-scale version is provably a bad sampler in high dimensions.
Sliced Score Matching and Tweedie's Formula
Sliced score matching (Song et al. 2020) replaces the Jacobian trace with the stochastic estimator $\mathbb{E}_v\big[v^\top \nabla_x s_\theta(x)\,v\big]$ for $v$ sampled from a fixed distribution (e.g., Rademacher). This costs one Jacobian-vector product per sample instead of $d$, restoring tractability without requiring noise. It is the natural choice when noise injection is undesirable (e.g., discrete data after relaxation).
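The trace estimator is Hutchinson's identity, $\mathbb{E}_v[v^\top J v] = \operatorname{tr}(J)$ for any $v$ with $\mathbb{E}[v v^\top] = I$. A standalone check with a random matrix standing in for the score Jacobian (illustrative sketch, not the paper's code):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 50
J = rng.standard_normal((d, d))    # stand-in for the Jacobian of s_theta at one point

# Hutchinson estimator with Rademacher probes: E[v^T J v] = tr(J).
n_probes = 100_000
v = rng.choice([-1.0, 1.0], size=(n_probes, d))
est = np.mean(np.sum((v @ J) * v, axis=1))

print(est, np.trace(J))            # agree up to Monte Carlo error
```

In the network setting, $J v$ is a Jacobian-vector product computed by autodiff, so each probe costs one extra pass instead of $d$.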
Tweedie's formula gives the posterior-mean interpretation of the score: for $\tilde{x} = x + \sigma\varepsilon$ with $\varepsilon \sim \mathcal{N}(0, I)$,

$$\mathbb{E}[x \mid \tilde{x}] = \tilde{x} + \sigma^2\,\nabla_{\tilde{x}} \log q_\sigma(\tilde{x}).$$

A trained denoising score network is therefore equivalent to a posterior-mean denoiser via $\hat{x}(\tilde{x}) = \tilde{x} + \sigma^2\, s_\theta(\tilde{x})$. This identity is why "diffusion" and "denoising" are interchangeable framings: they both train the same network, parameterized differently. Karras et al. (2022) make this explicit in EDM: the network outputs a denoised estimate $D_\theta(\tilde{x}; \sigma)$, and the score is recovered as $s_\theta(\tilde{x}) = \big(D_\theta(\tilde{x}; \sigma) - \tilde{x}\big)/\sigma^2$.
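For a Gaussian prior the posterior mean is also available by conjugacy, so Tweedie's formula can be verified exactly (conjugate-Gaussian sketch; the specific numbers are arbitrary):

```python
import numpy as np

mu, s, sigma = 2.0, 1.5, 0.7      # prior x ~ N(mu, s^2), noisy x~ = x + sigma*eps
x_tilde = np.linspace(-5.0, 5.0, 101)

# Score of the noisy marginal q_sigma = N(mu, s^2 + sigma^2).
score_q = -(x_tilde - mu) / (s**2 + sigma**2)

# Tweedie: E[x | x~] = x~ + sigma^2 * (score of the noisy marginal).
tweedie = x_tilde + sigma**2 * score_q

# Conjugate-Gaussian posterior mean from Bayes' rule, for comparison.
posterior_mean = (s**2 * x_tilde + sigma**2 * mu) / (s**2 + sigma**2)

print(np.max(np.abs(tweedie - posterior_mean)))   # zero up to float rounding
```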
Worked Example: Gaussian Score Matching
Take $p(x) = \mathcal{N}(x; \mu, \sigma^2)$ in one dimension and a linear score model $s_\theta(x) = a x + b$ for $\theta = (a, b)$. The true score is $\nabla_x \log p(x) = -(x - \mu)/\sigma^2$, so the optimum is $a^\ast = -1/\sigma^2$, $b^\ast = \mu/\sigma^2$.
Compute the implicit score matching loss: $J(a, b) = \mathbb{E}_p\big[\tfrac{1}{2}(a x + b)^2 + a\big]$. Take expectation under $p$: $J(a, b) = \tfrac{1}{2}a^2(\mu^2 + \sigma^2) + a b \mu + \tfrac{1}{2}b^2 + a$.
Setting $\partial J/\partial b = a\mu + b = 0$ gives $b = -a\mu$. Setting $\partial J/\partial a = a(\mu^2 + \sigma^2) + b\mu + 1 = 0$ and substituting $b = -a\mu$ gives $a\sigma^2 + 1 = 0$, i.e., $a^\ast = -1/\sigma^2$ and $b^\ast = \mu/\sigma^2$. The implicit loss recovers the exact maximum-likelihood Gaussian fit, with no direct access to the density. This is the cleanest demonstration of the Hyvärinen identity in closed form.
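The same optimum falls out numerically. Because the implicit loss is quadratic in $(a, b)$, its stationarity conditions form a 2×2 linear system in the sample moments (my own check of the worked example; the specific $\mu, \sigma$ are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 1.0, 2.0
x = rng.normal(mu, sigma, size=1_000_000)

# J(a, b) = 0.5*a^2*E[x^2] + a*b*E[x] + 0.5*b^2 + a, so setting
# dJ/da = a*E[x^2] + b*E[x] + 1 = 0 and dJ/db = a*E[x] + b = 0
# gives a 2x2 linear system in the sample moments.
m1, m2 = x.mean(), np.mean(x**2)
A = np.array([[m2, m1],
              [m1, 1.0]])
a_hat, b_hat = np.linalg.solve(A, np.array([-1.0, 0.0]))

print(a_hat, -1.0 / sigma**2)     # a -> -1/sigma^2
print(b_hat, mu / sigma**2)       # b -> mu/sigma^2
```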
Common Confusions
Score matching is NOT minimizing KL to the data distribution
Score matching minimizes the squared distance between score fields, not the KL divergence between distributions. The two objectives have different gradients and different optima in general. They agree at the global minimum (both achieve zero only when the model matches the data distribution), but the geometry of the optimization landscape is different. In particular, score matching is consistent when the model is well-specified, but its finite-sample bias-variance tradeoff differs from that of MLE.
The Jacobian trace is unrelated to the model output's norm
Beginners sometimes confuse the trace term $\operatorname{tr}(\nabla_x s_\theta(x))$ with $\lVert s_\theta(x)\rVert^2$ or with the Jacobian's Frobenius norm. The trace is the divergence of the score field, $\nabla_x \cdot s_\theta(x) = \sum_i \partial_{x_i} s_{\theta,i}(x)$, which can be negative, zero, or positive independently of the score's magnitude. The implicit score matching loss has both terms, and the trace term is what makes the loss work; without it, the optimum would be $s_\theta \equiv 0$.
Denoising score matching does not train the data score directly
DSM trains $s_\theta$ toward $\nabla_{\tilde{x}} \log q_\sigma(\tilde{x})$, the score of the noisy data density $q_\sigma$. As $\sigma \to 0$, $q_\sigma \to p$ and the DSM target approaches the data score, but for any fixed $\sigma$ they are different. To learn the data score itself, you need a schedule of $\sigma$ values approaching zero, or a continuous-time SDE formulation (Song et al. 2021) where $\sigma(t)$ is part of the model.
Exercises
Problem
Derive the denoising score matching target for Gaussian noise: $\tilde{x} = x + \sigma\varepsilon$ with $\varepsilon \sim \mathcal{N}(0, I)$ and $q_\sigma(\tilde{x}\mid x) = \mathcal{N}(\tilde{x}; x, \sigma^2 I)$. Show that the DSM loss reduces, up to multiplicative and additive constants in $\theta$, to $\mathbb{E}\big[\lVert \sigma\, s_\theta(\tilde{x}) + \varepsilon \rVert^2\big]$.
Problem
Prove Tweedie's formula: for $\tilde{x} = x + \sigma\varepsilon$ with $\varepsilon \sim \mathcal{N}(0, I)$ independent of $x$ and $x \sim p$,

$$\mathbb{E}[x \mid \tilde{x}] = \tilde{x} + \sigma^2\,\nabla_{\tilde{x}} \log q_\sigma(\tilde{x}).$$
Next Topics
- Diffusion Models: the generative modeling framework built on multi-scale denoising score matching.
- Time Reversal of SDEs: Anderson's theorem that turns the learned score into a generative sampler.
- Langevin Dynamics: the SDE that the learned score field drives at sampling time.
- Fokker–Planck Equation: the PDE that governs the noisy marginals along the noising schedule.
- Stochastic Differential Equations: the continuous-time framework for the forward noising process.
Last reviewed: April 18, 2026
Prerequisites
Foundations this topic depends on.
- Stochastic Differential Equations (Layer 3)
- Brownian Motion (Layer 2)
- Measure-Theoretic Probability (Layer 0B)
- Martingale Theory (Layer 0B)
- Ito's Lemma (Layer 3)
- Stochastic Calculus for ML (Layer 3)
- Fokker–Planck Equation (Layer 3)
- PDE Fundamentals for Machine Learning (Layer 1)
- Fast Fourier Transform (Layer 1)
- Exponential Function Properties (Layer 0A)
- Eigenvalues and Eigenvectors (Layer 0A)
- Matrix Operations and Properties (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
- Functional Analysis Core (Layer 0B)
- Metric Spaces, Convergence, and Completeness (Layer 0A)
- Inner Product Spaces and Orthogonality (Layer 0A)
- Vectors, Matrices, and Linear Maps (Layer 0A)
Builds on This
- Probability Flow ODE (Layer 3)