
Energy-Based Models

A unifying framework for generative modeling: assign low energy to likely configurations via E(x), define probability through the Boltzmann distribution, and train without computing the intractable partition function.


Why This Matters

Energy-based models provide the most general framework for specifying probability distributions over high-dimensional data. Every probabilistic model --- GANs, VAEs, diffusion models, Boltzmann machines --- can be viewed through the energy lens. Understanding EBMs gives you a unified language for generative modeling and reveals why certain training strategies work.

The recent explosion of diffusion models is an EBM story: the score function that diffusion models learn is exactly the gradient of the log-density of an energy-based model.

Mental Model

Think of the energy function E(x) as a landscape over the data space. Valleys (low energy) correspond to likely data configurations. Peaks (high energy) correspond to unlikely configurations. Training an EBM means sculpting this landscape so that the valleys align with the data distribution: real images get low energy, noise gets high energy.

The challenge: converting this energy landscape into a proper probability distribution requires normalizing over all possible configurations --- an intractable integral in high dimensions. The entire field of EBM training is about avoiding or approximating this normalization.

Formal Setup and Notation

Let x \in \mathcal{X} \subseteq \mathbb{R}^d be a data point. An energy-based model defines an energy function E_\theta: \mathcal{X} \to \mathbb{R} parameterized by \theta.

Definition

Energy Function

The energy function E_\theta(x) assigns a scalar energy to each configuration x. Lower energy means higher probability. There are no constraints on E_\theta --- it can be any function from \mathcal{X} to \mathbb{R}, including a neural network.

Definition

Boltzmann Distribution

The energy function induces a probability distribution via the Boltzmann distribution (also called the Gibbs distribution):

p_\theta(x) = \frac{\exp(-E_\theta(x))}{Z(\theta)}

where the partition function is:

Z(\theta) = \int_{\mathcal{X}} \exp(-E_\theta(x))\, dx

The partition function ensures p_\theta integrates to 1. Computing Z(\theta) requires integrating over all of \mathcal{X} --- this is intractable for high-dimensional x.
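In one dimension the integral is easy to approximate by quadrature, which makes the scaling problem concrete: a grid with n points per axis costs n^d points in d dimensions. A minimal sketch, assuming a toy double-well energy:

```python
import math

# Toy 1D energy: a double well with minima at x = -1 and x = +1.
def energy(x):
    return (x**2 - 1.0)**2

# Brute-force quadrature for Z. Feasible in 1D; the same grid in d
# dimensions would need n**d evaluations, which is the intractability.
def partition_function(lo=-4.0, hi=4.0, n=100_000):
    dx = (hi - lo) / n
    return sum(math.exp(-energy(lo + (i + 0.5) * dx)) * dx for i in range(n))

Z = partition_function()

def density(x):
    return math.exp(-energy(x)) / Z

# The wells get high density; the barrier and the tails get less.
assert density(1.0) > density(0.0) > density(3.0)
```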

Core Definitions

The score function of an EBM is the gradient of the log-density:

\nabla_x \log p_\theta(x) = -\nabla_x E_\theta(x)

The partition function Z(\theta) disappears because it does not depend on x. This observation is the foundation of score-based methods.
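This can be checked numerically: shifting the energy by a constant rescales Z(\theta) but leaves the score untouched. A small sketch using finite differences, with an assumed quadratic toy energy:

```python
# Score of an EBM via a central finite difference:
# d/dx log p(x) = -dE/dx, with Z contributing nothing.
def score(energy, x, h=1e-5):
    return -(energy(x + h) - energy(x - h)) / (2 * h)

E = lambda x: 0.5 * x**2              # Gaussian energy; true score is -x
E_shift = lambda x: 0.5 * x**2 + 7.0  # same model, Z multiplied by exp(-7)

for x in [-2.0, 0.3, 1.5]:
    assert abs(score(E, x) - (-x)) < 1e-4       # matches the analytic score
    assert abs(score(E, x) - score(E_shift, x)) < 1e-6  # constant drops out
```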

The free energy is F(\theta) = -\log Z(\theta). The log-likelihood of a single observation is:

\log p_\theta(x) = -E_\theta(x) - \log Z(\theta) = -E_\theta(x) + F(\theta)

Main Theorems

Theorem

MLE Gradient for Energy-Based Models

Statement

The gradient of the log-likelihood with respect to parameters \theta is:

\nabla_\theta \log p_\theta(x) = -\nabla_\theta E_\theta(x) + \mathbb{E}_{x' \sim p_\theta}[\nabla_\theta E_\theta(x')]

Equivalently, the gradient of the average log-likelihood over the data distribution p_{\text{data}} is:

\nabla_\theta \mathbb{E}_{p_{\text{data}}}[\log p_\theta(x)] = -\mathbb{E}_{p_{\text{data}}}[\nabla_\theta E_\theta(x)] + \mathbb{E}_{p_\theta}[\nabla_\theta E_\theta(x)]

Intuition

MLE training pushes down the energy of real data (first term) and pushes up the energy of model samples (second term). This is a "contrastive" update: make real data more likely by lowering its energy, and make model samples less likely by raising theirs. The challenge is computing the second expectation, which requires sampling from p_\theta --- itself an intractable distribution.
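For a toy model where everything is tractable, the two-term gradient can be verified directly. The sketch below assumes E_\theta(x) = (x - \theta)^2 / 2, whose Boltzmann distribution is N(\theta, 1), so exact Gaussian samples can stand in for MCMC:

```python
import random

# Two-term MLE gradient for E_theta(x) = (x - theta)**2 / 2.
# Here grad_theta E = -(x - theta), and p_theta = N(theta, 1).
random.seed(0)
theta = 0.5
data = [2.0, 1.0, 3.0]  # toy dataset with mean 2.0

grad_E = lambda x: -(x - theta)

# Positive phase: push down the energy of real data.
positive = -sum(grad_E(x) for x in data) / len(data)
# Negative phase: push up the energy of model samples (exact sampling here).
samples = [random.gauss(theta, 1.0) for _ in range(200_000)]
negative = sum(grad_E(x) for x in samples) / len(samples)

grad_loglik = positive + negative
# Closed form for this model: mean(data) - theta = 2.0 - 0.5 = 1.5
assert abs(grad_loglik - 1.5) < 0.02
```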

Proof Sketch

Differentiate \log p_\theta(x) = -E_\theta(x) - \log Z(\theta). The first term gives -\nabla_\theta E_\theta(x). For the second, note:

\nabla_\theta \log Z(\theta) = \frac{\nabla_\theta Z(\theta)}{Z(\theta)} = \frac{\int \exp(-E_\theta(x'))(-\nabla_\theta E_\theta(x'))\,dx'}{Z(\theta)} = -\mathbb{E}_{p_\theta}[\nabla_\theta E_\theta(x')]

Combining the two terms: \nabla_\theta \log p_\theta(x) = -\nabla_\theta E_\theta(x) + \mathbb{E}_{p_\theta}[\nabla_\theta E_\theta(x')].

Why It Matters

This gradient formula is the starting point for all EBM training methods. Every training algorithm for EBMs is a strategy for approximating the model expectation \mathbb{E}_{p_\theta}[\nabla_\theta E_\theta(x')], which requires sampling from p_\theta.

Failure Mode

If the MCMC sampler used to approximate the model expectation does not mix well, the gradient estimate is biased. The model may learn to place low energy only on the MCMC chain's trajectory rather than on the true data manifold, leading to poor generalization.

Training Methods

Contrastive Divergence (CD)

Since sampling from p_\theta is intractable, Hinton (2002) proposed contrastive divergence: initialize an MCMC chain at a data point and run it for only k steps (typically k = 1). Use the resulting sample \tilde{x}_k as an approximation to a sample from p_\theta.

The CD-k gradient is:

\nabla_\theta^{\text{CD-}k} = -\nabla_\theta E_\theta(x_{\text{data}}) + \nabla_\theta E_\theta(\tilde{x}_k)

CD is biased (the chain has not converged to p_\theta) but works surprisingly well in practice. It was the workhorse for training restricted Boltzmann machines and deep belief networks.
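A minimal CD-1 loop for a toy continuous model, using a single Langevin step as the MCMC transition (an assumption; the original RBM setting used Gibbs sampling). The model is E_\theta(x) = (x - \theta)^2 / 2, so \theta should drift toward the data mean:

```python
import math, random

# CD-1 training sketch for E_theta(x) = (x - theta)**2 / 2.
random.seed(1)
theta = 0.0
step, lr = 0.1, 0.02

def grad_x_E(x, theta):      # dE/dx, used by the Langevin step
    return x - theta

def grad_theta_E(x, theta):  # dE/dtheta, used by the parameter update
    return -(x - theta)

data = [1.8, 2.2, 2.0]       # toy dataset with mean 2.0
for _ in range(2000):
    x0 = random.choice(data)
    # k = 1 Langevin step started at the data point
    x1 = x0 - step * grad_x_E(x0, theta) + math.sqrt(2 * step) * random.gauss(0.0, 1.0)
    # ascend the CD-1 objective: lower data energy, raise sample energy
    theta += lr * (-grad_theta_E(x0, theta) + grad_theta_E(x1, theta))

# theta should have drifted near the data mean of 2.0
assert abs(theta - 2.0) < 0.6
```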

Score Matching

Theorem

Score Matching Avoids the Partition Function

Statement

The Fisher divergence between p_{\text{data}} and p_\theta:

D_F(p_{\text{data}} \| p_\theta) = \frac{1}{2}\mathbb{E}_{p_{\text{data}}}\left[\|\nabla_x \log p_{\text{data}}(x) - \nabla_x \log p_\theta(x)\|^2\right]

can be rewritten (via integration by parts) as:

D_F = \mathbb{E}_{p_{\text{data}}}\left[\frac{1}{2}\|\nabla_x E_\theta(x)\|^2 - \text{tr}(\nabla_x^2 E_\theta(x))\right] + \text{const}

where the constant does not depend on \theta. Note the minus sign on the trace term: the score is \nabla_x \log p_\theta = -\nabla_x E_\theta, so its Jacobian is the negative Hessian of the energy.

Intuition

Score matching trains the model by matching the gradient of the log-density (the score) rather than the density itself. Since the score \nabla_x \log p_\theta(x) = -\nabla_x E_\theta(x) does not involve Z(\theta), the partition function drops out entirely. You can train an EBM without ever computing or approximating Z(\theta).
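A 1D sanity check of the integrated-by-parts objective E[\frac{1}{2}(E')^2 - E'']: fitting the scale of an assumed Gaussian energy E(x) = x^2 / (2 s^2) to standard normal data by grid search, where dE/dx = x/s^2 and d^2E/dx^2 = 1/s^2:

```python
import random

# Score matching objective for E(x) = x**2 / (2 * s2) on N(0, 1) data:
# J(s2) = E[ 0.5 * (x / s2)**2 - 1 / s2 ], minimized at the true variance.
random.seed(2)
data = [random.gauss(0.0, 1.0) for _ in range(50_000)]

def sm_objective(s2):
    return sum(0.5 * (x / s2)**2 - 1.0 / s2 for x in data) / len(data)

grid = [0.25, 0.5, 0.75, 1.0, 1.5, 2.0, 4.0]
best = min(grid, key=sm_objective)
assert best == 1.0  # the true variance wins, with no Z in sight
```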

Connection to Diffusion Models

Diffusion models learn a score function s_\theta(x, t) \approx \nabla_x \log p_t(x), where p_t is the data distribution convolved with Gaussian noise of variance t. This is precisely a noise-conditional score matching objective applied to an EBM at each noise level.

The denoising score matching identity:

\mathbb{E}_{p_{\text{data}}(x_0)}\mathbb{E}_{p(x|x_0)}\left[\|s_\theta(x, t) - \nabla_x \log p(x|x_0)\|^2\right]

trains the model to point toward the clean data from the noisy version. This is equivalent to learning the energy gradient -\nabla_x E_\theta at multiple noise scales.
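For Gaussian corruption x = x_0 + \sigma\varepsilon, the regression target is available in closed form: \nabla_x \log p(x|x_0) = -(x - x_0)/\sigma^2, which points from the noisy point back toward the clean one. A small check of this identity against a finite difference of the Gaussian log-density:

```python
import math, random

# Denoising score matching target for Gaussian corruption.
random.seed(3)
sigma = 0.5
x0 = 1.0
x = x0 + sigma * random.gauss(0.0, 1.0)  # noisy observation

target = -(x - x0) / sigma**2  # closed-form conditional score

# log N(x; x0, sigma^2) and its numerical derivative in x
log_p = lambda z: -0.5 * ((z - x0) / sigma)**2 - math.log(sigma * math.sqrt(2 * math.pi))
h = 1e-5
fd = (log_p(x + h) - log_p(x - h)) / (2 * h)
assert abs(target - fd) < 1e-5
```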

Canonical Examples

Example

Restricted Boltzmann Machine

An RBM defines energy over visible units v and hidden units h: E(v, h) = -v^T W h - b^T v - c^T h. The marginal p(v) is obtained by summing out h, which is tractable because the hidden units are conditionally independent given v. RBMs were the first scalable EBMs, trained with CD.
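The tractable marginal can be verified on a tiny RBM: summing the Boltzmann weights over all hidden vectors factorizes into a product over hidden units, exp(b·v) \prod_j (1 + exp(c_j + v·W_{:,j})). Weights below are arbitrary toy values:

```python
import math

# Tiny RBM: 2 visible, 2 hidden units. Toy parameters (assumptions).
W = [[0.5, -0.2], [0.1, 0.3]]
b = [0.1, -0.1]
c = [0.2, 0.0]

def unnorm_p_v_bruteforce(v):
    # Sum exp(-E(v, h)) over all 2^2 binary hidden vectors.
    total = 0.0
    for h0 in (0, 1):
        for h1 in (0, 1):
            h = (h0, h1)
            interaction = sum(v[i] * W[i][j] * h[j] for i in range(2) for j in range(2))
            bias = sum(b[i] * v[i] for i in range(2)) + sum(c[j] * h[j] for j in range(2))
            total += math.exp(interaction + bias)  # -E = interaction + bias
    return total

def unnorm_p_v_factorized(v):
    # The sum over h factorizes: one (1 + exp(...)) factor per hidden unit.
    out = math.exp(sum(b[i] * v[i] for i in range(2)))
    for j in range(2):
        out *= 1.0 + math.exp(c[j] + sum(v[i] * W[i][j] for i in range(2)))
    return out

for v in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    assert abs(unnorm_p_v_bruteforce(v) - unnorm_p_v_factorized(v)) < 1e-9
```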

Example

Modern deep EBM

Parameterize E_\theta(x) as a deep convolutional network mapping images to a scalar energy. Train via MCMC-based contrastive learning or score matching. The resulting model assigns low energy to realistic images and high energy to noise, enabling generation via Langevin dynamics (gradient descent on E_\theta with added noise).
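A sketch of Langevin dynamics on an assumed 1D double-well energy: noisy gradient descent on E that, run long enough, settles into the low-energy valleys:

```python
import math, random

# Langevin sampling: x <- x - step * grad E(x) + sqrt(2 * step) * noise.
random.seed(4)

def grad_E(x):
    return 4 * x * (x**2 - 1.0)  # gradient of the double well (x^2 - 1)^2

step = 0.01
x = 3.0  # start far from both wells
for _ in range(5000):
    x = x - step * grad_E(x) + math.sqrt(2 * step) * random.gauss(0.0, 1.0)

# The chain should be wandering near one of the minima at +/- 1
assert abs(abs(x) - 1.0) < 1.0
```

With the noise term removed this is plain gradient descent, which would get stuck at the bottom of one well; the noise is what makes the chain a sampler for the Boltzmann distribution rather than an optimizer.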

Common Confusions

Watch Out

Low energy does not mean the point is a mode

A low-energy point always has high density, but the probability mass of a region depends on both energy and volume: a broad region of moderately low energy can carry more total mass than a narrow, deeper well, so the deepest well need not dominate samples. This is the energy-entropy tradeoff.
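The tradeoff is easy to see numerically in 1D with an assumed piecewise energy: the narrow well contains the single lowest-energy point, yet the broad shallow well carries more total mass:

```python
import math

# Narrow deep well vs broad shallow well.
def energy(x):
    if -0.05 < x < 0.05:
        return 0.0   # narrow well: lowest energy, width 0.1
    if 2.0 < x < 4.0:
        return 1.0   # broad well: higher energy, width 2.0
    return 10.0      # everywhere else: very high energy

def mass(lo, hi, n=20_000):
    # Unnormalized probability mass of [lo, hi] by midpoint quadrature.
    dx = (hi - lo) / n
    return sum(math.exp(-energy(lo + (i + 0.5) * dx)) * dx for i in range(n))

narrow = mass(-0.05, 0.05)  # about 0.1 * e^0  = 0.10
broad = mass(2.0, 4.0)      # about 2.0 * e^-1 = 0.74
assert narrow < broad  # most samples come from the broad, higher-energy well
```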

Watch Out

EBMs are not just Boltzmann machines

Boltzmann machines are a specific type of EBM with bilinear energy functions over binary variables. Modern EBMs use arbitrary neural network energy functions over continuous spaces. The framework is far more general than its historical association with Boltzmann machines suggests.

Summary

  • Energy function E_\theta(x): low energy = high probability
  • Boltzmann distribution: p_\theta(x) = \exp(-E_\theta(x))/Z(\theta)
  • Partition function Z(\theta) is intractable in high dimensions
  • MLE gradient = push down data energy, push up model-sample energy
  • Score function \nabla_x \log p_\theta = -\nabla_x E_\theta avoids Z(\theta)
  • Diffusion models are score-matching EBMs at multiple noise levels

Exercises

ExerciseCore

Problem

Show that the score function \nabla_x \log p_\theta(x) = -\nabla_x E_\theta(x) does not depend on the partition function Z(\theta).

ExerciseAdvanced

Problem

Derive the MLE gradient for an EBM. Specifically, show that \nabla_\theta \log Z(\theta) = -\mathbb{E}_{p_\theta}[\nabla_\theta E_\theta(x)].

ExerciseResearch

Problem

Contrastive divergence with k = 1 step is known to be a biased estimator of the MLE gradient. Explain intuitively why this bias exists and describe a scenario where it causes the model to fail.

References

Canonical:

  • LeCun, Chopra, Hadsell, Ranzato, Huang, A Tutorial on Energy-Based Learning (2006)
  • Hinton, Training Products of Experts by Minimizing Contrastive Divergence (2002)

Current:

  • Song & Ermon, Generative Modeling by Estimating Gradients of the Data Distribution (NeurIPS 2019)

  • Du & Mordatch, Implicit Generation and Modeling with Energy-Based Models (NeurIPS 2019)

  • Bishop, Pattern Recognition and Machine Learning (2006), Chapters 1-14

Next Topics

The natural next steps from energy-based models:

  • Diffusion models: score matching at multiple noise levels, the modern EBM
  • [Variational autoencoders](/topics/autoencoders): a different approach to intractable normalization via amortized inference

Last reviewed: April 2026
