
Energy-Based Models

A unifying framework for generative modeling: assign low energy to likely configurations via E(x), define probability through the Boltzmann distribution, and train without computing the intractable partition function.


Why This Matters

Energy-based models provide the most general framework for specifying probability distributions over high-dimensional data. Every probabilistic model --- GANs, VAEs, diffusion models, Boltzmann machines --- can be viewed through the energy lens. Understanding EBMs gives you a unified language for generative modeling and reveals why certain training strategies work.

The recent explosion of diffusion models is an EBM story: the score function that diffusion models learn is exactly the gradient of the log-density of an energy-based model.

Mental Model

Think of the energy function E(x) as a landscape over the data space. Valleys (low energy) correspond to likely data configurations. Peaks (high energy) correspond to unlikely configurations. Training an EBM means sculpting this landscape so that the valleys align with the data distribution: real images get low energy, noise gets high energy.

The challenge: converting this energy landscape into a proper probability distribution requires normalizing over all possible configurations --- an intractable integral in high dimensions. The entire field of EBM training is about avoiding or approximating this normalization.

Formal Setup and Notation

Let x \in \mathcal{X} \subseteq \mathbb{R}^d be a data point. An energy-based model defines an energy function E_\theta: \mathcal{X} \to \mathbb{R} parameterized by \theta.

Definition

Energy Function

The energy function E_\theta(x) assigns a scalar energy to each configuration x. Lower energy means higher probability. There are no constraints on E_\theta --- it can be any function from \mathcal{X} to \mathbb{R}, including a neural network.

Definition

Boltzmann Distribution

The energy function induces a probability distribution via the Boltzmann distribution (also called the Gibbs distribution):

p_\theta(x) = \frac{\exp(-E_\theta(x))}{Z(\theta)}

where the partition function is:

Z(\theta) = \int_{\mathcal{X}} \exp(-E_\theta(x))\, dx

The partition function ensures p_\theta integrates to 1. Computing Z(\theta) requires integrating over all of \mathcal{X} --- this is intractable for high-dimensional x.
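In one dimension the integral is easy to approximate by quadrature, which makes the scaling problem concrete: a grid with n points per axis costs n^d points in d dimensions. A minimal sketch, assuming a toy double-well energy:

```python
import math

# Toy 1D energy: a double well with minima at x = -1 and x = +1.
def energy(x):
    return (x**2 - 1.0)**2

# Brute-force quadrature for Z. Feasible in 1D; the same grid in d
# dimensions would need n**d evaluations, which is the intractability.
def partition_function(lo=-4.0, hi=4.0, n=100_000):
    dx = (hi - lo) / n
    return sum(math.exp(-energy(lo + (i + 0.5) * dx)) * dx for i in range(n))

Z = partition_function()

def density(x):
    return math.exp(-energy(x)) / Z

# The wells get high density; the barrier and the tails get less.
assert density(1.0) > density(0.0) > density(3.0)
```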

Core Definitions

The score function of an EBM is the gradient of the log-density:

\nabla_x \log p_\theta(x) = -\nabla_x E_\theta(x)

The partition function Z(\theta) disappears because it does not depend on x. This observation is the foundation of score-based methods.
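This can be checked numerically: shifting the energy by a constant rescales Z(\theta) but leaves the score untouched. A small sketch using finite differences, with an assumed quadratic toy energy:

```python
# Score of an EBM via a central finite difference:
# d/dx log p(x) = -dE/dx, with Z contributing nothing.
def score(energy, x, h=1e-5):
    return -(energy(x + h) - energy(x - h)) / (2 * h)

E = lambda x: 0.5 * x**2              # Gaussian energy; true score is -x
E_shift = lambda x: 0.5 * x**2 + 7.0  # same model, Z multiplied by exp(-7)

for x in [-2.0, 0.3, 1.5]:
    assert abs(score(E, x) - (-x)) < 1e-4       # matches the analytic score
    assert abs(score(E, x) - score(E_shift, x)) < 1e-6  # constant drops out
```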

The free energy is F(\theta) = -\log Z(\theta). The log-likelihood of a single observation is:

\log p_\theta(x) = -E_\theta(x) - \log Z(\theta) = -E_\theta(x) + F(\theta)

Main Theorems

Theorem

MLE Gradient for Energy-Based Models

Statement

The gradient of the log-likelihood with respect to parameters \theta is:

\nabla_\theta \log p_\theta(x) = -\nabla_\theta E_\theta(x) + \mathbb{E}_{x' \sim p_\theta}[\nabla_\theta E_\theta(x')]

Equivalently, the gradient of the average log-likelihood over the data distribution p_{\text{data}} is:

\nabla_\theta \mathbb{E}_{p_{\text{data}}}[\log p_\theta(x)] = -\mathbb{E}_{p_{\text{data}}}[\nabla_\theta E_\theta(x)] + \mathbb{E}_{p_\theta}[\nabla_\theta E_\theta(x)]

Intuition

MLE training pushes down the energy of real data (first term) and pushes up the energy of model samples (second term). This is a "contrastive" update: make real data more likely by lowering its energy, and make model samples less likely by raising theirs. The challenge is computing the second expectation, which requires sampling from p_\theta --- itself an intractable distribution.
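For a toy model where everything is tractable, the two-term gradient can be verified directly. The sketch below assumes E_\theta(x) = (x - \theta)^2 / 2, whose Boltzmann distribution is N(\theta, 1), so exact Gaussian samples can stand in for MCMC:

```python
import random

# Two-term MLE gradient for E_theta(x) = (x - theta)**2 / 2.
# Here grad_theta E = -(x - theta), and p_theta = N(theta, 1).
random.seed(0)
theta = 0.5
data = [2.0, 1.0, 3.0]  # toy dataset with mean 2.0

grad_E = lambda x: -(x - theta)

# Positive phase: push down the energy of real data.
positive = -sum(grad_E(x) for x in data) / len(data)
# Negative phase: push up the energy of model samples (exact sampling here).
samples = [random.gauss(theta, 1.0) for _ in range(200_000)]
negative = sum(grad_E(x) for x in samples) / len(samples)

grad_loglik = positive + negative
# Closed form for this model: mean(data) - theta = 2.0 - 0.5 = 1.5
assert abs(grad_loglik - 1.5) < 0.02
```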

Proof Sketch

Differentiate \log p_\theta(x) = -E_\theta(x) - \log Z(\theta). The first term gives -\nabla_\theta E_\theta(x). For the second, note:

\nabla_\theta \log Z(\theta) = \frac{\nabla_\theta Z(\theta)}{Z(\theta)} = \frac{\int \exp(-E_\theta(x'))(-\nabla_\theta E_\theta(x'))\,dx'}{Z(\theta)} = -\mathbb{E}_{p_\theta}[\nabla_\theta E_\theta(x')]

Combining the two terms: \nabla_\theta \log p_\theta(x) = -\nabla_\theta E_\theta(x) + \mathbb{E}_{p_\theta}[\nabla_\theta E_\theta(x')].

Why It Matters

This gradient formula is the starting point for all EBM training methods. Every training algorithm for EBMs is a strategy for approximating the model expectation \mathbb{E}_{p_\theta}[\nabla_\theta E_\theta(x')], which requires sampling from p_\theta.

Failure Mode

If the MCMC sampler used to approximate the model expectation does not mix well, the gradient estimate is biased. The model may learn to place low energy only on the MCMC chain's trajectory rather than on the true data manifold, leading to poor generalization.

Training Methods

Contrastive Divergence (CD)

Since sampling from p_\theta is intractable, Hinton (2002) proposed contrastive divergence: initialize an MCMC chain at a data point and run it for only k steps (typically k = 1). Use the resulting sample \tilde{x}_k as an approximation to a sample from p_\theta.

The CD-k gradient is:

\nabla_\theta^{\text{CD-}k} = -\nabla_\theta E_\theta(x_{\text{data}}) + \nabla_\theta E_\theta(\tilde{x}_k)

CD is biased (the chain has not converged to p_\theta) but works surprisingly well in practice. It was the workhorse for training restricted Boltzmann machines and deep belief networks.
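A minimal CD-1 loop for a toy continuous model, using a single Langevin step as the MCMC transition (an assumption; the original RBM setting used Gibbs sampling). The model is E_\theta(x) = (x - \theta)^2 / 2, so \theta should drift toward the data mean:

```python
import math, random

# CD-1 training sketch for E_theta(x) = (x - theta)**2 / 2.
random.seed(1)
theta = 0.0
step, lr = 0.1, 0.02

def grad_x_E(x, theta):      # dE/dx, used by the Langevin step
    return x - theta

def grad_theta_E(x, theta):  # dE/dtheta, used by the parameter update
    return -(x - theta)

data = [1.8, 2.2, 2.0]       # toy dataset with mean 2.0
for _ in range(2000):
    x0 = random.choice(data)
    # k = 1 Langevin step started at the data point
    x1 = x0 - step * grad_x_E(x0, theta) + math.sqrt(2 * step) * random.gauss(0.0, 1.0)
    # ascend the CD-1 objective: lower data energy, raise sample energy
    theta += lr * (-grad_theta_E(x0, theta) + grad_theta_E(x1, theta))

# theta should have drifted near the data mean of 2.0
assert abs(theta - 2.0) < 0.6
```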

Score Matching

Theorem

Score Matching Avoids the Partition Function

Statement

The Fisher divergence between p_{\text{data}} and p_\theta:

D_F(p_{\text{data}} \| p_\theta) = \frac{1}{2}\mathbb{E}_{p_{\text{data}}}\left[\|\nabla_x \log p_{\text{data}}(x) - \nabla_x \log p_\theta(x)\|^2\right]

can be rewritten (via integration by parts) as:

D_F = \mathbb{E}_{p_{\text{data}}}\left[\frac{1}{2}\|\nabla_x E_\theta(x)\|^2 - \text{tr}(\nabla_x^2 E_\theta(x))\right] + \text{const}

where the constant does not depend on \theta. Note the minus sign on the trace term: the score is \nabla_x \log p_\theta = -\nabla_x E_\theta, so its Jacobian is the negative Hessian of the energy.

Intuition

Score matching trains the model by matching the gradient of the log-density (the score) rather than the density itself. Since the score \nabla_x \log p_\theta(x) = -\nabla_x E_\theta(x) does not involve Z(\theta), the partition function drops out entirely. You can train an EBM without ever computing or approximating Z(\theta).
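A 1D sanity check of the integrated-by-parts objective E[\frac{1}{2}(E')^2 - E'']: fitting the scale of an assumed Gaussian energy E(x) = x^2 / (2 s^2) to standard normal data by grid search, where dE/dx = x/s^2 and d^2E/dx^2 = 1/s^2:

```python
import random

# Score matching objective for E(x) = x**2 / (2 * s2) on N(0, 1) data:
# J(s2) = E[ 0.5 * (x / s2)**2 - 1 / s2 ], minimized at the true variance.
random.seed(2)
data = [random.gauss(0.0, 1.0) for _ in range(50_000)]

def sm_objective(s2):
    return sum(0.5 * (x / s2)**2 - 1.0 / s2 for x in data) / len(data)

grid = [0.25, 0.5, 0.75, 1.0, 1.5, 2.0, 4.0]
best = min(grid, key=sm_objective)
assert best == 1.0  # the true variance wins, with no Z in sight
```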

Connection to Diffusion Models

Diffusion models learn a score function s_\theta(x, t) \approx \nabla_x \log p_t(x), where p_t is the data distribution convolved with Gaussian noise of variance t. This is precisely a noise-conditional score matching objective applied to an EBM at each noise level.

The denoising score matching identity:

\mathbb{E}_{p_{\text{data}}(x_0)}\mathbb{E}_{p(x|x_0)}\left[\|s_\theta(x, t) - \nabla_x \log p(x|x_0)\|^2\right]

trains the model to point toward the clean data from the noisy version. This is equivalent to learning the energy gradient -\nabla_x E_\theta at multiple noise scales.
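For Gaussian corruption x = x_0 + \sigma\varepsilon, the regression target is available in closed form: \nabla_x \log p(x|x_0) = -(x - x_0)/\sigma^2, which points from the noisy point back toward the clean one. A small check of this identity against a finite difference of the Gaussian log-density:

```python
import math, random

# Denoising score matching target for Gaussian corruption.
random.seed(3)
sigma = 0.5
x0 = 1.0
x = x0 + sigma * random.gauss(0.0, 1.0)  # noisy observation

target = -(x - x0) / sigma**2  # closed-form conditional score

# log N(x; x0, sigma^2) and its numerical derivative in x
log_p = lambda z: -0.5 * ((z - x0) / sigma)**2 - math.log(sigma * math.sqrt(2 * math.pi))
h = 1e-5
fd = (log_p(x + h) - log_p(x - h)) / (2 * h)
assert abs(target - fd) < 1e-5
```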

Canonical Examples

Example

Restricted Boltzmann Machine

An RBM defines energy over visible units v and hidden units h: E(v, h) = -v^T W h - b^T v - c^T h. The marginal p(v) is obtained by summing out h, which is tractable because the hidden units are conditionally independent given v. RBMs were the first scalable EBMs, trained with CD.
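The tractable marginal can be verified on a tiny RBM: summing the Boltzmann weights over all hidden vectors factorizes into a product over hidden units, exp(b·v) \prod_j (1 + exp(c_j + v·W_{:,j})). Weights below are arbitrary toy values:

```python
import math

# Tiny RBM: 2 visible, 2 hidden units. Toy parameters (assumptions).
W = [[0.5, -0.2], [0.1, 0.3]]
b = [0.1, -0.1]
c = [0.2, 0.0]

def unnorm_p_v_bruteforce(v):
    # Sum exp(-E(v, h)) over all 2^2 binary hidden vectors.
    total = 0.0
    for h0 in (0, 1):
        for h1 in (0, 1):
            h = (h0, h1)
            interaction = sum(v[i] * W[i][j] * h[j] for i in range(2) for j in range(2))
            bias = sum(b[i] * v[i] for i in range(2)) + sum(c[j] * h[j] for j in range(2))
            total += math.exp(interaction + bias)  # -E = interaction + bias
    return total

def unnorm_p_v_factorized(v):
    # The sum over h factorizes: one (1 + exp(...)) factor per hidden unit.
    out = math.exp(sum(b[i] * v[i] for i in range(2)))
    for j in range(2):
        out *= 1.0 + math.exp(c[j] + sum(v[i] * W[i][j] for i in range(2)))
    return out

for v in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    assert abs(unnorm_p_v_bruteforce(v) - unnorm_p_v_factorized(v)) < 1e-9
```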

Example

Modern deep EBM

Parameterize E_\theta(x) as a deep convolutional network mapping images to a scalar energy. Train via MCMC-based contrastive learning or score matching. The resulting model assigns low energy to realistic images and high energy to noise, enabling generation via Langevin dynamics (gradient descent on E_\theta with added noise).
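A sketch of Langevin dynamics on an assumed 1D double-well energy: noisy gradient descent on E that, run long enough, settles into the low-energy valleys:

```python
import math, random

# Langevin sampling: x <- x - step * grad E(x) + sqrt(2 * step) * noise.
random.seed(4)

def grad_E(x):
    return 4 * x * (x**2 - 1.0)  # gradient of the double well (x^2 - 1)^2

step = 0.01
x = 3.0  # start far from both wells
for _ in range(5000):
    x = x - step * grad_E(x) + math.sqrt(2 * step) * random.gauss(0.0, 1.0)

# The chain should be wandering near one of the minima at +/- 1
assert abs(abs(x) - 1.0) < 1.0
```

With the noise term removed this is plain gradient descent, which would get stuck at the bottom of one well; the noise is what makes the chain a sampler for the Boltzmann distribution rather than an optimizer.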

Common Confusions

Watch Out

Low energy does not mean the point is a mode

A low-energy point always has high density, but the probability mass of a region depends on both energy and volume: a broad region of moderately low energy can carry more total mass than a narrow, deeper well, so the deepest well need not dominate samples. This is the energy-entropy tradeoff.
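The tradeoff is easy to see numerically in 1D with an assumed piecewise energy: the narrow well contains the single lowest-energy point, yet the broad shallow well carries more total mass:

```python
import math

# Narrow deep well vs broad shallow well.
def energy(x):
    if -0.05 < x < 0.05:
        return 0.0   # narrow well: lowest energy, width 0.1
    if 2.0 < x < 4.0:
        return 1.0   # broad well: higher energy, width 2.0
    return 10.0      # everywhere else: very high energy

def mass(lo, hi, n=20_000):
    # Unnormalized probability mass of [lo, hi] by midpoint quadrature.
    dx = (hi - lo) / n
    return sum(math.exp(-energy(lo + (i + 0.5) * dx)) * dx for i in range(n))

narrow = mass(-0.05, 0.05)  # about 0.1 * e^0  = 0.10
broad = mass(2.0, 4.0)      # about 2.0 * e^-1 = 0.74
assert narrow < broad  # most samples come from the broad, higher-energy well
```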

Watch Out

EBMs are not just Boltzmann machines

Boltzmann machines are a specific type of EBM with bilinear energy functions over binary variables. Modern EBMs use arbitrary neural network energy functions over continuous spaces. The framework is far more general than its historical association with Boltzmann machines suggests.

Summary

  • Energy function E_\theta(x): low energy = high probability
  • Boltzmann distribution: p_\theta(x) = \exp(-E_\theta(x))/Z(\theta)
  • Partition function Z(\theta) is intractable in high dimensions
  • MLE gradient = push down data energy, push up model-sample energy
  • Score function \nabla_x \log p_\theta = -\nabla_x E_\theta avoids Z(\theta)
  • Diffusion models are score-matching EBMs at multiple noise levels

Exercises

ExerciseCore

Problem

Show that the score function \nabla_x \log p_\theta(x) = -\nabla_x E_\theta(x) does not depend on the partition function Z(\theta).

ExerciseAdvanced

Problem

Derive the MLE gradient for an EBM. Specifically, show that \nabla_\theta \log Z(\theta) = -\mathbb{E}_{p_\theta}[\nabla_\theta E_\theta(x)].

ExerciseResearch

Problem

Contrastive divergence with k = 1 step is known to be a biased estimator of the MLE gradient. Explain intuitively why this bias exists and describe a scenario where it causes the model to fail.

References

Canonical:

  • LeCun, Chopra, Hadsell, Ranzato, Huang, A Tutorial on Energy-Based Learning (2006)
  • Hinton, Training Products of Experts by Minimizing Contrastive Divergence (2002)

Current:

  • Song & Ermon, Generative Modeling by Estimating Gradients of the Data Distribution (NeurIPS 2019)

  • Du & Mordatch, Implicit Generation and Modeling with Energy-Based Models (NeurIPS 2019)

  • Bishop, Pattern Recognition and Machine Learning (2006), Chapters 1-14

Next Topics

The natural next steps from energy-based models:

  • Diffusion models: score matching at multiple noise levels, the modern EBM
  • [Variational autoencoders](/topics/autoencoders): a different approach to intractable normalization via amortized inference

Last reviewed: April 2026
