Energy-Based Models
A unifying framework for generative modeling: assign low energy to likely configurations via E(x), define probability through the Boltzmann distribution, and train without computing the intractable partition function.
Why This Matters
Energy-based models provide the most general framework for specifying probability distributions over high-dimensional data. Every probabilistic model --- GANs, VAEs, diffusion models, Boltzmann machines --- can be viewed through the energy lens. Understanding EBMs gives you a unified language for generative modeling and reveals why certain training strategies work.
The recent explosion of diffusion models is an EBM story: the score function that diffusion models learn is exactly the gradient of the log-density of an energy-based model.
Mental Model
Think of the energy function as a landscape over the data space. Valleys (low energy) correspond to likely data configurations. Peaks (high energy) correspond to unlikely configurations. Training an EBM means sculpting this landscape so that the valleys align with the data distribution: real images get low energy, noise gets high energy.
The challenge: converting this energy landscape into a proper probability distribution requires normalizing over all possible configurations --- an intractable integral in high dimensions. The entire field of EBM training is about avoiding or approximating this normalization.
Formal Setup and Notation
Let $x \in \mathbb{R}^d$ be a data point. An energy-based model defines an energy function $E_\theta(x)$ parameterized by $\theta$.
Energy Function
The energy function assigns a scalar energy $E_\theta(x)$ to each configuration $x$. Lower energy means higher probability. There are no constraints on $E_\theta$ --- it can be any function from $\mathbb{R}^d$ to $\mathbb{R}$, including a neural network.
Boltzmann Distribution
The energy function induces a probability distribution via the Boltzmann distribution (also called the Gibbs distribution):

$$p_\theta(x) = \frac{\exp(-E_\theta(x))}{Z_\theta}$$

where the partition function is:

$$Z_\theta = \int \exp(-E_\theta(x)) \, dx$$

The partition function ensures $p_\theta$ integrates to 1. Computing $Z_\theta$ requires integrating over all of $\mathbb{R}^d$ --- this is intractable for high-dimensional $x$.
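In one dimension the intractable integral is easy, which makes for a useful sanity check. A minimal sketch (not from the source) using a quadratic energy $E(x) = x^2/2$, whose Boltzmann distribution is the standard Gaussian with $Z = \sqrt{2\pi}$:

```python
import numpy as np

# Quadratic energy E(x) = x^2 / 2; its Boltzmann distribution is the
# standard Gaussian, so the exact partition function is sqrt(2*pi).
def energy(x):
    return 0.5 * x**2

xs = np.linspace(-10.0, 10.0, 100_001)
dx = xs[1] - xs[0]

Z = np.sum(np.exp(-energy(xs))) * dx   # quadrature estimate of Z
p = np.exp(-energy(xs)) / Z            # normalized density on the grid

print(Z)                               # ~2.5066 = sqrt(2*pi)
print(np.sum(p) * dx)                  # ~1.0
```

In high dimensions this grid would need exponentially many points, which is exactly why $Z_\theta$ is intractable there.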
Core Definitions
The score function of an EBM is the gradient of the log-density:

$$\nabla_x \log p_\theta(x) = -\nabla_x E_\theta(x) - \nabla_x \log Z_\theta = -\nabla_x E_\theta(x)$$

The partition function disappears because it does not depend on $x$. This observation is the foundation of score-based methods.
The free energy is $-\log Z_\theta$. The log-likelihood of a single observation $x$ is:

$$\log p_\theta(x) = -E_\theta(x) - \log Z_\theta$$
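The cancellation of $\nabla_x \log Z_\theta$ can be verified numerically. A sketch (assumed example, not from the source) with a double-well energy $E(x) = x^4/4 - x^2/2$: the finite-difference derivative of $\log p_\theta$, which uses a quadrature estimate of $Z$, matches $-E'(x)$, which never touches $Z$:

```python
import numpy as np

def energy(x):
    return 0.25 * x**4 - 0.5 * x**2   # double-well energy (assumed example)

def score(x):
    return -(x**3 - x)                # -E'(x): no partition function needed

# log p(x) = -E(x) - log Z, with Z estimated by quadrature
xs = np.linspace(-6.0, 6.0, 200_001)
dx = xs[1] - xs[0]
Z = np.sum(np.exp(-energy(xs))) * dx

def log_p(x):
    return -energy(x) - np.log(Z)

# finite-difference derivative of log p: the log Z term is constant in x
h = 1e-5
x0 = 0.7
fd = (log_p(x0 + h) - log_p(x0 - h)) / (2 * h)
print(fd, score(x0))   # agree: the score never depends on Z
```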
Main Theorems
MLE Gradient for Energy-Based Models
Statement
The gradient of the log-likelihood with respect to parameters $\theta$ is:

$$\nabla_\theta \log p_\theta(x) = -\nabla_\theta E_\theta(x) + \mathbb{E}_{x' \sim p_\theta}\left[\nabla_\theta E_\theta(x')\right]$$

Equivalently, the gradient of the average log-likelihood over the data distribution is:

$$\nabla_\theta \, \mathbb{E}_{x \sim p_{\text{data}}}\left[\log p_\theta(x)\right] = -\mathbb{E}_{x \sim p_{\text{data}}}\left[\nabla_\theta E_\theta(x)\right] + \mathbb{E}_{x \sim p_\theta}\left[\nabla_\theta E_\theta(x)\right]$$
Intuition
MLE training pushes down the energy of real data (first term) and pushes up the energy of model samples (second term). This is a "contrastive" update: make real data more likely by making it lower energy, and make model samples less likely by making them higher energy. The challenge is computing the second expectation, which requires sampling from $p_\theta$ --- itself an intractable distribution.
Proof Sketch
Differentiate $\log p_\theta(x) = -E_\theta(x) - \log Z_\theta$. The first term gives $-\nabla_\theta E_\theta(x)$. For the second, note:

$$\nabla_\theta \log Z_\theta = \frac{1}{Z_\theta} \int \nabla_\theta \exp(-E_\theta(x')) \, dx' = -\int \frac{\exp(-E_\theta(x'))}{Z_\theta} \, \nabla_\theta E_\theta(x') \, dx' = -\mathbb{E}_{x' \sim p_\theta}\left[\nabla_\theta E_\theta(x')\right]$$

Combining: $\nabla_\theta \log p_\theta(x) = -\nabla_\theta E_\theta(x) + \mathbb{E}_{x' \sim p_\theta}\left[\nabla_\theta E_\theta(x')\right]$.
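The two-term gradient can be checked by Monte Carlo on a toy model (not from the source). Assume the quadratic energy $E_\theta(x) = (x - \theta)^2 / 2$, so $p_\theta = \mathcal{N}(\theta, 1)$ can be sampled exactly and $\nabla_\theta E_\theta(x) = \theta - x$:

```python
import numpy as np

rng = np.random.default_rng(0)
theta = 0.5                              # current parameter
data = rng.normal(2.0, 1.0, 50_000)      # "data" drawn from N(2, 1)

# grad_theta of E_theta(x) = (x - theta)^2 / 2 is (theta - x)
def grad_E(x, theta):
    return theta - x

# model samples: for this energy, p_theta is exactly N(theta, 1)
model_samples = rng.normal(theta, 1.0, 50_000)

# MLE gradient: -E_data[grad_E] + E_model[grad_E]
grad = -grad_E(data, theta).mean() + grad_E(model_samples, theta).mean()
print(grad)   # ~1.5 = (data mean) - theta, the familiar Gaussian MLE gradient
```

The negative phase vanishes here because the model expectation of $\theta - x$ is zero; in a real EBM that term must be estimated by MCMC.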
Why It Matters
This gradient formula is the starting point for all EBM training methods. Every training algorithm for EBMs is a strategy for approximating the model expectation $\mathbb{E}_{x \sim p_\theta}\left[\nabla_\theta E_\theta(x)\right]$, which requires sampling from $p_\theta$.
Failure Mode
If the MCMC sampler used to approximate the model expectation does not mix well, the gradient estimate is biased. The model may learn to place low energy only on the MCMC chain's trajectory rather than on the true data manifold, leading to poor generalization.
Training Methods
Contrastive Divergence (CD)
Since sampling from $p_\theta$ is intractable, Hinton (2002) proposed contrastive divergence: initialize an MCMC chain at a data point and run it for only $k$ steps (typically $k = 1$). Use the resulting sample as an approximation to a sample from $p_\theta$.

The CD-$k$ gradient is:

$$\nabla_\theta \log p_\theta(x) \approx -\nabla_\theta E_\theta(x) + \nabla_\theta E_\theta(\tilde{x}_k)$$

where $\tilde{x}_k$ is the state of the MCMC chain after $k$ steps started at $x$.
CD is biased (the chain has not converged to ) but works surprisingly well in practice. It was the workhorse for training restricted Boltzmann machines and deep belief networks.
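A minimal CD-1 loop on the toy quadratic energy $E_\theta(x) = (x - \theta)^2/2$, using a single Langevin step as the MCMC kernel (the kernel choice, step sizes, and iteration counts here are illustrative assumptions, not prescriptions from the source):

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(2.0, 1.0, 10_000)   # target distribution: N(2, 1)
theta = 0.0                            # model energy E(x) = (x - theta)^2 / 2

lr, step = 0.1, 0.5                    # illustrative hyperparameters
for _ in range(500):
    x = rng.choice(data, 256)          # positive phase: a batch of real data
    # negative phase: ONE Langevin step, initialized at the data (CD-1)
    x_neg = x - step * (x - theta) + np.sqrt(2 * step) * rng.normal(size=x.size)
    # grad log-lik ~ -grad_E(data) + grad_E(negatives), grad_E = theta - x
    grad = -(theta - x).mean() + (theta - x_neg).mean()
    theta += lr * grad                 # gradient ASCENT on log-likelihood
print(theta)   # converges near the data mean, 2.0
```

Even one MCMC step per update suffices here because the chain starts at data; that data-initialization is precisely the source of CD's bias.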
Score Matching
Score Matching Avoids the Partition Function
Statement
The Fisher divergence between $p_{\text{data}}$ and $p_\theta$:

$$D_F(p_{\text{data}} \,\|\, p_\theta) = \frac{1}{2} \, \mathbb{E}_{x \sim p_{\text{data}}}\left[\left\| \nabla_x \log p_{\text{data}}(x) - \nabla_x \log p_\theta(x) \right\|^2\right]$$

can be rewritten (via integration by parts) as:

$$\mathbb{E}_{x \sim p_{\text{data}}}\left[\operatorname{tr}\left(\nabla_x^2 \log p_\theta(x)\right) + \frac{1}{2}\left\| \nabla_x \log p_\theta(x) \right\|^2\right] + C$$

where the constant $C$ does not depend on $\theta$.
Intuition
Score matching trains the model by matching the gradient of the log-density (the score) rather than the density itself. Since the score $\nabla_x \log p_\theta(x) = -\nabla_x E_\theta(x)$ does not involve $Z_\theta$, the partition function drops out entirely. You can train an EBM without ever computing or approximating $Z_\theta$.
Connection to Diffusion Models
Diffusion models learn a score function $s_\theta(x, \sigma) \approx \nabla_x \log p_\sigma(x)$, where $p_\sigma$ is the data distribution convolved with Gaussian noise of variance $\sigma^2$. This is precisely a noise-conditional score matching objective applied to an EBM at each noise level.

The denoising score matching identity:

$$\mathbb{E}_{x \sim p_{\text{data}}, \; \tilde{x} \sim \mathcal{N}(x, \sigma^2 I)}\left[\left\| s_\theta(\tilde{x}, \sigma) - \frac{x - \tilde{x}}{\sigma^2} \right\|^2\right]$$

trains the model to point toward the clean data $x$ from the noisy version $\tilde{x}$. This is equivalent to learning the energy gradient at multiple noise scales.
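For Gaussian data the DSM minimizer has a closed form that a least-squares fit recovers. A sketch under assumed toy conditions: clean data $x \sim \mathcal{N}(0, 1)$ and a linear score model $s(\tilde{x}) = a\tilde{x}$, in which case $p_\sigma = \mathcal{N}(0, 1 + \sigma^2)$ and the true score slope is $-1/(1 + \sigma^2)$:

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 0.5
x = rng.normal(0.0, 1.0, 200_000)               # clean data ~ N(0, 1)
x_tilde = x + sigma * rng.normal(size=x.size)   # noisy data ~ N(0, 1 + sigma^2)

# DSM regression target: (x - x_tilde) / sigma^2, pointing noisy -> clean
target = (x - x_tilde) / sigma**2

# Least-squares fit of the linear score model s(x_tilde) = a * x_tilde
a = (x_tilde @ target) / (x_tilde @ x_tilde)
print(a, -1.0 / (1.0 + sigma**2))   # fitted slope matches the true score slope
```

The fitted $a$ approximates $-1/(1+\sigma^2) = -0.8$: regressing toward clean data recovers the score of the *noised* distribution, exactly as the identity claims.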
Canonical Examples
Restricted Boltzmann Machine
An RBM defines energy over visible units $v$ and hidden units $h$: $E(v, h) = -v^\top W h - b^\top v - c^\top h$. The marginal $p(v)$ is obtained by summing out $h$, which is tractable because the hidden units are conditionally independent given $v$. RBMs were the first scalable EBMs, trained with CD.
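The bilinear energy makes both conditionals factorize into independent sigmoids, which is what makes block Gibbs sampling (and hence CD) cheap. A sketch of one Gibbs sweep for a small binary RBM; the sizes and random weights are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
n_v, n_h = 6, 4
W = rng.normal(0.0, 0.1, (n_v, n_h))   # weights (illustrative values)
b = np.zeros(n_v)                       # visible biases
c = np.zeros(n_h)                       # hidden biases

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

v = rng.integers(0, 2, n_v).astype(float)   # random binary visible state

# p(h_j = 1 | v) = sigmoid(c_j + v @ W[:, j]) -- factorizes over j
p_h = sigmoid(c + v @ W)
h = (rng.random(n_h) < p_h).astype(float)

# p(v_i = 1 | h) = sigmoid(b_i + W[i, :] @ h) -- factorizes over i
p_v = sigmoid(b + W @ h)
v_new = (rng.random(n_v) < p_v).astype(float)
print(v_new)
```

Alternating these two block updates is the MCMC chain that CD truncates after $k$ steps.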
Modern deep EBM
Parameterize $E_\theta(x)$ as a deep convolutional network mapping images to a scalar energy. Train via MCMC-based contrastive learning or score matching. The resulting model assigns low energy to realistic images and high energy to noise, enabling generation via Langevin dynamics (gradient descent on $E_\theta$ with added noise).
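Langevin generation itself is only a few lines: repeated gradient steps on the energy with injected Gaussian noise. A sketch (not from the source) using an analytic quadratic energy, so the samples should settle into $\mathcal{N}(0, 1)$; step size and iteration count are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def grad_E(x):
    return x   # E(x) = x^2 / 2, so the target distribution is N(0, 1)

eps = 0.01                              # Langevin step size (illustrative)
x = rng.normal(0.0, 3.0, 5_000)         # initialize far from the model dist.
for _ in range(2_000):
    # x <- x - (eps/2) * grad E(x) + sqrt(eps) * noise
    x = x - 0.5 * eps * grad_E(x) + np.sqrt(eps) * rng.normal(size=x.size)
print(x.mean(), x.std())   # approach 0.0 and 1.0
```

With a neural-network energy, `grad_E` would come from automatic differentiation; the loop is otherwise unchanged.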
Common Confusions
Low energy does not mean the point is a mode
A point can have low energy but low probability if it sits in a region where many other points also have low energy (high partition function contribution from that region). Probability depends on both energy and volume. This is the energy-entropy tradeoff.
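The volume effect is easy to exhibit by quadrature in one dimension (both wells below are illustrative constructions, not from the source): a deep but narrow energy well can hold less probability mass than a shallower, wider one.

```python
import numpy as np

xs = np.linspace(-10.0, 10.0, 400_001)
dx = xs[1] - xs[0]

# Narrow, DEEP well at x = -3 (minimum energy -4);
# wide, SHALLOW well at x = +3 (minimum energy -2).
E = np.minimum(50.0 * (xs + 3) ** 2 - 4.0, 0.5 * (xs - 3) ** 2 - 2.0)
p = np.exp(-E)
p /= p.sum() * dx                      # normalize on the grid

mass_deep = p[xs < 0].sum() * dx       # mass near the deep narrow well
mass_wide = p[xs >= 0].sum() * dx      # mass near the shallow wide well
print(mass_deep, mass_wide)            # the shallow-but-wide well holds more
```

The deep well's lower energy loses to the wide well's larger volume: probability is energy integrated over volume, not energy alone.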
EBMs are not just Boltzmann machines
Boltzmann machines are a specific type of EBM with bilinear energy functions over binary variables. Modern EBMs use arbitrary neural network energy functions over continuous spaces. The framework is far more general than its historical association with Boltzmann machines suggests.
Summary
- Energy function $E_\theta(x)$: low energy = high probability
- Boltzmann distribution: $p_\theta(x) = \exp(-E_\theta(x)) / Z_\theta$
- Partition function $Z_\theta$ is intractable in high dimensions
- MLE gradient = push down data energy, push up model-sample energy
- Score function $\nabla_x \log p_\theta(x) = -\nabla_x E_\theta(x)$ avoids $Z_\theta$
- Diffusion models are score-matching EBMs at multiple noise levels
Exercises
Problem
Show that the score function $\nabla_x \log p_\theta(x)$ does not depend on the partition function $Z_\theta$.
Problem
Derive the MLE gradient for an EBM. Specifically, show that $\nabla_\theta \log p_\theta(x) = -\nabla_\theta E_\theta(x) + \mathbb{E}_{x' \sim p_\theta}\left[\nabla_\theta E_\theta(x')\right]$.
Problem
Contrastive divergence with $k$ steps is known to be a biased estimator of the MLE gradient. Explain intuitively why this bias exists and describe a scenario where it causes the model to fail.
References
Canonical:
- LeCun, Chopra, Hadsell, Ranzato, Huang, A Tutorial on Energy-Based Learning (2006)
- Hinton, Training Products of Experts by Minimizing Contrastive Divergence (2002)
Current:
- Song & Ermon, Generative Modeling by Estimating Gradients of the Data Distribution (NeurIPS 2019)
- Du & Mordatch, Implicit Generation and Modeling with Energy-Based Models (NeurIPS 2019)
- Bishop, Pattern Recognition and Machine Learning (2006), Chapters 1-14
Next Topics
The natural next steps from energy-based models:
- Diffusion models: score matching at multiple noise levels, the modern EBM
- [Variational autoencoders](/topics/autoencoders): a different approach to intractable normalization via amortized inference
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Autoencoders (Layer 2)
- Feedforward Networks and Backpropagation (Layer 2)
- Differentiation in Rn (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
- Matrix Calculus (Layer 1)
- The Jacobian Matrix (Layer 0A)
- The Hessian Matrix (Layer 0A)
- Activation Functions (Layer 1)
- Convex Optimization Basics (Layer 1)
- Matrix Operations and Properties (Layer 0A)
- Vectors, Matrices, and Linear Maps (Layer 0A)