Boltzmann Machines and Hopfield Networks
Energy-based models for associative memory and generative learning: Hopfield networks store patterns via energy minimization, Boltzmann machines add stochasticity and hidden units, and RBMs enable tractable learning through contrastive divergence.
Why This Matters
Before backpropagation dominated, energy-based models were the primary framework for neural computation. Hopfield networks (1982) showed that a network of binary neurons with symmetric weights could store and retrieve patterns by minimizing an energy function. Boltzmann machines (1985) extended this idea with stochastic units and hidden variables, creating one of the first principled generative models. Restricted Boltzmann machines (RBMs) became practical building blocks for deep learning in the mid-2000s, preceding the current era of deep feedforward networks.
Understanding these models clarifies the connection between neural networks, statistical physics, and modern energy-based approaches.
Hopfield Networks
Hopfield Network
A Hopfield network is a fully connected network of binary neurons with states $s_i \in \{-1, +1\}$, symmetric weights $w_{ij} = w_{ji}$ with $w_{ii} = 0$, and energy function:

$$E(s) = -\frac{1}{2} \sum_{i,j} w_{ij} s_i s_j - \sum_i b_i s_i$$

where $b_i$ are bias terms.
The update rule is asynchronous: pick a neuron $i$ at random and set $s_i \leftarrow \operatorname{sign}\left(\sum_j w_{ij} s_j + b_i\right)$. Each update either decreases the energy or leaves it unchanged.
Storing patterns. Given $P$ patterns $\xi^{(1)}, \dots, \xi^{(P)}$ with $\xi_i^{(\mu)} \in \{-1, +1\}$, the Hebbian learning rule sets:

$$w_{ij} = \frac{1}{N} \sum_{\mu=1}^{P} \xi_i^{(\mu)} \xi_j^{(\mu)} \quad (i \neq j)$$
Each stored pattern becomes a local minimum of the energy function.
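A minimal numerical sketch of Hebbian storage and the energy-descent property of asynchronous updates (assuming NumPy; the network size, pattern count, and random seed are illustrative choices, not from the text):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hebbian storage: w_ij = (1/N) * sum_mu xi_i^mu xi_j^mu, zero diagonal.
N = 8
patterns = rng.choice([-1.0, 1.0], size=(2, N))
W = patterns.T @ patterns / N
np.fill_diagonal(W, 0.0)

def energy(W, s):
    """E(s) = -1/2 * sum_ij w_ij s_i s_j (biases omitted for brevity)."""
    return -0.5 * s @ W @ s

# Asynchronous updates from a random start; track the energy trajectory.
s = rng.choice([-1.0, 1.0], size=N)
energies = [energy(W, s)]
for _ in range(50):
    i = rng.integers(N)          # pick one neuron at random
    h = W[i] @ s                 # its local field
    if h != 0:
        s[i] = np.sign(h)        # align the neuron with its field
    energies.append(energy(W, s))
```

Tracking `energies` makes the convergence theorem below visible empirically: the trajectory is non-increasing at every step.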
Hopfield Convergence Theorem
Statement
Under asynchronous updates, the energy is non-increasing at each step. The network converges to a fixed point (local energy minimum) in finite time.
Intuition
Flipping neuron $i$ changes the energy by $\Delta E = -\Delta s_i \left(\sum_j w_{ij} s_j + b_i\right)$. The update rule chooses $s_i$ to align with the local field, so $\Delta E \leq 0$. Since the energy is bounded below and takes finitely many values, the process terminates.
Proof Sketch
Let $h_i = \sum_j w_{ij} s_j + b_i$ be the local field at neuron $i$. The update sets $s_i' = \operatorname{sign}(h_i)$. The energy change is $\Delta E = -(s_i' - s_i)\, h_i$. If $s_i$ changes, then $s_i' = -s_i$ and $s_i' h_i > 0$, giving $\Delta E = -2 s_i' h_i < 0$. If $s_i$ does not change, $\Delta E = 0$. The state space is finite, so the process must terminate.
Why It Matters
This theorem guarantees that Hopfield networks always reach a stable state, making them useful as content-addressable memories. Present a corrupted version of a stored pattern, and the network dynamics will (often) recover the original.
Failure Mode
The capacity is limited. With Hebbian learning, reliable retrieval requires the number of stored patterns $P \lesssim 0.138\,N$. Beyond this threshold, spurious minima proliferate and retrieval fails. The patterns must also be roughly uncorrelated.
Boltzmann Machines
Boltzmann Machine
A Boltzmann machine replaces deterministic updates with stochastic ones. Each binary neuron $s_i \in \{0, 1\}$ takes state $1$ with probability:

$$p(s_i = 1 \mid s_{\setminus i}) = \sigma\!\left(\sum_j w_{ij} s_j + b_i\right)$$

where $\sigma(x) = 1/(1 + e^{-x})$ is the logistic sigmoid. The network defines a joint distribution $p(s) = \frac{1}{Z} e^{-E(s)}$, which is a Gibbs distribution.
With hidden units and visible units , the model defines a marginal distribution over observed data. Learning maximizes the log-likelihood of training data by adjusting weights.
The gradient of the log-likelihood has a clean form:

$$\frac{\partial \log p(v)}{\partial w_{ij}} = \langle s_i s_j \rangle_{\text{data}} - \langle s_i s_j \rangle_{\text{model}}$$
The first term ("positive phase") clamps visible units to data and samples hidden units. The second term ("negative phase") requires sampling from the full joint, which is intractable for general Boltzmann machines.
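The negative phase relies on Gibbs sampling from the joint. A minimal sketch of one such sampler for binary $\{0,1\}$ units, assuming NumPy (the network size, weight scale, and sweep count are illustrative assumptions):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gibbs_sweep(W, b, s, rng):
    """One full sweep of Gibbs sampling: resample each binary unit from
    p(s_i = 1 | rest) = sigmoid(sum_j w_ij s_j + b_i)."""
    s = s.copy()
    for i in range(len(s)):
        p = sigmoid(W[i] @ s + b[i])
        s[i] = float(rng.random() < p)
    return s

rng = np.random.default_rng(0)
n = 5
W = rng.normal(scale=0.5, size=(n, n))
W = (W + W.T) / 2            # symmetric weights
np.fill_diagonal(W, 0.0)     # no self-connections
b = np.zeros(n)

s = rng.choice([0.0, 1.0], size=n)
for _ in range(100):         # many sweeps needed to approach the Gibbs distribution
    s = gibbs_sweep(W, b, s, rng)
```

In a general Boltzmann machine the chain must mix over all units jointly, which is exactly what makes the negative phase expensive.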
Restricted Boltzmann Machines
Restricted Boltzmann Machine (RBM)
An RBM restricts the connectivity to a bipartite graph: visible units $v$ connect only to hidden units $h$, with no within-layer connections. This yields conditional independence:

$$p(h \mid v) = \prod_j p(h_j \mid v), \qquad p(v \mid h) = \prod_i p(v_i \mid h)$$
Each conditional is a product of independent Bernoullis, enabling exact parallel sampling within each layer.
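The blockwise conditionals can be sketched directly, assuming binary units and NumPy (shapes, weight scale, and seed are illustrative):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample_h_given_v(W, c, v, rng):
    """p(h_j = 1 | v) = sigmoid(sum_i W_ij v_i + c_j); all hidden units in parallel."""
    p = sigmoid(v @ W + c)
    return (rng.random(p.shape) < p).astype(float), p

def sample_v_given_h(W, b, h, rng):
    """p(v_i = 1 | h) = sigmoid(sum_j W_ij h_j + b_i); all visible units in parallel."""
    p = sigmoid(h @ W.T + b)
    return (rng.random(p.shape) < p).astype(float), p

rng = np.random.default_rng(0)
n_v, n_h = 6, 3
W = rng.normal(scale=0.1, size=(n_v, n_h))  # visible-to-hidden weights
b, c = np.zeros(n_v), np.zeros(n_h)         # visible and hidden biases

v = rng.choice([0.0, 1.0], size=n_v)
h, p_h = sample_h_given_v(W, c, v, rng)     # exact sample, one matrix product
v1, p_v = sample_v_given_h(W, b, h, rng)
```

Because each layer factorizes given the other, one matrix product and one elementwise sigmoid suffice; no within-layer Markov chain is needed.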
Contrastive Divergence Approximation
Statement
Contrastive divergence with $k$ Gibbs steps (CD-$k$) approximates the log-likelihood gradient by replacing the intractable model expectation with a $k$-step Gibbs sample starting from the data:

$$\Delta w_{ij} \propto \langle v_i h_j \rangle_{\text{data}} - \langle v_i h_j \rangle_{\text{recon}}$$

where "recon" denotes the reconstruction after $k$ alternating Gibbs steps.
Intuition
Instead of running Gibbs sampling to convergence (which could take exponentially long), CD-$k$ runs only $k$ steps starting from the training data. The data distribution is already close to the model distribution if the model is reasonable, so a few steps suffice to get a useful gradient signal.
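A minimal CD-1 training step for a binary RBM might look like the following (a sketch under stated assumptions, not a reference implementation; the learning rate, layer sizes, and toy data are all illustrative):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_step(W, b, c, v0, rng, lr=0.1):
    """One CD-1 update: positive phase on the data vector v0,
    negative phase on a one-step Gibbs reconstruction."""
    # Positive phase: hidden activations given the data.
    ph0 = sigmoid(v0 @ W + c)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)
    # One alternating Gibbs step: reconstruct visibles, re-infer hiddens.
    pv1 = sigmoid(h0 @ W.T + b)
    v1 = (rng.random(pv1.shape) < pv1).astype(float)
    ph1 = sigmoid(v1 @ W + c)
    # Gradient approximation: <v h>_data - <v h>_recon (per-example outer products).
    W += lr * (np.outer(v0, ph0) - np.outer(v1, ph1))
    b += lr * (v0 - v1)
    c += lr * (ph0 - ph1)
    return W, b, c

rng = np.random.default_rng(0)
n_v, n_h = 6, 4
W = rng.normal(scale=0.01, size=(n_v, n_h))
b, c = np.zeros(n_v), np.zeros(n_h)
data = rng.choice([0.0, 1.0], size=(20, n_v))  # toy binary data

for epoch in range(5):
    for v0 in data:
        W, b, c = cd1_step(W, b, c, v0, rng)
```

Note that the negative phase here starts from the data, not from a converged chain; that shortcut is exactly the source of the bias discussed below.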
Proof Sketch
Hinton (2002) showed that CD-$k$ follows the gradient of a different objective function (the difference of two KL divergences). It is not the exact log-likelihood gradient, but it moves in a direction that reduces a related divergence measure. Formal analysis by Carreira-Perpinan and Hinton (2005) showed the bias decreases as $k$ increases.
Why It Matters
CD-1 (a single Gibbs step) made RBM training practical. This enabled the deep belief network revolution (Hinton et al. 2006), where RBMs were stacked and trained greedily layer by layer. This was the first successful method for training deep generative models and led directly to the modern deep learning era.
Failure Mode
CD-$k$ is a biased estimator of the gradient. For small $k$, the model may fail to learn long-range correlations. In practice, CD-1 often underestimates the negative phase, leading to models that assign probability mass to regions far from the data. Persistent contrastive divergence (PCD) partially addresses this by maintaining persistent Markov chains across parameter updates.
Common Confusions
Hopfield networks are not Boltzmann machines
Hopfield networks use deterministic updates and find local energy minima. Boltzmann machines use stochastic updates and sample from a Gibbs distribution. Adding a temperature $T$ to a Hopfield network, so that each neuron aligns with its local field $h_i$ with probability $\sigma(2 h_i / T)$, turns it into a Boltzmann machine. As $T \to 0$, the Boltzmann machine recovers deterministic Hopfield dynamics.
RBMs are not deep models
A single RBM has one hidden layer. Deep belief networks (DBNs) stack multiple RBMs, training each greedily. The depth comes from stacking, not from the RBM itself.
Canonical Examples
Pattern retrieval in a Hopfield network
Store two patterns in a 4-neuron Hopfield network: $\xi^{(1)} = (+1, +1, +1, +1)$ and $\xi^{(2)} = (+1, -1, +1, -1)$. The weight matrix (with Hebbian rule) has $w_{13} = w_{24} = 1/2$ and all other entries zero. Starting from the corrupted state $s = (-1, +1, +1, +1)$ (one bit flipped from $\xi^{(1)}$), updating neuron 1 gives local field $h_1 = w_{13} s_3 = 1/2 > 0$, so the bit flips back and the network settles at $\xi^{(1)}$, recovering the stored pattern. (Caveat: $-\xi^{(1)}$ and $\pm\xi^{(2)}$ are also stable states, and in a network this small a one-bit corruption lies equally close to $\xi^{(1)}$ and $-\xi^{(2)}$, so the update order determines which minimum is reached.)
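A concrete check of such retrieval, assuming the illustrative patterns $\xi^{(1)} = (+1,+1,+1,+1)$, $\xi^{(2)} = (+1,-1,+1,-1)$ and NumPy (the fixed sweep order is one valid asynchronous schedule):

```python
import numpy as np

# Hebbian storage of the two patterns.
patterns = np.array([[1, 1, 1, 1],
                     [1, -1, 1, -1]], dtype=float)
N = patterns.shape[1]
W = patterns.T @ patterns / N
np.fill_diagonal(W, 0.0)   # only w_13 = w_24 = 1/2 are nonzero

def sweep(W, s):
    """One asynchronous sweep in fixed order 0..N-1."""
    s = s.copy()
    for i in range(len(s)):
        h = W[i] @ s
        if h != 0:
            s[i] = np.sign(h)
    return s

s = np.array([-1, 1, 1, 1], dtype=float)  # xi1 with the first bit flipped
s = sweep(W, s)                           # neuron 1 flips back; rest are stable
```

Under this sweep order the corrupted state settles at $\xi^{(1)}$ after a single pass.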
Exercises
Problem
For a Hopfield network with $N$ neurons using Hebbian learning, approximately how many uncorrelated binary patterns can be reliably stored and retrieved?
Problem
In an RBM with $n_v$ visible units and $n_h$ hidden units, how many parameters does the model have? Write the conditional $p(h_j = 1 \mid v)$ and explain why exact sampling is tractable conditioned on one layer but intractable marginally.
References
Canonical:
- Hopfield, "Neural networks and physical systems with emergent collective computational abilities" (1982)
- Ackley, Hinton, Sejnowski, "A Learning Algorithm for Boltzmann Machines" (1985)
- Hinton, "Training Products of Experts by Minimizing Contrastive Divergence" (2002)
Current:
- Goodfellow, Bengio, Courville, Deep Learning (2016), Chapter 20
- Bishop, Pattern Recognition and Machine Learning (2006), Chapters 1-14
Next Topics
- Autoencoders: another approach to learning representations with hidden variables
- Diffusion models: modern energy-based generative models
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Common Probability Distributions (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)