

Boltzmann Machines and Hopfield Networks

Energy-based models for associative memory and generative learning: Hopfield networks store patterns via energy minimization, Boltzmann machines add stochasticity and hidden units, and RBMs enable tractable learning through contrastive divergence.


Why This Matters

Before backpropagation dominated, energy-based models were the primary framework for neural computation. Hopfield networks (1982) showed that a network of binary neurons with symmetric weights could store and retrieve patterns by minimizing an energy function. Boltzmann machines (1985) extended this idea with stochastic units and hidden variables, creating one of the first principled generative models. Restricted Boltzmann machines (RBMs) became practical building blocks for deep learning in the mid-2000s, preceding the current era of deep feedforward networks.

Understanding these models clarifies the connection between neural networks, statistical physics, and modern energy-based approaches.

Hopfield Networks

Definition

Hopfield Network

A Hopfield network is a fully connected network of $n$ binary neurons with states $s_i \in \{-1, +1\}$, symmetric weights $w_{ij} = w_{ji}$ with $w_{ii} = 0$, and energy function:

$$E(\mathbf{s}) = -\frac{1}{2} \sum_{i \neq j} w_{ij} s_i s_j - \sum_i b_i s_i$$

where $b_i$ are bias terms.

The update rule is asynchronous: pick a neuron $i$ at random and set $s_i \leftarrow \text{sign}\left(\sum_j w_{ij} s_j + b_i\right)$. Each update either decreases the energy or leaves it unchanged.

Storing patterns. Given $p$ patterns $\boldsymbol{\xi}^1, \ldots, \boldsymbol{\xi}^p$ with $\xi_i^\mu \in \{-1, +1\}$, the Hebbian learning rule sets:

$$w_{ij} = \frac{1}{n} \sum_{\mu=1}^{p} \xi_i^\mu \xi_j^\mu$$

Each stored pattern becomes a local minimum of the energy function.
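As a concrete sketch of the rules above (illustrative code only; the function names `hebbian_weights`, `energy`, and `recall` are my own, not from the source):

```python
import numpy as np

def hebbian_weights(patterns):
    """Hebbian rule: w_ij = (1/n) sum_mu xi_i^mu xi_j^mu, zero diagonal.

    `patterns` is a (p, n) array with entries in {-1, +1}."""
    n = patterns.shape[1]
    W = patterns.T @ patterns / n
    np.fill_diagonal(W, 0.0)          # enforce w_ii = 0
    return W

def energy(W, s, b=None):
    """E(s) = -1/2 sum_{i != j} w_ij s_i s_j - sum_i b_i s_i."""
    b = np.zeros(len(s)) if b is None else b
    return -0.5 * s @ W @ s - b @ s

def recall(W, s, b=None, max_sweeps=100, seed=0):
    """Asynchronous updates in random order until a fixed point."""
    rng = np.random.default_rng(seed)
    s = s.astype(float).copy()
    b = np.zeros(len(s)) if b is None else b
    for _ in range(max_sweeps):
        changed = False
        for i in rng.permutation(len(s)):
            h = W[i] @ s + b[i]       # local field at neuron i
            s_new = 1.0 if h >= 0 else -1.0
            if s_new != s[i]:
                s[i], changed = s_new, True
        if not changed:               # no flips in a full sweep: fixed point
            return s
    return s
```

Each stored pattern is a fixed point of `recall`, and `energy` never increases along a trajectory, matching the convergence theorem below.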

Theorem

Hopfield Convergence Theorem

Statement

Under asynchronous updates, the energy $E(\mathbf{s})$ is non-increasing at each step. The network converges to a fixed point (a local energy minimum) in finite time.

Intuition

Flipping neuron $i$ changes the energy by $\Delta E = -s_i^{\text{new}}\left(\sum_j w_{ij} s_j + b_i\right) + s_i^{\text{old}}\left(\sum_j w_{ij} s_j + b_i\right)$. The update rule chooses $s_i^{\text{new}}$ to align with the local field, so $\Delta E \leq 0$. Since the energy is bounded below and takes finitely many values, the process terminates.

Proof Sketch

Let $h_i = \sum_j w_{ij} s_j + b_i$ be the local field at neuron $i$. The update sets $s_i \leftarrow \text{sign}(h_i)$. The energy change is $\Delta E = -(s_i^{\text{new}} - s_i^{\text{old}}) h_i$. If $s_i$ changes, then $s_i^{\text{new}} h_i > 0$ and $s_i^{\text{old}} h_i < 0$, giving $\Delta E < 0$. If $s_i$ does not change, $\Delta E = 0$. The state space $\{-1,+1\}^n$ is finite, so the process must terminate.

Why It Matters

This theorem guarantees that Hopfield networks always reach a stable state, making them useful as content-addressable memories. Present a corrupted version of a stored pattern, and the network dynamics will (often) recover the original.

Failure Mode

The capacity is limited. With Hebbian learning, reliable retrieval requires the number of stored patterns to satisfy $p \lesssim 0.14n$. Beyond this threshold, spurious minima proliferate and retrieval fails. The patterns must also be roughly uncorrelated.

Boltzmann Machines

Definition

Boltzmann Machine

A Boltzmann machine replaces deterministic updates with stochastic ones. Each neuron $i$ takes state $s_i = 1$ with probability:

$$P(s_i = 1 \mid \mathbf{s}_{\setminus i}) = \sigma\left(\sum_j w_{ij} s_j + b_i\right)$$

where $\sigma(x) = 1/(1 + e^{-x})$ is the logistic sigmoid. The network defines a joint distribution $P(\mathbf{s}) = \frac{1}{Z} \exp(-E(\mathbf{s}))$, which is a Gibbs distribution.

With hidden units $\mathbf{h}$ and visible units $\mathbf{v}$, the model defines a marginal distribution $P(\mathbf{v}) = \sum_{\mathbf{h}} P(\mathbf{v}, \mathbf{h})$ over observed data. Learning maximizes the log-likelihood of training data by adjusting the weights.
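The stochastic update rule can be sketched as a single Gibbs sweep (an illustrative sketch, not from the source; I use the common $\{0,1\}$ convention for unit states, which matches the $P(s_i = 1) = \sigma(\cdot)$ form above):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gibbs_sweep(W, b, s, rng):
    """One full sweep of Gibbs sampling over all units of a Boltzmann machine.

    Each unit (state in {0, 1}) is resampled in turn from its conditional
    P(s_i = 1 | rest) = sigmoid(sum_j w_ij s_j + b_i)."""
    for i in range(len(s)):
        p = sigmoid(W[i] @ s + b[i])
        s[i] = 1.0 if rng.random() < p else 0.0
    return s
```

Running many sweeps produces (approximate) samples from the Gibbs distribution $P(\mathbf{s}) \propto \exp(-E(\mathbf{s}))$; this is exactly the sampling step that makes the negative phase below expensive.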

The gradient of the log-likelihood has a clean form:

$$\frac{\partial \log P(\mathbf{v})}{\partial w_{ij}} = \langle s_i s_j \rangle_{\text{data}} - \langle s_i s_j \rangle_{\text{model}}$$

The first term ("positive phase") clamps visible units to data and samples hidden units. The second term ("negative phase") requires sampling from the full joint, which is intractable for general Boltzmann machines.

Restricted Boltzmann Machines

Definition

Restricted Boltzmann Machine (RBM)

An RBM restricts the connectivity to a bipartite graph: visible units $\mathbf{v}$ connect only to hidden units $\mathbf{h}$, with no within-layer connections. This yields conditional independence:

$$P(\mathbf{h} \mid \mathbf{v}) = \prod_j P(h_j \mid \mathbf{v}), \quad P(\mathbf{v} \mid \mathbf{h}) = \prod_i P(v_i \mid \mathbf{h})$$

Each conditional is a product of independent Bernoullis, enabling exact parallel sampling within each layer.
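The parallel sampling that bipartite structure buys can be sketched as follows (illustrative code with assumed shapes: `W` is `(n_visible, n_hidden)`, `b` and `c` are the visible and hidden biases; the function names are my own):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample_h_given_v(W, c, v, rng):
    """P(h_j = 1 | v) = sigmoid(c_j + sum_i W_ij v_i), all j at once."""
    p = sigmoid(v @ W + c)
    return (rng.random(p.shape) < p).astype(float), p

def sample_v_given_h(W, b, h, rng):
    """P(v_i = 1 | h) = sigmoid(b_i + sum_j W_ij h_j), all i at once."""
    p = sigmoid(h @ W.T + b)
    return (rng.random(p.shape) < p).astype(float), p
```

Because all hidden units are conditionally independent given the visibles (and vice versa), each conditional is one matrix product plus independent Bernoulli draws, with no sequential sweep over units.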

Proposition

Contrastive Divergence Approximation

Statement

Contrastive divergence with $k$ Gibbs steps (CD-$k$) approximates the log-likelihood gradient by replacing the intractable model expectation with a $k$-step Gibbs sample starting from the data:

$$\Delta w_{ij} \approx \langle v_i h_j \rangle_{\text{data}} - \langle v_i h_j \rangle_{\text{recon}}$$

where "recon" denotes the reconstruction after $k$ alternating Gibbs steps.

Intuition

Instead of running Gibbs sampling to convergence (which could take exponentially long), CD-$k$ runs only $k$ steps starting from the training data. The data distribution is already close to the model distribution if the model is reasonable, so a few steps suffice to get a useful gradient signal.
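A minimal CD-$k$ update for a Bernoulli-Bernoulli RBM might look like this (a sketch under my own shape conventions, `W` of shape `(n_visible, n_hidden)`; using hidden probabilities rather than samples in the outer products is a common variance-reduction choice, not something the source specifies):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd_k_update(W, b, c, v0, k=1, lr=0.01, rng=None):
    """One CD-k step: Delta W ~ <v h>_data - <v h>_recon, from a batch v0."""
    rng = np.random.default_rng(0) if rng is None else rng
    # Positive phase: hidden activations driven by the clamped data.
    ph0 = sigmoid(v0 @ W + c)
    h = (rng.random(ph0.shape) < ph0).astype(float)
    # Negative phase: k alternating Gibbs steps starting from the data.
    v = v0
    for _ in range(k):
        pv = sigmoid(h @ W.T + b)
        v = (rng.random(pv.shape) < pv).astype(float)
        ph = sigmoid(v @ W + c)
        h = (rng.random(ph.shape) < ph).astype(float)
    # Gradient estimates averaged over the batch.
    batch = v0.shape[0]
    dW = (v0.T @ ph0 - v.T @ ph) / batch
    db = (v0 - v).mean(axis=0)
    dc = (ph0 - ph).mean(axis=0)
    return W + lr * dW, b + lr * db, c + lr * dc
```

With `k=1` this is the CD-1 rule that made RBM training practical; persistent contrastive divergence would instead carry `v` over between calls rather than restarting from the data.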

Proof Sketch

Hinton (2002) showed that CD-$k$ follows the gradient of a different objective function (the difference of two KL divergences). It is not the exact log-likelihood gradient, but it moves in a direction that reduces a related divergence measure. Formal analysis by Carreira-Perpinan and Hinton (2005) showed that the bias decreases as $k$ increases.

Why It Matters

CD-1 (a single Gibbs step) made RBM training practical. This enabled the deep belief network revolution (Hinton et al. 2006), where RBMs were stacked and trained greedily layer by layer. This was the first successful method for training deep generative models and led directly to the modern deep learning era.

Failure Mode

CD-$k$ is a biased estimator of the gradient. For small $k$, the model may fail to learn long-range correlations. In practice, CD-1 often underestimates the negative phase, leading to models that assign probability mass to regions far from the data. Persistent contrastive divergence (PCD) partially addresses this by maintaining persistent Markov chains across parameter updates.

Common Confusions

Watch Out

Hopfield networks are not Boltzmann machines

Hopfield networks use deterministic updates and find local energy minima. Boltzmann machines use stochastic updates and sample from a Gibbs distribution. Adding a temperature $T$ to a Hopfield network via $P(s_i = 1) = \sigma(h_i / T)$ turns it into a Boltzmann machine. As $T \to 0$, the Boltzmann machine recovers deterministic Hopfield dynamics.

Watch Out

RBMs are not deep models

A single RBM has one hidden layer. Deep belief networks (DBNs) stack multiple RBMs, training each greedily. The depth comes from stacking, not from the RBM itself.

Canonical Examples

Example

Pattern retrieval in a Hopfield network

Store two patterns in a 4-neuron Hopfield network: $\boldsymbol{\xi}^1 = (+1,+1,-1,-1)$ and $\boldsymbol{\xi}^2 = (+1,-1,+1,-1)$. The weight matrix (with the Hebbian rule) is $w_{ij} = \frac{1}{4}(\xi_i^1 \xi_j^1 + \xi_i^2 \xi_j^2)$. Starting from the corrupted state $(+1, +1, -1, +1)$ (one bit flipped from $\boldsymbol{\xi}^1$), asynchronous updates that visit the corrupted neuron first converge to $\boldsymbol{\xi}^1$, recovering the stored pattern. (With only 4 neurons, two patterns already exceed the $0.14n$ capacity, so other update orders can land in a spurious minimum instead.)

Exercises

ExerciseCore

Problem

For a Hopfield network with $n = 100$ neurons using Hebbian learning, approximately how many uncorrelated binary patterns can be reliably stored and retrieved?

ExerciseAdvanced

Problem

In an RBM with $m$ visible units and $n$ hidden units, how many parameters does the model have? Write the conditional $P(h_j = 1 \mid \mathbf{v})$ and explain why exact sampling from $P(\mathbf{v}, \mathbf{h})$ is tractable conditioned on one layer but intractable marginally.

References

Canonical:

  • Hopfield, "Neural networks and physical systems with emergent collective computational abilities" (1982)
  • Ackley, Hinton, Sejnowski, "A Learning Algorithm for Boltzmann Machines" (1985)
  • Hinton, "Training Products of Experts by Minimizing Contrastive Divergence" (2002)

Current:

  • Goodfellow, Bengio, Courville, Deep Learning (2016), Chapter 20

  • Bishop, Pattern Recognition and Machine Learning (2006), Chapters 1-14

Last reviewed: April 2026
