

Boltzmann Machines and Hopfield Networks

Energy-based models for associative memory and generative learning: Hopfield networks store patterns via energy minimization, Boltzmann machines add stochasticity and hidden units, and RBMs enable tractable learning through contrastive divergence.


Why This Matters

Before backpropagation dominated, energy-based models were the primary framework for neural computation. Hopfield networks (1982) showed that a network of binary neurons with symmetric weights could store and retrieve patterns by minimizing an energy function. Boltzmann machines (1985) extended this idea with stochastic units and hidden variables, creating one of the first principled generative models. Restricted Boltzmann machines (RBMs) became practical building blocks for deep learning in the mid-2000s, preceding the current era of deep feedforward networks.

Understanding these models clarifies the connection between neural networks, statistical physics, and modern energy-based approaches.

Hopfield Networks

Definition

Hopfield Network

A Hopfield network is a fully connected network of $n$ binary neurons with states $s_i \in \{-1, +1\}$, symmetric weights $w_{ij} = w_{ji}$ with $w_{ii} = 0$, and energy function:

$$E(\mathbf{s}) = -\frac{1}{2} \sum_{i \neq j} w_{ij} s_i s_j - \sum_i b_i s_i$$

where $b_i$ are bias terms.

The update rule is asynchronous: pick a neuron $i$ at random and set $s_i \leftarrow \text{sign}\left(\sum_j w_{ij} s_j + b_i\right)$. Each update either decreases the energy or leaves it unchanged.

Storing patterns. Given $p$ patterns $\boldsymbol{\xi}^1, \ldots, \boldsymbol{\xi}^p$ with $\xi_i^\mu \in \{-1, +1\}$, the Hebbian learning rule sets:

$$w_{ij} = \frac{1}{n} \sum_{\mu=1}^{p} \xi_i^\mu \xi_j^\mu$$

Each stored pattern becomes a local minimum of the energy function.
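As a concrete sketch of the rules above (illustrative code only; the function names `hebbian_weights`, `energy`, and `recall` are my own, not from the source):

```python
import numpy as np

def hebbian_weights(patterns):
    """Hebbian rule: w_ij = (1/n) sum_mu xi_i^mu xi_j^mu, zero diagonal.

    `patterns` is a (p, n) array with entries in {-1, +1}."""
    n = patterns.shape[1]
    W = patterns.T @ patterns / n
    np.fill_diagonal(W, 0.0)          # enforce w_ii = 0
    return W

def energy(W, s, b=None):
    """E(s) = -1/2 sum_{i != j} w_ij s_i s_j - sum_i b_i s_i."""
    b = np.zeros(len(s)) if b is None else b
    return -0.5 * s @ W @ s - b @ s

def recall(W, s, b=None, max_sweeps=100, seed=0):
    """Asynchronous updates in random order until a fixed point."""
    rng = np.random.default_rng(seed)
    s = s.astype(float).copy()
    b = np.zeros(len(s)) if b is None else b
    for _ in range(max_sweeps):
        changed = False
        for i in rng.permutation(len(s)):
            h = W[i] @ s + b[i]       # local field at neuron i
            s_new = 1.0 if h >= 0 else -1.0
            if s_new != s[i]:
                s[i], changed = s_new, True
        if not changed:               # no flips in a full sweep: fixed point
            return s
    return s
```

Each stored pattern is a fixed point of `recall`, and `energy` never increases along a trajectory, matching the convergence theorem below.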

Theorem

Hopfield Convergence Theorem

Statement

Under asynchronous updates, the energy $E(\mathbf{s})$ is non-increasing at each step. The network converges to a fixed point (a local energy minimum) in finite time.

Intuition

Flipping neuron $i$ changes the energy by $\Delta E = -s_i^{\text{new}}\left(\sum_j w_{ij} s_j + b_i\right) + s_i^{\text{old}}\left(\sum_j w_{ij} s_j + b_i\right)$. The update rule chooses $s_i^{\text{new}}$ to align with the local field, so $\Delta E \leq 0$. Since the energy is bounded below and takes finitely many values, the process terminates.

Proof Sketch

Let $h_i = \sum_j w_{ij} s_j + b_i$ be the local field at neuron $i$. The update sets $s_i \leftarrow \text{sign}(h_i)$. The energy change is $\Delta E = -(s_i^{\text{new}} - s_i^{\text{old}}) h_i$. If $s_i$ changes, then $s_i^{\text{new}} h_i > 0$ and $s_i^{\text{old}} h_i < 0$, giving $\Delta E < 0$. If $s_i$ does not change, $\Delta E = 0$. The state space $\{-1,+1\}^n$ is finite, so the process must terminate.

Why It Matters

This theorem guarantees that Hopfield networks always reach a stable state, making them useful as content-addressable memories. Present a corrupted version of a stored pattern, and the network dynamics will (often) recover the original.

Failure Mode

The capacity is limited. With Hebbian learning, reliable retrieval requires the number of stored patterns to satisfy $p \lesssim 0.14n$. Beyond this threshold, spurious minima proliferate and retrieval fails. The patterns must also be roughly uncorrelated.

Boltzmann Machines

Definition

Boltzmann Machine

A Boltzmann machine replaces deterministic updates with stochastic ones. Each neuron $i$ takes state $s_i = 1$ with probability:

$$P(s_i = 1 \mid \mathbf{s}_{\setminus i}) = \sigma\left(\sum_j w_{ij} s_j + b_i\right)$$

where $\sigma(x) = 1/(1 + e^{-x})$ is the logistic sigmoid. The network defines a joint distribution $P(\mathbf{s}) = \frac{1}{Z} \exp(-E(\mathbf{s}))$, which is a Gibbs distribution.

With hidden units $\mathbf{h}$ and visible units $\mathbf{v}$, the model defines a marginal distribution $P(\mathbf{v}) = \sum_{\mathbf{h}} P(\mathbf{v}, \mathbf{h})$ over observed data. Learning maximizes the log-likelihood of training data by adjusting the weights.
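The stochastic update rule can be sketched as a single Gibbs sweep (an illustrative sketch, not from the source; I use the common $\{0,1\}$ convention for unit states, which matches the $P(s_i = 1) = \sigma(\cdot)$ form above):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gibbs_sweep(W, b, s, rng):
    """One full sweep of Gibbs sampling over all units of a Boltzmann machine.

    Each unit (state in {0, 1}) is resampled in turn from its conditional
    P(s_i = 1 | rest) = sigmoid(sum_j w_ij s_j + b_i)."""
    for i in range(len(s)):
        p = sigmoid(W[i] @ s + b[i])
        s[i] = 1.0 if rng.random() < p else 0.0
    return s
```

Running many sweeps produces (approximate) samples from the Gibbs distribution $P(\mathbf{s}) \propto \exp(-E(\mathbf{s}))$; this is exactly the sampling step that makes the negative phase below expensive.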

The gradient of the log-likelihood has a clean form:

$$\frac{\partial \log P(\mathbf{v})}{\partial w_{ij}} = \langle s_i s_j \rangle_{\text{data}} - \langle s_i s_j \rangle_{\text{model}}$$

The first term ("positive phase") clamps visible units to data and samples hidden units. The second term ("negative phase") requires sampling from the full joint, which is intractable for general Boltzmann machines.

Restricted Boltzmann Machines

Definition

Restricted Boltzmann Machine (RBM)

An RBM restricts the connectivity to a bipartite graph: visible units $\mathbf{v}$ connect only to hidden units $\mathbf{h}$, with no within-layer connections. This yields conditional independence:

$$P(\mathbf{h} \mid \mathbf{v}) = \prod_j P(h_j \mid \mathbf{v}), \quad P(\mathbf{v} \mid \mathbf{h}) = \prod_i P(v_i \mid \mathbf{h})$$

Each conditional is a product of independent Bernoullis, enabling exact parallel sampling within each layer.
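The parallel sampling that bipartite structure buys can be sketched as follows (illustrative code with assumed shapes: `W` is `(n_visible, n_hidden)`, `b` and `c` are the visible and hidden biases; the function names are my own):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample_h_given_v(W, c, v, rng):
    """P(h_j = 1 | v) = sigmoid(c_j + sum_i W_ij v_i), all j at once."""
    p = sigmoid(v @ W + c)
    return (rng.random(p.shape) < p).astype(float), p

def sample_v_given_h(W, b, h, rng):
    """P(v_i = 1 | h) = sigmoid(b_i + sum_j W_ij h_j), all i at once."""
    p = sigmoid(h @ W.T + b)
    return (rng.random(p.shape) < p).astype(float), p
```

Because all hidden units are conditionally independent given the visibles (and vice versa), each conditional is one matrix product plus independent Bernoulli draws, with no sequential sweep over units.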

Proposition

Contrastive Divergence Approximation

Statement

Contrastive divergence with $k$ Gibbs steps (CD-$k$) approximates the log-likelihood gradient by replacing the intractable model expectation with a $k$-step Gibbs sample starting from the data:

$$\Delta w_{ij} \approx \langle v_i h_j \rangle_{\text{data}} - \langle v_i h_j \rangle_{\text{recon}}$$

where "recon" denotes the reconstruction after $k$ alternating Gibbs steps.

Intuition

Instead of running Gibbs sampling to convergence (which could take exponentially long), CD-$k$ runs only $k$ steps starting from the training data. The data distribution is already close to the model distribution if the model is reasonable, so a few steps suffice to get a useful gradient signal.
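A minimal CD-$k$ update for a Bernoulli-Bernoulli RBM might look like this (a sketch under my own shape conventions, `W` of shape `(n_visible, n_hidden)`; using hidden probabilities rather than samples in the outer products is a common variance-reduction choice, not something the source specifies):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd_k_update(W, b, c, v0, k=1, lr=0.01, rng=None):
    """One CD-k step: Delta W ~ <v h>_data - <v h>_recon, from a batch v0."""
    rng = np.random.default_rng(0) if rng is None else rng
    # Positive phase: hidden activations driven by the clamped data.
    ph0 = sigmoid(v0 @ W + c)
    h = (rng.random(ph0.shape) < ph0).astype(float)
    # Negative phase: k alternating Gibbs steps starting from the data.
    v = v0
    for _ in range(k):
        pv = sigmoid(h @ W.T + b)
        v = (rng.random(pv.shape) < pv).astype(float)
        ph = sigmoid(v @ W + c)
        h = (rng.random(ph.shape) < ph).astype(float)
    # Gradient estimates averaged over the batch.
    batch = v0.shape[0]
    dW = (v0.T @ ph0 - v.T @ ph) / batch
    db = (v0 - v).mean(axis=0)
    dc = (ph0 - ph).mean(axis=0)
    return W + lr * dW, b + lr * db, c + lr * dc
```

With `k=1` this is the CD-1 rule that made RBM training practical; persistent contrastive divergence would instead carry `v` over between calls rather than restarting from the data.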

Proof Sketch

Hinton (2002) showed that CD-$k$ follows the gradient of a different objective function (the difference of two KL divergences). It is not the exact log-likelihood gradient, but it moves in a direction that reduces a related divergence measure. Formal analysis by Carreira-Perpinan and Hinton (2005) showed that the bias decreases as $k$ increases.

Why It Matters

CD-1 (a single Gibbs step) made RBM training practical. This enabled the deep belief network revolution (Hinton et al. 2006), where RBMs were stacked and trained greedily layer by layer. This was the first successful method for training deep generative models and led directly to the modern deep learning era.

Failure Mode

CD-$k$ is a biased estimator of the gradient. For small $k$, the model may fail to learn long-range correlations. In practice, CD-1 often underestimates the negative phase, leading to models that assign probability mass to regions far from the data. Persistent contrastive divergence (PCD) partially addresses this by maintaining persistent Markov chains across parameter updates.

Common Confusions

Watch Out

Hopfield networks are not Boltzmann machines

Hopfield networks use deterministic updates and find local energy minima. Boltzmann machines use stochastic updates and sample from a Gibbs distribution. Adding a temperature $T$ to a Hopfield network via $P(s_i = 1) = \sigma(h_i / T)$ turns it into a Boltzmann machine. As $T \to 0$, the Boltzmann machine recovers deterministic Hopfield dynamics.

Watch Out

RBMs are not deep models

A single RBM has one hidden layer. Deep belief networks (DBNs) stack multiple RBMs, training each greedily. The depth comes from stacking, not from the RBM itself.

Canonical Examples

Example

Pattern retrieval in a Hopfield network

Store two patterns in a 4-neuron Hopfield network: $\boldsymbol{\xi}^1 = (+1,+1,-1,-1)$ and $\boldsymbol{\xi}^2 = (+1,-1,+1,-1)$. The weight matrix (with the Hebbian rule) is $w_{ij} = \frac{1}{4}(\xi_i^1 \xi_j^1 + \xi_i^2 \xi_j^2)$. Starting from the corrupted state $(+1, +1, -1, +1)$ (one bit flipped from $\boldsymbol{\xi}^1$), asynchronous updates that visit the corrupted neuron first converge to $\boldsymbol{\xi}^1$, recovering the stored pattern. (With only 4 neurons, two patterns already exceed the $0.14n$ capacity, so other update orders can land in a spurious minimum instead.)

Exercises

ExerciseCore

Problem

For a Hopfield network with $n = 100$ neurons using Hebbian learning, approximately how many uncorrelated binary patterns can be reliably stored and retrieved?

ExerciseAdvanced

Problem

In an RBM with $m$ visible units and $n$ hidden units, how many parameters does the model have? Write the conditional $P(h_j = 1 \mid \mathbf{v})$ and explain why exact sampling from $P(\mathbf{v}, \mathbf{h})$ is tractable conditioned on one layer but intractable marginally.

References

Canonical:

  • Hopfield, "Neural networks and physical systems with emergent collective computational abilities" (1982)
  • Ackley, Hinton, Sejnowski, "A Learning Algorithm for Boltzmann Machines" (1985)
  • Hinton, "Training Products of Experts by Minimizing Contrastive Divergence" (2002)

Current:

  • Goodfellow, Bengio, Courville, Deep Learning (2016), Chapter 20

  • Bishop, Pattern Recognition and Machine Learning (2006), Chapters 1-14

Last reviewed: April 2026
