

MCMC for Markov Random Fields

Gibbs sampling on undirected graphical models. The joint distribution factorizes over cliques, and each variable is sampled conditioned on its Markov blanket. Ising model, image denoising, and simulated annealing.



Why This Matters

Markov random fields (MRFs) are undirected graphical models that encode conditional independence through graph structure. They appear throughout ML and statistics: spatial statistics, image analysis, natural language processing, and physics. The joint distribution of an MRF factorizes over cliques of the graph, but computing marginals or the normalizing constant is typically intractable (it requires summing over exponentially many configurations). Gibbs sampling provides a practical approach: sample each variable conditioned on its neighbors, and iterate. The Markov blanket property of MRFs makes each conditional tractable, even when the full joint is not.

Mental Model

Think of a grid of pixels, each with a label (e.g., black or white). Neighboring pixels tend to have the same label. The MRF encodes this preference through potential functions on edges: pairs of pixels with the same label get high potential, different labels get low potential. The joint probability is proportional to the product of all pairwise potentials. To sample from this distribution, pick a pixel, look at its neighbors, and resample that pixel's label from the conditional distribution given the neighbors. Repeat for all pixels, many times. This is Gibbs sampling on an MRF.

Formal Setup

Definition

Markov Random Field

A Markov random field consists of a set of random variables X = (X_1, \ldots, X_n) associated with the nodes of an undirected graph G = (V, E), satisfying the local Markov property: X_i \perp X_{V \setminus \text{nb}(i) \setminus \{i\}} \mid X_{\text{nb}(i)}, where \text{nb}(i) denotes the neighbors of node i in G. Each variable is conditionally independent of all non-neighbors given its neighbors (its Markov blanket).

Definition

Gibbs Distribution on a Graph

A Gibbs distribution with respect to graph G takes the form:

P(x) = \frac{1}{Z} \prod_{C \in \mathcal{C}} \psi_C(x_C)

where \mathcal{C} is the set of cliques of G, \psi_C(x_C) \geq 0 are potential functions defined on clique configurations, and Z = \sum_x \prod_C \psi_C(x_C) is the partition function (normalizing constant).
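To make the definition concrete, here is a minimal sketch that brute-forces the partition function Z of a tiny 3-node chain MRF by enumerating all 2^3 configurations. The agreement potential \psi(a, b) = \exp(\beta \cdot \mathbf{1}[a = b]) and the choice \beta = 1 are illustrative assumptions:

```python
import itertools
import math

# Tiny 3-node chain MRF with binary states and agreement potentials
# psi(a, b) = exp(beta * 1[a == b]); beta = 1.0 is an illustrative choice.
beta = 1.0

def psi(a, b):
    return math.exp(beta * (a == b))

# Brute-force the partition function Z by enumerating all 2^3 configurations.
Z = sum(psi(x1, x2) * psi(x2, x3)
        for x1, x2, x3 in itertools.product([0, 1], repeat=3))

# Normalized probability of one configuration: P(1,1,1) = psi(1,1) * psi(1,1) / Z.
p_all_ones = psi(1, 1) * psi(1, 1) / Z
print(f"Z = {Z:.4f}, P(1,1,1) = {p_all_ones:.4f}")
```

Even at n = 3 the enumeration touches every configuration; at n = 100 it would be a sum over 2^100 terms, which is exactly why sampling methods are needed.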

The Hammersley-Clifford theorem connects these two definitions.

Core Theory

Theorem

Hammersley-Clifford Theorem

Statement

If P(x) > 0 for all x and P satisfies the local Markov property with respect to an undirected graph G, then P factorizes as a Gibbs distribution over the cliques of G:

P(x) = \frac{1}{Z} \prod_{C \in \mathcal{C}} \psi_C(x_C)

Conversely, any Gibbs distribution over G satisfies the local Markov property with respect to G.

Intuition

The Markov property says that the distribution has a certain conditional independence structure. The theorem says this is equivalent to the distribution factorizing into local pieces (potential functions on cliques). This equivalence is what makes MRFs computationally useful: the conditional independence structure implies that the full joint decomposes into small, manageable factors.

Proof Sketch

The reverse direction is straightforward: compute the conditional P(x_i \mid x_{-i}) from the clique factorization and verify that only terms involving neighbors of i remain.

The forward direction is more involved. Assume the positivity condition. Define \phi_C(x_C) = \sum_{A \subseteq C} (-1)^{|C| - |A|} \log P(x_A, x^0_{V \setminus A}) via Möbius inversion, where x^0 is a fixed reference configuration. The Markov property implies that \phi_C = 0 whenever C is not a clique (the conditional independence forces the corresponding terms to cancel). Then \log P(x) = \sum_C \phi_C(x_C) + \text{const}, which gives the clique factorization.

Why It Matters

This theorem justifies the standard MRF modeling approach: specify local potentials on cliques, and the resulting distribution automatically satisfies the desired conditional independence structure. It also justifies Gibbs sampling: the factorization ensures that each full conditional P(x_i \mid x_{-i}) depends only on the neighbors, making each Gibbs update local and cheap.

Failure Mode

The positivity condition (P(x) > 0 for all x) is necessary. Without it, the equivalence can fail: a distribution can satisfy the local Markov property but not factorize over cliques. In practice, most MRF models include a temperature parameter or smoothing that ensures strict positivity, so this condition is rarely a practical concern.

Gibbs Sampling on MRFs

Gibbs sampling on an MRF iterates:

  1. Select a node i (systematically or randomly)
  2. Sample X_i \sim P(X_i \mid X_{\text{nb}(i)}), the conditional distribution given the current values of all neighbors

The full conditional is:

P(x_i \mid x_{-i}) = P(x_i \mid x_{\text{nb}(i)}) \propto \prod_{C \ni i} \psi_C(x_C)

Only potential functions involving node i appear. For pairwise MRFs (where the largest cliques are edges), this is a product over the edges incident to i.

Computational cost per update. Each Gibbs update requires computing a product over the cliques containing node i. For a pairwise MRF on a grid, each node has 4 neighbors, so each update involves 4 pairwise potentials. This is O(\text{degree}) per update, independent of the graph size.
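A minimal sketch of such a local update, assuming a pairwise MRF on a 4-cycle with agreement potentials; the graph, the potential, and \beta = 0.8 are illustrative choices, not from the text:

```python
import math

# Hedged sketch: the full conditional at node i of a pairwise MRF touches
# only the edges incident to i. The 4-cycle graph, agreement potential,
# and beta = 0.8 are illustrative assumptions.
beta = 0.8

def edge_potential(a, b):
    return math.exp(beta * (a == b))

def full_conditional(i, x, neighbors, states=(0, 1)):
    """Return [P(x_i = s | x_nb(i)) for s in states]. The joint's partition
    function Z never appears: it cancels when the weights are normalized."""
    weights = []
    for s in states:
        w = 1.0
        for j in neighbors[i]:            # only cliques (edges) containing i
            w *= edge_potential(s, x[j])
        weights.append(w)
    total = sum(weights)
    return [w / total for w in weights]

# 4-cycle: 0 - 1 - 2 - 3 - 0
neighbors = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
x = [0, 1, 0, 1]
print(full_conditional(1, x, neighbors))  # O(degree) work, independent of n
```

The loop over `neighbors[i]` is the whole cost of one update, which is what makes a full sweep over n nodes scale linearly in n for bounded-degree graphs.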

The Ising Model

Definition

Ising Model

The Ising model on graph G = (V, E) assigns binary variables x_i \in \{-1, +1\} to each node, with joint distribution:

P(x) = \frac{1}{Z} \exp\left(\beta \sum_{(i,j) \in E} x_i x_j + h \sum_{i \in V} x_i\right)

where \beta > 0 is the inverse temperature (coupling strength) and h is the external field. When \beta is large, neighboring nodes strongly prefer to agree. The Ising model is the canonical pairwise binary MRF.

The Gibbs update for the Ising model at node i is:

P(x_i = +1 \mid x_{\text{nb}(i)}) = \sigma\left(2\beta \sum_{j \in \text{nb}(i)} x_j + 2h\right)

where \sigma is the logistic function. This is a logistic regression of x_i on the sum of its neighbors.
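This update can be sketched as a systematic-scan Gibbs sampler on a small grid. The grid size, free-boundary handling, number of sweeps, and \beta = 0.6 are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def gibbs_sweep(x, beta, h=0.0):
    """One systematic-scan Gibbs sweep over an L x L Ising grid with free
    boundaries; entries of x are in {-1, +1}. Uses the logistic update
    P(x_ij = +1 | neighbors) = sigma(2*beta*s + 2*h)."""
    L = x.shape[0]
    for i in range(L):
        for j in range(L):
            s = 0
            if i > 0:
                s += x[i - 1, j]
            if i < L - 1:
                s += x[i + 1, j]
            if j > 0:
                s += x[i, j - 1]
            if j < L - 1:
                s += x[i, j + 1]
            p_plus = 1.0 / (1.0 + np.exp(-(2.0 * beta * s + 2.0 * h)))
            x[i, j] = 1 if rng.random() < p_plus else -1
    return x

# Illustrative run: beta = 0.6 is above the 2D critical coupling (~0.4407),
# so the chain should drift toward an ordered (mostly aligned) state.
x = rng.choice([-1, 1], size=(16, 16))
for _ in range(200):
    gibbs_sweep(x, beta=0.6)
print("mean magnetization:", x.mean())
```

Each update reads at most four neighbors, matching the O(degree) cost noted above.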

Phase transition. On infinite regular lattices, the Ising model exhibits a phase transition at a critical inverse temperature \beta_c. For \beta < \beta_c, the system is disordered (roughly equal numbers of +1 and -1 spins). For \beta > \beta_c, spontaneous magnetization occurs (most nodes align). Near the critical point, Gibbs sampling mixes extremely slowly because large clusters of aligned spins form and are difficult to flip one node at a time. This is critical slowing down.

Applications

Image denoising. Model a noisy image as an MRF: each pixel has an observed noisy value y_i and an unknown true value x_i. The data term \psi_i(x_i, y_i) favors x_i close to y_i. The smoothness prior \psi_{ij}(x_i, x_j) favors similar values at neighboring pixels. Gibbs sampling on the posterior P(x \mid y) produces denoised samples. The MAP estimate (the mode of the posterior) can be found by simulated annealing.
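As a hedged illustration of this setup, here is a sketch assuming binary pixels with an Ising smoothness prior and a Gaussian data term; \beta, \sigma, the sweep count, and the two-region test image are all made up for the example:

```python
import numpy as np

rng = np.random.default_rng(1)

def denoise_gibbs(y, beta=1.0, sigma=0.6, sweeps=20):
    """Hedged denoising sketch: x in {-1, +1}, y = x_true + Gaussian noise.
    Posterior P(x | y) is proportional to
        exp(beta * sum_edges x_i x_j) * prod_i exp(-(y_i - x_i)^2 / (2 sigma^2)),
    so the per-pixel log-odds of +1 vs -1 is 2*beta*s + 2*y_i / sigma^2."""
    L = y.shape[0]
    x = np.where(y > 0, 1, -1)          # initialize at the thresholded data
    for _ in range(sweeps):
        for i in range(L):
            for j in range(L):
                s = 0
                if i > 0:
                    s += x[i - 1, j]
                if i < L - 1:
                    s += x[i + 1, j]
                if j > 0:
                    s += x[i, j - 1]
                if j < L - 1:
                    s += x[i, j + 1]
                logit = 2.0 * beta * s + 2.0 * y[i, j] / sigma**2
                p_plus = 1.0 / (1.0 + np.exp(-logit))
                x[i, j] = 1 if rng.random() < p_plus else -1
    return x

# Illustrative two-region test image plus Gaussian noise.
x_true = np.ones((12, 12), dtype=int)
x_true[:, :6] = -1
y = x_true + 0.6 * rng.standard_normal((12, 12))
x_hat = denoise_gibbs(y)
print("pixel agreement with truth:", (x_hat == x_true).mean())
```

The data term anchors each pixel to its observation while the prior smooths over neighbors; the balance between them is controlled by \beta and \sigma.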

Texture synthesis. Learn MRF potentials from a texture example, then sample new textures by running Gibbs sampling with the learned potentials. The Markov property ensures that local statistics of the synthesized texture match the original.

Spatial statistics. Model disease incidence, soil composition, or vegetation patterns over a geographic region. Neighboring regions are coupled through an MRF prior, allowing spatial smoothing while respecting the observed data.

Simulated Annealing on MRFs

To find the MAP configuration x^* = \arg\max_x P(x), run Gibbs sampling at decreasing temperatures. Replace P(x) with P_T(x) \propto P(x)^{1/T} and decrease T over time. At high temperature, the sampler explores freely. At low temperature, it concentrates on high-probability configurations.

If the temperature decreases as T_k = c / \log(k+1) for sufficiently large c, simulated annealing converges to the global optimum with probability 1 (Geman and Geman, 1984). In practice, this cooling schedule is too slow, and faster geometric schedules T_k = T_0 \cdot \alpha^k are used at the cost of losing the global optimality guarantee.
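A hedged sketch of this recipe for the Ising model with h = 0, whose MAP states are the two fully aligned configurations; the lattice size, \beta, T_0, \alpha, and sweep count are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(2)

def anneal_map(L=12, beta=0.5, T0=5.0, alpha=0.95, sweeps=200):
    """Hedged sketch: simulated annealing toward an Ising MAP state (h = 0).
    Gibbs-sample from P_T(x), proportional to P(x)^(1/T), i.e. with effective
    coupling beta / T, under the geometric schedule T_k = T0 * alpha^k.
    Geometric cooling buys speed but forfeits the Geman-Geman guarantee."""
    x = rng.choice([-1, 1], size=(L, L))
    T = T0
    for _ in range(sweeps):
        b_eff = beta / T
        for i in range(L):
            for j in range(L):
                s = 0
                if i > 0:
                    s += x[i - 1, j]
                if i < L - 1:
                    s += x[i + 1, j]
                if j > 0:
                    s += x[i, j - 1]
                if j < L - 1:
                    s += x[i, j + 1]
                # sigma(2 * b_eff * s), written via tanh so that very low
                # temperatures do not overflow the exponential
                p_plus = 0.5 * (1.0 + np.tanh(b_eff * s))
                x[i, j] = 1 if rng.random() < p_plus else -1
        T *= alpha
    return x

x_map = anneal_map()
print("final magnetization:", x_map.mean())
```

Early sweeps (high T, small effective coupling) explore freely; late sweeps (tiny T) act almost greedily, so each pixel snaps to the majority of its neighbors.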

Common Confusions

Watch Out

MRFs are not Bayesian networks

Bayesian networks are directed graphical models with conditional probability tables. MRFs are undirected with potential functions. The factorization structures are different: Bayesian networks factorize as products of conditionals, MRFs factorize as products of potentials with a partition function Z. Some independence structures can be represented by one but not the other.

Watch Out

The partition function Z does not need to be computed for Gibbs sampling

Gibbs sampling requires only the full conditional P(x_i \mid x_{\text{nb}(i)}), which is a ratio of potentials. The partition function Z cancels in the ratio. This is a major advantage: Z is typically intractable (an exponential sum), but Gibbs sampling sidesteps it entirely.

Watch Out

Slow mixing near phase transitions is not a bug in the algorithm

When the Ising model is near its critical temperature, Gibbs sampling takes exponentially many steps to mix. This is not a failure of Gibbs sampling; it reflects the genuine difficulty of the distribution. The distribution has long-range correlations that no local sampler can resolve quickly. Cluster methods (Swendsen-Wang, Wolff) can help by proposing collective updates.

Key Takeaways

  • MRFs are undirected graphical models where the joint distribution factorizes over cliques
  • Hammersley-Clifford: local Markov property is equivalent to clique factorization (under positivity)
  • Gibbs sampling on MRFs updates each variable conditioned on its Markov blanket (neighbors)
  • The partition function is never computed; Gibbs sampling only needs conditional ratios
  • The Ising model is the canonical binary MRF, with a phase transition that causes critical slowing down
  • Simulated annealing on MRFs finds MAP configurations by sampling at decreasing temperatures

Exercises

ExerciseCore

Problem

Consider a 3-node chain MRF with binary variables x_1, x_2, x_3 \in \{0, 1\} and pairwise potentials \psi_{12}(x_1, x_2) = \exp(\beta \cdot \mathbf{1}[x_1 = x_2]) and \psi_{23}(x_2, x_3) = \exp(\beta \cdot \mathbf{1}[x_2 = x_3]) with \beta = 1. Compute the Gibbs conditional P(x_2 = 1 \mid x_1 = 1, x_3 = 0).

ExerciseAdvanced

Problem

For the Ising model on a complete graph K_n (all-to-all connections) with \beta > 0 and h = 0, show that the Gibbs conditional for node i depends on the other nodes only through the sum S_{-i} = \sum_{j \neq i} x_j. This is the Curie-Weiss (mean-field) model. Compute P(x_i = +1 \mid S_{-i}) and explain what happens as n \to \infty with \beta replaced by \beta/n.

References

Canonical:

  • Geman and Geman, Stochastic Relaxation, Gibbs Distributions, and the Bayesian Restoration of Images, IEEE T-PAMI (1984)
  • Kindermann and Snell, Markov Random Fields and Their Applications, AMS (1980)

Current:

  • Koller and Friedman, Probabilistic Graphical Models (2009), Chapters 4, 12

  • Murphy, Machine Learning: A Probabilistic Perspective (2012), Chapters 19, 24

  • Robert & Casella, Monte Carlo Statistical Methods (2004), Chapters 3-7

  • Gelman et al., Bayesian Data Analysis (2013), Chapters 10-12

Next Topics

  • Variational inference on MRFs (belief propagation, mean-field approximation)
  • Swendsen-Wang and Wolff cluster algorithms for faster mixing

Last reviewed: April 2026
