

MCMC for Markov Random Fields

Gibbs sampling on undirected graphical models. The joint distribution factorizes over cliques, and each variable is sampled conditioned on its Markov blanket. Ising model, image denoising, and simulated annealing.



Why This Matters

Markov random fields (MRFs) are undirected graphical models that encode conditional independence through graph structure. They appear throughout ML and statistics: spatial statistics, image analysis, natural language processing, and physics. The joint distribution of an MRF factorizes over cliques of the graph, but computing marginals or the normalizing constant is typically intractable (it requires summing over exponentially many configurations). Gibbs sampling provides a practical approach: sample each variable conditioned on its neighbors, and iterate. The Markov blanket property of MRFs makes each conditional tractable, even when the full joint is not.

Mental Model

Think of a grid of pixels, each with a label (e.g., black or white). Neighboring pixels tend to have the same label. The MRF encodes this preference through potential functions on edges: pairs of pixels with the same label get high potential, different labels get low potential. The joint probability is proportional to the product of all pairwise potentials. To sample from this distribution, pick a pixel, look at its neighbors, and resample that pixel's label from the conditional distribution given the neighbors. Repeat for all pixels, many times. This is Gibbs sampling on an MRF.

Formal Setup

Definition

Markov Random Field

A Markov random field consists of a set of random variables X = (X_1, \ldots, X_n) associated with the nodes of an undirected graph G = (V, E), satisfying the local Markov property: X_i \perp X_{V \setminus \text{nb}(i) \setminus \{i\}} \mid X_{\text{nb}(i)}, where \text{nb}(i) denotes the neighbors of node i in G. Each variable is conditionally independent of all non-neighbors given its neighbors (its Markov blanket).

Definition

Gibbs Distribution on a Graph

A Gibbs distribution with respect to graph G takes the form:

P(x) = \frac{1}{Z} \prod_{C \in \mathcal{C}} \psi_C(x_C)

where \mathcal{C} is the set of cliques of G, \psi_C(x_C) \geq 0 are potential functions defined on clique configurations, and Z = \sum_x \prod_C \psi_C(x_C) is the partition function (normalizing constant).
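To make the definition concrete, here is a minimal sketch that brute-forces the partition function Z of a tiny 3-node chain MRF by enumerating all 2^3 configurations. The agreement potential \psi(a, b) = \exp(\beta \cdot \mathbf{1}[a = b]) and the choice \beta = 1 are illustrative assumptions:

```python
import itertools
import math

# Tiny 3-node chain MRF with binary states and agreement potentials
# psi(a, b) = exp(beta * 1[a == b]); beta = 1.0 is an illustrative choice.
beta = 1.0

def psi(a, b):
    return math.exp(beta * (a == b))

# Brute-force the partition function Z by enumerating all 2^3 configurations.
Z = sum(psi(x1, x2) * psi(x2, x3)
        for x1, x2, x3 in itertools.product([0, 1], repeat=3))

# Normalized probability of one configuration: P(1,1,1) = psi(1,1) * psi(1,1) / Z.
p_all_ones = psi(1, 1) * psi(1, 1) / Z
print(f"Z = {Z:.4f}, P(1,1,1) = {p_all_ones:.4f}")
```

Even at n = 3 the enumeration touches every configuration; at n = 100 it would be a sum over 2^100 terms, which is exactly why sampling methods are needed.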

The Hammersley-Clifford theorem connects these two definitions.

Core Theory

Theorem

Hammersley-Clifford Theorem

Statement

If P(x) > 0 for all x and P satisfies the local Markov property with respect to an undirected graph G, then P factorizes as a Gibbs distribution over the cliques of G:

P(x) = \frac{1}{Z} \prod_{C \in \mathcal{C}} \psi_C(x_C)

Conversely, any Gibbs distribution over G satisfies the local Markov property with respect to G.

Intuition

The Markov property says that the distribution has a certain conditional independence structure. The theorem says this is equivalent to the distribution factorizing into local pieces (potential functions on cliques). This equivalence is what makes MRFs computationally useful: the conditional independence structure implies that the full joint decomposes into small, manageable factors.

Proof Sketch

The reverse direction is straightforward: compute the conditional P(x_i \mid x_{-i}) from the clique factorization and verify that only terms involving neighbors of i remain.

The forward direction is more involved. Assume the positivity condition. Define \phi_C(x_C) = \sum_{A \subseteq C} (-1)^{|C| - |A|} \log P(x_A, x^0_{V \setminus A}) via Möbius inversion, where x^0 is a fixed reference configuration. The Markov property implies that \phi_C = 0 whenever C is not a clique (the conditional independence forces the corresponding terms to cancel). Then \log P(x) = \sum_C \phi_C(x_C) + \text{const}, which gives the clique factorization.

Why It Matters

This theorem justifies the standard MRF modeling approach: specify local potentials on cliques, and the resulting distribution automatically satisfies the desired conditional independence structure. It also justifies Gibbs sampling: the factorization ensures that each full conditional P(x_i \mid x_{-i}) depends only on the neighbors, making each Gibbs update local and cheap.

Failure Mode

The positivity condition (P(x) > 0 for all x) is necessary. Without it, the equivalence can fail: a distribution can satisfy the local Markov property but not factorize over cliques. In practice, most MRF models include a temperature parameter or smoothing that ensures strict positivity, so this condition is rarely a practical concern.

Gibbs Sampling on MRFs

Gibbs sampling on an MRF iterates:

  1. Select a node i (systematically or randomly)
  2. Sample X_i \sim P(X_i \mid X_{\text{nb}(i)}), the conditional distribution given the current values of all neighbors

The full conditional is:

P(x_i \mid x_{-i}) = P(x_i \mid x_{\text{nb}(i)}) \propto \prod_{C \ni i} \psi_C(x_C)

Only potential functions involving node i appear. For pairwise MRFs (where the largest cliques are edges), this is a product over the edges incident to i.

Computational cost per update. Each Gibbs update requires computing a product over the cliques containing node i. For a pairwise MRF on a grid, each node has 4 neighbors, so each update involves 4 pairwise potentials. This is O(\text{degree}) per update, independent of the graph size.
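A minimal sketch of such a local update, assuming a pairwise MRF on a 4-cycle with agreement potentials; the graph, the potential, and \beta = 0.8 are illustrative choices, not from the text:

```python
import math

# Hedged sketch: the full conditional at node i of a pairwise MRF touches
# only the edges incident to i. The 4-cycle graph, agreement potential,
# and beta = 0.8 are illustrative assumptions.
beta = 0.8

def edge_potential(a, b):
    return math.exp(beta * (a == b))

def full_conditional(i, x, neighbors, states=(0, 1)):
    """Return [P(x_i = s | x_nb(i)) for s in states]. The joint's partition
    function Z never appears: it cancels when the weights are normalized."""
    weights = []
    for s in states:
        w = 1.0
        for j in neighbors[i]:            # only cliques (edges) containing i
            w *= edge_potential(s, x[j])
        weights.append(w)
    total = sum(weights)
    return [w / total for w in weights]

# 4-cycle: 0 - 1 - 2 - 3 - 0
neighbors = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
x = [0, 1, 0, 1]
print(full_conditional(1, x, neighbors))  # O(degree) work, independent of n
```

The loop over `neighbors[i]` is the whole cost of one update, which is what makes a full sweep over n nodes scale linearly in n for bounded-degree graphs.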

The Ising Model

Definition

Ising Model

The Ising model on graph G = (V, E) assigns binary variables x_i \in \{-1, +1\} to each node, with joint distribution:

P(x) = \frac{1}{Z} \exp\left(\beta \sum_{(i,j) \in E} x_i x_j + h \sum_{i \in V} x_i\right)

where \beta > 0 is the inverse temperature (coupling strength) and h is the external field. When \beta is large, neighboring nodes strongly prefer to agree. The Ising model is the canonical pairwise binary MRF.

The Gibbs update for the Ising model at node i is:

P(x_i = +1 \mid x_{\text{nb}(i)}) = \sigma\left(2\beta \sum_{j \in \text{nb}(i)} x_j + 2h\right)

where \sigma is the logistic function. This is a logistic regression of x_i on the sum of its neighbors.
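This update can be sketched as a systematic-scan Gibbs sampler on a small grid. The grid size, free-boundary handling, number of sweeps, and \beta = 0.6 are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def gibbs_sweep(x, beta, h=0.0):
    """One systematic-scan Gibbs sweep over an L x L Ising grid with free
    boundaries; entries of x are in {-1, +1}. Uses the logistic update
    P(x_ij = +1 | neighbors) = sigma(2*beta*s + 2*h)."""
    L = x.shape[0]
    for i in range(L):
        for j in range(L):
            s = 0
            if i > 0:
                s += x[i - 1, j]
            if i < L - 1:
                s += x[i + 1, j]
            if j > 0:
                s += x[i, j - 1]
            if j < L - 1:
                s += x[i, j + 1]
            p_plus = 1.0 / (1.0 + np.exp(-(2.0 * beta * s + 2.0 * h)))
            x[i, j] = 1 if rng.random() < p_plus else -1
    return x

# Illustrative run: beta = 0.6 is above the 2D critical coupling (~0.4407),
# so the chain should drift toward an ordered (mostly aligned) state.
x = rng.choice([-1, 1], size=(16, 16))
for _ in range(200):
    gibbs_sweep(x, beta=0.6)
print("mean magnetization:", x.mean())
```

Each update reads at most four neighbors, matching the O(degree) cost noted above.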

Phase transition. On infinite regular lattices, the Ising model exhibits a phase transition at a critical inverse temperature \beta_c. For \beta < \beta_c, the system is disordered (roughly equal numbers of +1 and -1 spins). For \beta > \beta_c, spontaneous magnetization occurs (most nodes align). Near the critical point, Gibbs sampling mixes extremely slowly because large clusters of aligned spins form and are difficult to flip one node at a time. This is critical slowing down.

Applications

Image denoising. Model a noisy image as an MRF: each pixel has an observed noisy value y_i and an unknown true value x_i. The data term \psi_i(x_i, y_i) favors x_i close to y_i. The smoothness prior \psi_{ij}(x_i, x_j) favors similar values at neighboring pixels. Gibbs sampling on the posterior P(x \mid y) produces denoised samples. The MAP estimate (the mode of the posterior) can be found by simulated annealing.
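As a hedged illustration of this setup, here is a sketch assuming binary pixels with an Ising smoothness prior and a Gaussian data term; \beta, \sigma, the sweep count, and the two-region test image are all made up for the example:

```python
import numpy as np

rng = np.random.default_rng(1)

def denoise_gibbs(y, beta=1.0, sigma=0.6, sweeps=20):
    """Hedged denoising sketch: x in {-1, +1}, y = x_true + Gaussian noise.
    Posterior P(x | y) is proportional to
        exp(beta * sum_edges x_i x_j) * prod_i exp(-(y_i - x_i)^2 / (2 sigma^2)),
    so the per-pixel log-odds of +1 vs -1 is 2*beta*s + 2*y_i / sigma^2."""
    L = y.shape[0]
    x = np.where(y > 0, 1, -1)          # initialize at the thresholded data
    for _ in range(sweeps):
        for i in range(L):
            for j in range(L):
                s = 0
                if i > 0:
                    s += x[i - 1, j]
                if i < L - 1:
                    s += x[i + 1, j]
                if j > 0:
                    s += x[i, j - 1]
                if j < L - 1:
                    s += x[i, j + 1]
                logit = 2.0 * beta * s + 2.0 * y[i, j] / sigma**2
                p_plus = 1.0 / (1.0 + np.exp(-logit))
                x[i, j] = 1 if rng.random() < p_plus else -1
    return x

# Illustrative two-region test image plus Gaussian noise.
x_true = np.ones((12, 12), dtype=int)
x_true[:, :6] = -1
y = x_true + 0.6 * rng.standard_normal((12, 12))
x_hat = denoise_gibbs(y)
print("pixel agreement with truth:", (x_hat == x_true).mean())
```

The data term anchors each pixel to its observation while the prior smooths over neighbors; the balance between them is controlled by \beta and \sigma.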

Texture synthesis. Learn MRF potentials from a texture example, then sample new textures by running Gibbs sampling with the learned potentials. The Markov property ensures that local statistics of the synthesized texture match the original.

Spatial statistics. Model disease incidence, soil composition, or vegetation patterns over a geographic region. Neighboring regions are coupled through an MRF prior, allowing spatial smoothing while respecting the observed data.

Simulated Annealing on MRFs

To find the MAP configuration x^* = \arg\max_x P(x), run Gibbs sampling at decreasing temperatures. Replace P(x) with P_T(x) \propto P(x)^{1/T} and decrease T over time. At high temperature, the sampler explores freely. At low temperature, it concentrates on high-probability configurations.

If the temperature decreases as T_k = c / \log(k+1) for sufficiently large c, simulated annealing converges to the global optimum with probability 1 (Geman and Geman, 1984). In practice, this cooling schedule is too slow, and faster geometric schedules T_k = T_0 \cdot \alpha^k are used at the cost of losing the global optimality guarantee.
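A hedged sketch of this recipe for the Ising model with h = 0, whose MAP states are the two fully aligned configurations; the lattice size, \beta, T_0, \alpha, and sweep count are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(2)

def anneal_map(L=12, beta=0.5, T0=5.0, alpha=0.95, sweeps=200):
    """Hedged sketch: simulated annealing toward an Ising MAP state (h = 0).
    Gibbs-sample from P_T(x), proportional to P(x)^(1/T), i.e. with effective
    coupling beta / T, under the geometric schedule T_k = T0 * alpha^k.
    Geometric cooling buys speed but forfeits the Geman-Geman guarantee."""
    x = rng.choice([-1, 1], size=(L, L))
    T = T0
    for _ in range(sweeps):
        b_eff = beta / T
        for i in range(L):
            for j in range(L):
                s = 0
                if i > 0:
                    s += x[i - 1, j]
                if i < L - 1:
                    s += x[i + 1, j]
                if j > 0:
                    s += x[i, j - 1]
                if j < L - 1:
                    s += x[i, j + 1]
                # sigma(2 * b_eff * s), written via tanh so that very low
                # temperatures do not overflow the exponential
                p_plus = 0.5 * (1.0 + np.tanh(b_eff * s))
                x[i, j] = 1 if rng.random() < p_plus else -1
        T *= alpha
    return x

x_map = anneal_map()
print("final magnetization:", x_map.mean())
```

Early sweeps (high T, small effective coupling) explore freely; late sweeps (tiny T) act almost greedily, so each pixel snaps to the majority of its neighbors.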

Common Confusions

Watch Out

MRFs are not Bayesian networks

Bayesian networks are directed graphical models with conditional probability tables. MRFs are undirected with potential functions. The factorization structures are different: Bayesian networks factorize as products of conditionals, MRFs factorize as products of potentials with a partition function Z. Some independence structures can be represented by one but not the other.

Watch Out

The partition function Z does not need to be computed for Gibbs sampling

Gibbs sampling requires only the full conditional P(x_i \mid x_{\text{nb}(i)}), which is a ratio of potentials. The partition function Z cancels in the ratio. This is a major advantage: Z is typically intractable (an exponential sum), but Gibbs sampling sidesteps it entirely.

Watch Out

Slow mixing near phase transitions is not a bug in the algorithm

When the Ising model is near its critical temperature, Gibbs sampling takes exponentially many steps to mix. This is not a failure of Gibbs sampling; it reflects the genuine difficulty of the distribution. The distribution has long-range correlations that no local sampler can resolve quickly. Cluster methods (Swendsen-Wang, Wolff) can help by proposing collective updates.

Key Takeaways

  • MRFs are undirected graphical models where the joint distribution factorizes over cliques
  • Hammersley-Clifford: local Markov property is equivalent to clique factorization (under positivity)
  • Gibbs sampling on MRFs updates each variable conditioned on its Markov blanket (neighbors)
  • The partition function is never computed; Gibbs sampling only needs conditional ratios
  • The Ising model is the canonical binary MRF, with a phase transition that causes critical slowing down
  • Simulated annealing on MRFs finds MAP configurations by sampling at decreasing temperatures

Exercises

ExerciseCore

Problem

Consider a 3-node chain MRF with binary variables x_1, x_2, x_3 \in \{0, 1\} and pairwise potentials \psi_{12}(x_1, x_2) = \exp(\beta \cdot \mathbf{1}[x_1 = x_2]) and \psi_{23}(x_2, x_3) = \exp(\beta \cdot \mathbf{1}[x_2 = x_3]) with \beta = 1. Compute the Gibbs conditional P(x_2 = 1 \mid x_1 = 1, x_3 = 0).

ExerciseAdvanced

Problem

For the Ising model on a complete graph K_n (all-to-all connections) with \beta > 0 and h = 0, show that the Gibbs conditional for node i depends on the other nodes only through the sum S_{-i} = \sum_{j \neq i} x_j. This is the Curie-Weiss (mean-field) model. Compute P(x_i = +1 \mid S_{-i}) and explain what happens as n \to \infty with \beta replaced by \beta/n.

References

Canonical:

  • Geman and Geman, Stochastic Relaxation, Gibbs Distributions, and the Bayesian Restoration of Images, IEEE T-PAMI (1984)
  • Kindermann and Snell, Markov Random Fields and Their Applications, AMS (1980)

Current:

  • Koller and Friedman, Probabilistic Graphical Models (2009), Chapters 4, 12

  • Murphy, Machine Learning: A Probabilistic Perspective (2012), Chapters 19, 24

  • Robert & Casella, Monte Carlo Statistical Methods (2004), Chapters 3-7

  • Gelman et al., Bayesian Data Analysis (2013), Chapters 10-12

Next Topics

  • Variational inference on MRFs (belief propagation, mean-field approximation)
  • Swendsen-Wang and Wolff cluster algorithms for faster mixing

Last reviewed: April 2026
