
Gibbs Sampling

The workhorse MCMC algorithm for Bayesian models: sample each variable from its full conditional distribution, cycling through all variables, and every proposal is automatically accepted.


Prerequisites

Why This Matters

[Figure: Gibbs sampling as coordinate-wise conditional sweeps. For variables x_1, ..., x_4, each step samples one x_i from p(x_i | x_rest) while the others stay fixed. In the bivariate case, the 2D trajectory traces a staircase pattern, alternating moves along the x_1 and x_2 axes.]

Gibbs sampling is a foundational MCMC algorithm with a long history in applied Bayesian statistics. It powered earlier-generation tools like BUGS and JAGS, and remains standard whenever a model has conditionally conjugate structure. Modern general-purpose Bayesian workflows more often reach for Hamiltonian Monte Carlo (Stan defaults to NUTS) because Gibbs mixes poorly in high-dimensional or strongly correlated posteriors. Gibbs still wins when the conditionals are closed-form: hierarchical models with conjugate priors, mixture models, LDA-style latent-variable models, and any setting where sampling from p(x_i \mid x_{-i}) is cheap.

Its appeal is simplicity: instead of designing a proposal distribution and tuning acceptance rates, you just sample each variable from its conditional distribution given the rest. No tuning parameters, no rejections.

Mental Model

Imagine you have a joint distribution over multiple variables (x_1, x_2, \ldots, x_d). Sampling from the full joint distribution directly is hard. But sampling from x_1 \mid x_2, \ldots, x_d (the distribution of one variable with all the others held fixed) may be easy.

Gibbs sampling exploits this: cycle through the variables one at a time, updating each from its full conditional. This local updating strategy produces a Markov chain that converges to the full joint distribution.

Formal Setup and Notation

Let \pi(x_1, x_2, \ldots, x_d) be the target joint distribution. We assume we can sample from each full conditional distribution.

Definition

Full Conditional Distribution

The full conditional distribution of variable x_i given all other variables is:

\pi(x_i \mid x_{-i}) = \pi(x_i \mid x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_d) = \frac{\pi(x_1, \ldots, x_d)}{\int \pi(x_1, \ldots, x_d)\, dx_i}

where x_{-i} denotes all variables except x_i. In practice, to derive the full conditional for x_i, you write out the full joint density, treat everything that does not involve x_i as a constant, and recognize the resulting kernel as a known distribution.

Definition

Gibbs Sampler (Systematic Scan)

Given target \pi(x_1, \ldots, x_d) and initial state (x_1^{(0)}, \ldots, x_d^{(0)}):

At iteration tt, update each component in sequence:

  1. Draw x_1^{(t)} \sim \pi(x_1 \mid x_2^{(t-1)}, x_3^{(t-1)}, \ldots, x_d^{(t-1)})
  2. Draw x_2^{(t)} \sim \pi(x_2 \mid x_1^{(t)}, x_3^{(t-1)}, \ldots, x_d^{(t-1)})
  3. Continue through all variables...
  4. Draw x_d^{(t)} \sim \pi(x_d \mid x_1^{(t)}, x_2^{(t)}, \ldots, x_{d-1}^{(t)})

Note: each update uses the most recent values of all other variables.
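The sweep above can be sketched generically in Python. The `conditionals` callback interface is a hypothetical design chosen for illustration, and the bivariate normal used to exercise it is an assumed example (with \rho = 0.5):

```python
import numpy as np

def systematic_scan_gibbs(conditionals, x0, n_iter, seed=0):
    """Systematic-scan Gibbs sampler (illustrative sketch).

    conditionals[i] is a callable (x, rng) -> new value of x[i], drawn
    from its full conditional given the other coordinates of x.
    """
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    chain = np.empty((n_iter, len(x)))
    for t in range(n_iter):
        for i, draw in enumerate(conditionals):
            x[i] = draw(x, rng)  # conditions on the most recent values
        chain[t] = x
    return chain

# Standard bivariate normal with correlation rho: both conditionals are normal.
rho = 0.5
s = np.sqrt(1 - rho ** 2)
conds = [
    lambda x, rng: rng.normal(rho * x[1], s),  # x1 | x2
    lambda x, rng: rng.normal(rho * x[0], s),  # x2 | x1
]
chain = systematic_scan_gibbs(conds, x0=[0.0, 0.0], n_iter=20_000)
print(np.corrcoef(chain[2_000:].T)[0, 1])  # sample correlation near rho
```

Note that the inner loop mutates `x` in place, which is exactly how each update comes to condition on the freshest values of the other coordinates.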

Definition

Random Scan Gibbs Sampler

An alternative to systematic scan: at each iteration, choose an index i \in \{1, \ldots, d\} uniformly at random (or with specified probabilities), and update only x_i from \pi(x_i \mid x_{-i}).

Random scan has cleaner theoretical properties (it satisfies detailed balance directly), while systematic scan is more commonly used in practice because it updates all variables every iteration.
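A minimal random-scan sketch, again using a bivariate normal target with \rho = 0.5 as an assumed example (only one coordinate is refreshed per iteration):

```python
import numpy as np

rho = 0.5
s = np.sqrt(1 - rho ** 2)
rng = np.random.default_rng(1)
x = np.zeros(2)
chain = np.empty((40_000, 2))
for t in range(chain.shape[0]):
    i = rng.integers(2)                    # pick a coordinate uniformly
    x[i] = rng.normal(rho * x[1 - i], s)   # draw x_i from pi(x_i | x_{-i})
    chain[t] = x
print(np.corrcoef(chain[4_000:].T)[0, 1])  # sample correlation near rho
```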

Gibbs as a Special Case of Metropolis-Hastings

The central theoretical result about Gibbs sampling is that it is a special case of MH where the acceptance probability is always 1.

Theorem

Gibbs Sampling Has Acceptance Probability 1

Statement

Consider a single Gibbs update for variable x_i: propose x_i' from \pi(x_i' \mid x_{-i}). This is equivalent to a Metropolis-Hastings step with proposal q(x_i' \mid x_i, x_{-i}) = \pi(x_i' \mid x_{-i}) and acceptance probability:

\alpha = \min\!\left(1,\; \frac{\pi(x_i', x_{-i})\, \pi(x_i \mid x_{-i})}{\pi(x_i, x_{-i})\, \pi(x_i' \mid x_{-i})}\right) = 1

Every proposed move is accepted.

Intuition

The proposal distribution is exactly the full conditional, which is the optimal distribution for x_i given everything else. There is no mismatch between proposal and target to correct for, so the acceptance ratio evaluates to 1. You are proposing from exactly the right distribution.

Proof Sketch

By definition of the full conditional: \pi(x_i, x_{-i}) = \pi(x_i \mid x_{-i})\, \pi(x_{-i}).

The acceptance ratio becomes:

\alpha = \frac{\pi(x_i' \mid x_{-i})\, \pi(x_{-i}) \cdot \pi(x_i \mid x_{-i})}{\pi(x_i \mid x_{-i})\, \pi(x_{-i}) \cdot \pi(x_i' \mid x_{-i})} = \frac{\pi(x_{-i})}{\pi(x_{-i})} = 1

The \pi(x_i \mid x_{-i}) and \pi(x_i' \mid x_{-i}) terms cancel perfectly, leaving \pi(x_{-i})/\pi(x_{-i}) = 1.
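The cancellation can be checked numerically. The sketch below uses a bivariate normal with \rho = 0.8 (an arbitrary illustrative choice) and works with unnormalized log densities; the missing normalizing constants appear once in the numerator and once in the denominator, so the ratio still comes out to exactly 1:

```python
import numpy as np

rho = 0.8

def log_joint(x1, x2):
    # Standard bivariate normal with correlation rho, up to a constant.
    return -(x1 ** 2 - 2 * rho * x1 * x2 + x2 ** 2) / (2 * (1 - rho ** 2))

def log_cond(x1, x2):
    # Full conditional x1 | x2 ~ N(rho * x2, 1 - rho^2), up to a constant.
    return -(x1 - rho * x2) ** 2 / (2 * (1 - rho ** 2))

rng = np.random.default_rng(0)
x1, x2 = 0.3, -1.2                                    # current state
x1_new = rng.normal(rho * x2, np.sqrt(1 - rho ** 2))  # Gibbs proposal for x1

# MH log acceptance ratio with the full conditional as proposal.
log_alpha = (log_joint(x1_new, x2) + log_cond(x1, x2)
             - log_joint(x1, x2) - log_cond(x1_new, x2))
print(np.exp(log_alpha))  # 1.0, up to floating-point rounding
```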

Why It Matters

This result means Gibbs sampling inherits all the theoretical guarantees of MH (correctness via detailed balance, convergence via ergodicity) while avoiding the need to tune proposal distributions or deal with rejected samples. Every iteration produces a new state.

Failure Mode

An acceptance rate of 1 does not mean Gibbs sampling is always efficient. The chain can still mix slowly if the variables are highly correlated. When x_1 and x_2 are strongly correlated, updating one at a time while the other is held fixed leads to slow, diffusive exploration: the chain takes small steps along the correlation ridge. This is the main weakness of component-wise Gibbs sampling.

Theorem

Convergence of the Gibbs Sampler

Statement

Under mild regularity conditions, the Gibbs sampler is ergodic: starting from any initial state x^{(0)}, the distribution of x^{(t)} converges to \pi in total variation:

\lim_{t \to \infty} \|P^t(x^{(0)}, \cdot) - \pi(\cdot)\|_{\text{TV}} = 0

For random scan, the chain satisfies detailed balance with respect to \pi. For systematic scan, the chain satisfies the weaker condition of \pi-invariance (but not necessarily detailed balance).

Intuition

Each Gibbs update preserves \pi as the stationary distribution (since it is a valid MH step). As long as the chain can reach any state (guaranteed when the support is connected and the full conditionals have positive density), the chain converges. Random scan gives reversibility automatically; systematic scan often gives faster practical convergence but sacrifices reversibility.

Proof Sketch

For random scan: the transition kernel is a mixture of MH kernels, each satisfying detailed balance, so the mixture does too. Irreducibility follows from the positivity of full conditionals on the support.

For systematic scan: the transition kernel is a composition (not mixture) of MH kernels. Each kernel preserves \pi, so the composition does too. Detailed balance may fail, but \pi-invariance plus irreducibility and aperiodicity suffice for convergence.

Why It Matters

This theorem ensures that Gibbs sampling produces valid samples from the target distribution, justifying its use in Bayesian computation. The distinction between random and systematic scan matters for theoretical analysis (e.g., proving CLTs for MCMC estimators) but rarely affects practice.

Failure Mode

Convergence can be extremely slow when variables are strongly correlated or when the posterior has multiple well-separated modes. In the latter case, the chain may get trapped in one mode for a very long time. Blocking (grouping correlated variables and updating them jointly) can help with correlations; tempering or other advanced methods are needed for multimodality.

Blocking for Correlated Variables

When variables x_i and x_j are highly correlated, updating them one at a time leads to slow mixing. Blocking groups correlated variables together and samples them jointly (a related idea, collapsed Gibbs, instead integrates some variables out of the model entirely):

Instead of updating x_1 \mid x_2, x_3 then x_2 \mid x_1, x_3, update (x_1, x_2) \mid x_3 as a block. This requires sampling from the joint conditional \pi(x_1, x_2 \mid x_3), which may itself require specialized methods, but the resulting chain mixes much faster when x_1 and x_2 are correlated.

The general principle: block together variables that are strongly correlated a posteriori.

Gibbs for Conjugate Models

The power of Gibbs sampling is most apparent in conjugate models, where each full conditional belongs to a standard distribution family.

Example

Normal-Normal model (known variance)

Model: x_i \mid \mu \sim \mathcal{N}(\mu, \sigma^2) for i = 1, \ldots, n, with prior \mu \sim \mathcal{N}(\mu_0, \tau^2).

The full conditional for \mu is:

\mu \mid x_1, \ldots, x_n \sim \mathcal{N}\!\left(\frac{n\bar{x}/\sigma^2 + \mu_0/\tau^2}{n/\sigma^2 + 1/\tau^2},\; \frac{1}{n/\sigma^2 + 1/\tau^2}\right)

This is a one-block model, so one Gibbs draw gives an exact posterior sample. The power of Gibbs becomes clear in hierarchical extensions with additional latent variables.
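A minimal sketch of this conditional draw; the data are simulated and all hyperparameter values are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
sigma, mu0, tau = 1.0, 0.0, 3.0          # known sd, prior mean, prior sd
x = rng.normal(2.0, sigma, size=50)      # simulated data with true mean 2

# Posterior precision is the sum of the data and prior precisions.
prec = len(x) / sigma ** 2 + 1 / tau ** 2
mean = (len(x) * x.mean() / sigma ** 2 + mu0 / tau ** 2) / prec
mu_draw = rng.normal(mean, np.sqrt(1 / prec))  # one exact posterior sample
print(round(mean, 3), round(1 / prec, 4))
```

With n = 50 observations the data precision dominates the prior, so the posterior mean sits very close to the sample mean.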

Example

Beta-Binomial model

Model: x \mid \theta \sim \text{Binomial}(n, \theta), with prior \theta \sim \text{Beta}(\alpha, \beta).

The full conditional is:

\theta \mid x \sim \text{Beta}(\alpha + x, \beta + n - x)

Again conjugacy gives a closed-form full conditional. In a hierarchical model with multiple groups sharing a common prior, the Gibbs sampler alternates between updating group-level parameters and the hyperparameters.
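The corresponding draw in code; the counts and hyperparameters are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, beta, n, x = 2.0, 2.0, 100, 37    # 37 successes out of 100 trials
# Conjugacy: the posterior is Beta(alpha + x, beta + n - x), sampled exactly.
theta = rng.beta(alpha + x, beta + n - x, size=10_000)
print(theta.mean())  # near (alpha + x) / (alpha + beta + n) = 39/104
```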

Example

Gibbs for a bivariate normal

Target: (x_1, x_2) \sim \mathcal{N}\!\left(\begin{pmatrix}0\\0\end{pmatrix}, \begin{pmatrix}1 & \rho\\ \rho & 1\end{pmatrix}\right).

The full conditionals are:

x_1 \mid x_2 \sim \mathcal{N}(\rho\, x_2,\; 1 - \rho^2)
x_2 \mid x_1 \sim \mathcal{N}(\rho\, x_1,\; 1 - \rho^2)

When |\rho| is close to 1, the chain moves in small steps along the diagonal, illustrating the slow-mixing problem for correlated variables. The lag-1 autocorrelation of each coordinate's chain is approximately \rho^2 per iteration.
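A short simulation (the specific \rho values and chain lengths are illustrative choices) confirms that the lag-1 autocorrelation tracks \rho^2:

```python
import numpy as np

def bvn_gibbs(rho, n_iter, seed=0):
    # Systematic-scan Gibbs for the standard bivariate normal with corr rho.
    rng = np.random.default_rng(seed)
    s = np.sqrt(1 - rho ** 2)
    x1 = x2 = 0.0
    out = np.empty((n_iter, 2))
    for t in range(n_iter):
        x1 = rng.normal(rho * x2, s)   # x1 | x2
        x2 = rng.normal(rho * x1, s)   # x2 | x1
        out[t] = x1, x2
    return out

acf = {}
for rho in (0.3, 0.9):
    x1_chain = bvn_gibbs(rho, 50_000)[1_000:, 0]
    acf[rho] = np.corrcoef(x1_chain[:-1], x1_chain[1:])[0, 1]
    print(f"rho={rho}: lag-1 autocorrelation {acf[rho]:.3f} vs rho^2={rho**2:.2f}")
```

At \rho = 0.9 roughly 81% of each sample's information is carried over from the previous sweep, which is the slow-mixing behavior described above.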

Common Confusions

Watch Out

Gibbs requires KNOWN full conditionals

Gibbs sampling requires that you can derive and sample from each full conditional distribution. If the full conditional does not have a recognizable closed form, you cannot use standard Gibbs for that variable. The solution is MH within Gibbs: use a Metropolis-Hastings step (with some proposal distribution) to update that variable, while using exact Gibbs updates for the variables where full conditionals are known. This hybrid approach is extremely common in practice.
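A minimal MH-within-Gibbs sketch. The two-variable target here is a hypothetical example chosen so that the x_1 conditional is normal (exact Gibbs) while the x_2 conditional has no standard form (random-walk Metropolis step); the step size 0.8 is an untuned assumption:

```python
import numpy as np

def log_target(x1, x2):
    # Hypothetical model: x1 | x2 is N(x2, 1), but the quartic term leaves
    # x2 with no standard-family full conditional.
    return -0.5 * (x1 - x2) ** 2 - 0.25 * x2 ** 4

rng = np.random.default_rng(0)
x1, x2 = 0.0, 0.0
draws = np.empty((20_000, 2))
for t in range(draws.shape[0]):
    # Exact Gibbs update for x1: its full conditional is N(x2, 1).
    x1 = rng.normal(x2, 1.0)
    # Random-walk Metropolis step for x2 inside the same sweep.
    prop = x2 + rng.normal(0.0, 0.8)
    if np.log(rng.uniform()) < log_target(x1, prop) - log_target(x1, x2):
        x2 = prop                       # accept; otherwise keep current x2
    draws[t] = x1, x2
print(draws[5_000:].mean(axis=0))  # both near 0 by symmetry of the target
```

The chain remains valid because each sub-step (exact conditional draw or MH update) individually preserves the joint target.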

Watch Out

Gibbs is not always better than MH

The 100% acceptance rate of Gibbs does not mean it is always more efficient than MH. In high-correlation scenarios, a well-tuned MH proposal that accounts for correlations (or Hamiltonian Monte Carlo, which uses gradient information) can vastly outperform component-wise Gibbs. Gibbs excels when the model has conditionally conjugate structure and moderate correlations.

Watch Out

Systematic scan Gibbs does not satisfy detailed balance

A common misconception is that Gibbs sampling always satisfies detailed balance. In fact, only random scan Gibbs satisfies detailed balance. Systematic scan Gibbs, where you cycle through variables in a fixed order, satisfies the weaker property of \pi-invariance: \pi K = \pi, where K is the transition kernel. This is sufficient for convergence but means some theoretical tools (e.g., certain CLT results) require more care.

Watch Out

Gibbs updates use the MOST RECENT values

In systematic scan, when updating x_i, you condition on the values of x_1, \ldots, x_{i-1} that have already been updated in the current iteration, and the values of x_{i+1}, \ldots, x_d from the previous iteration. This is not the same as conditioning on all values from the previous iteration: that variant (sometimes called "synchronous" or "parallel" Gibbs) is a different algorithm whose stationary distribution is, in general, not the target \pi.
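The difference can be checked numerically. The sketch below (bivariate normal with \rho = 0.9, an illustrative choice) runs both update schemes; the sequential sweep recovers the target correlation, while the synchronous variant, which conditions every draw on the previous iteration's values, does not:

```python
import numpy as np

rho = 0.9
s = np.sqrt(1 - rho ** 2)

def run(synchronous, n_iter=60_000, seed=0):
    rng = np.random.default_rng(seed)
    x1 = x2 = 0.0
    out = np.empty((n_iter, 2))
    for t in range(n_iter):
        if synchronous:
            # Both draws condition on the PREVIOUS iteration's values
            # (the tuple RHS is evaluated before assignment).
            x1, x2 = rng.normal(rho * x2, s), rng.normal(rho * x1, s)
        else:
            # Sequential sweep: the x2 update sees the freshly drawn x1.
            x1 = rng.normal(rho * x2, s)
            x2 = rng.normal(rho * x1, s)
        out[t] = x1, x2
    return out[5_000:]

corr_seq = np.corrcoef(run(False).T)[0, 1]
corr_sync = np.corrcoef(run(True).T)[0, 1]
print(f"sequential: {corr_seq:.2f}, synchronous: {corr_sync:.2f}")
```

The synchronous chain ends up with the correct marginals here but essentially zero correlation between the coordinates, i.e. it samples the wrong joint distribution.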

Summary

  • Gibbs sampling updates each variable from its full conditional, cycling through all variables
  • It is a special case of MH with acceptance probability 1
  • Requires known, tractable full conditional distributions
  • Random scan satisfies detailed balance; systematic scan only satisfies \pi-invariance
  • Excels in conditionally conjugate models (normal-normal, beta-binomial)
  • Slow mixing when variables are highly correlated; use blocking
  • When full conditionals are not tractable, use MH within Gibbs

Exercises

ExerciseCore

Problem

Derive the Gibbs sampler for the bivariate normal distribution:

(x_1, x_2) \sim \mathcal{N}\!\left(\begin{pmatrix}\mu_1\\\mu_2\end{pmatrix}, \begin{pmatrix}\sigma_1^2 & \rho\sigma_1\sigma_2\\ \rho\sigma_1\sigma_2 & \sigma_2^2\end{pmatrix}\right)

Find both full conditional distributions \pi(x_1 \mid x_2) and \pi(x_2 \mid x_1).

ExerciseCore

Problem

Prove that the Gibbs update for variable x_i, viewed as an MH step with proposal q(x_i' \mid x_i, x_{-i}) = \pi(x_i' \mid x_{-i}), has acceptance probability exactly 1.

ExerciseAdvanced

Problem

Consider a mixture model: x_i \mid z_i, \mu_1, \mu_2 \sim \mathcal{N}(\mu_{z_i}, 1) with z_i \in \{1, 2\}, P(z_i = 1) = 0.5, and priors \mu_k \sim \mathcal{N}(0, 10). Derive the full conditional distributions for the Gibbs sampler: \pi(z_i \mid \text{rest}) and \pi(\mu_k \mid \text{rest}).

References

Canonical:

  • Geman & Geman (1984), "Stochastic Relaxation, Gibbs Distributions, and the Bayesian Restoration of Images"
  • Gelfand & Smith (1990), "Sampling-Based Approaches to Calculating Marginal Densities"

Current:

  • Robert & Casella, Monte Carlo Statistical Methods (2004), Chapter 10

  • Gelman, Carlin, Stern, Dunson, Vehtari, Rubin, Bayesian Data Analysis (3rd ed., 2013), Chapter 11

  • Brooks et al., Handbook of MCMC (2011), Chapters 1-5

Next Topics

The natural next steps from Gibbs sampling:

  • Griddy Gibbs: approximate Gibbs when full conditionals are not available in closed form
  • MCMC convergence diagnostics: how to assess whether the chain has converged

Last reviewed: April 2026
