

Hamiltonian Monte Carlo

HMC uses gradient information and Hamiltonian dynamics to propose large, distant moves in parameter space that are still accepted with high probability, dramatically outperforming random-walk Metropolis-Hastings in high dimensions.

Advanced · Tier 2 · Stable · ~60 min


Why This Matters

Metropolis-Hastings (MH) with a random walk proposal is the simplest MCMC algorithm, but it fails in high dimensions. The random walk takes tiny steps to maintain a reasonable acceptance rate, resulting in extremely slow exploration. In d dimensions, the mixing time of random-walk MH scales as O(d^2). For even moderate d, this is prohibitively slow.

Hamiltonian Monte Carlo (HMC) solves this by using gradient information to propose distant points in parameter space that are still likely to be accepted. HMC's mixing time scales as O(d^{5/4}), a dramatic improvement that makes Bayesian inference practical in hundreds or thousands of dimensions.

HMC is the engine behind Stan, one of the most widely used probabilistic programming languages, and is the default MCMC algorithm for serious Bayesian modeling.

Mental Model

Imagine a ball rolling on a surface whose height at position q is the negative log-posterior U(q) = -\log p(q | \text{data}). The ball naturally rolls toward regions of high posterior density (low U) and maintains momentum to traverse ridges between modes.

HMC simulates this physics. Instead of randomly proposing a new point, it gives the current point a random "kick" (momentum) and then follows the laws of physics (Hamilton's equations) for a fixed time. The endpoint of this trajectory is the proposal. Because the trajectory follows the posterior landscape, the proposal tends to land in a region of similar density, even if it is far from the starting point.

Formal Setup and Notation

We want to sample from a target distribution with density \pi(q) \propto e^{-U(q)}, where q \in \mathbb{R}^d are the parameters and U(q) = -\log \pi(q) is the potential energy (up to a constant).

HMC introduces an auxiliary momentum variable p \in \mathbb{R}^d and defines the joint distribution:

\pi(q, p) \propto \exp(-H(q, p))

where the Hamiltonian is:

H(q, p) = U(q) + K(p)

The kinetic energy is typically K(p) = \frac{1}{2} p^T M^{-1} p, where M is a mass matrix (often M = I). The marginal distribution of q is the target \pi(q), and p | q \sim \mathcal{N}(0, M).
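For concreteness, these quantities can be written down directly. Below is a minimal sketch for an illustrative 2-D Gaussian target \mathcal{N}(0, \Sigma) with identity mass matrix; the function names and the particular \Sigma are our own choices, not part of any library API:

```python
import numpy as np

# Illustrative target: N(0, Sigma) in d = 2 dimensions.
Sigma = np.array([[1.0, 0.8], [0.8, 1.0]])
Sigma_inv = np.linalg.inv(Sigma)

def U(q):
    """Potential energy: negative log target density (up to a constant)."""
    return 0.5 * q @ Sigma_inv @ q

def grad_U(q):
    """Gradient of the potential energy."""
    return Sigma_inv @ q

def K(p, M_inv=np.eye(2)):
    """Kinetic energy with mass matrix M (here M = I)."""
    return 0.5 * p @ M_inv @ p

def H(q, p):
    """Total energy: the Hamiltonian H(q, p) = U(q) + K(p)."""
    return U(q) + K(p)

q = np.array([1.0, -0.5])
p = np.array([0.3, 0.2])
print(H(q, p))
```

For any target, only U and grad_U change; K and H keep this form.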

Core Definitions

Definition

Hamilton's Equations

The Hamiltonian dynamics are governed by:

\frac{dq}{dt} = \frac{\partial H}{\partial p} = M^{-1} p, \qquad \frac{dp}{dt} = -\frac{\partial H}{\partial q} = -\nabla U(q)

These equations describe how position q and momentum p evolve over time. The key property: the Hamiltonian H(q, p) is conserved along the trajectory. A particle moving along this trajectory stays on a contour of constant H.

Definition

Leapfrog Integrator

Since we cannot solve Hamilton's equations analytically for general U, we use a numerical integrator. The leapfrog (Störmer-Verlet) integrator with step size \epsilon:

p_{t+\epsilon/2} = p_t - \frac{\epsilon}{2}\nabla U(q_t)
q_{t+\epsilon} = q_t + \epsilon \, M^{-1} p_{t+\epsilon/2}
p_{t+\epsilon} = p_{t+\epsilon/2} - \frac{\epsilon}{2}\nabla U(q_{t+\epsilon})

This is repeated L times to produce the proposal. Each step requires one gradient evaluation \nabla U(q).
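The half-step/full-step updates above translate directly into code. A sketch with M = I (the function name and the fused-half-steps arrangement are ours; the fusion saves one gradient evaluation per interior step):

```python
import numpy as np

def leapfrog(q, p, grad_U, eps, L):
    """Run L leapfrog steps from (q, p) with step size eps (mass matrix M = I).

    Interior half momentum steps are fused into full steps, so the loop
    costs one gradient evaluation per step.
    """
    q, p = q.copy(), p.copy()
    p = p - 0.5 * eps * grad_U(q)       # initial half step for momentum
    for i in range(L):
        q = q + eps * p                  # full step for position (M = I)
        if i < L - 1:
            p = p - eps * grad_U(q)      # two fused half steps
    p = p - 0.5 * eps * grad_U(q)        # final half step for momentum
    return q, p

# Standard Gaussian target: U(q) = 0.5 |q|^2, so grad U(q) = q.
q, p = np.array([1.0, 0.0]), np.array([0.0, 1.0])
q_new, p_new = leapfrog(q, p, grad_U=lambda x: x, eps=0.1, L=10)
```

Two properties worth checking numerically: the energy 0.5(|q|^2 + |p|^2) changes only by O(\epsilon^2), and running the integrator from (q_new, -p_new) recovers the starting point with flipped momentum (reversibility).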

Definition

HMC Algorithm

One iteration of HMC:

  1. Resample momentum: Draw p \sim \mathcal{N}(0, M)
  2. Simulate dynamics: Starting from (q, p), run L leapfrog steps with step size \epsilon to get (q^*, p^*)
  3. Negate momentum: Set p^* \leftarrow -p^* (for reversibility)
  4. Metropolis accept/reject: Accept (q^*, p^*) with probability \min(1, \exp(-H(q^*, p^*) + H(q, p)))

If the leapfrog integrator were exact (zero discretization error), the Hamiltonian would be perfectly conserved and the acceptance probability would be 1. In practice, the acceptance rate depends on \epsilon and L.
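Putting the four steps together, one HMC iteration might look like the following sketch (identity mass matrix, leapfrog inlined; the function name and the standard-Gaussian demo target are our own choices):

```python
import numpy as np

def hmc_step(q, U, grad_U, eps, L, rng):
    """One HMC iteration with M = I. Returns (next position, accepted flag)."""
    p = rng.standard_normal(q.shape)            # 1. resample momentum ~ N(0, I)
    q_new, p_new = q.copy(), p.copy()
    p_new = p_new - 0.5 * eps * grad_U(q_new)   # 2. leapfrog trajectory
    for _ in range(L):
        q_new = q_new + eps * p_new
        p_new = p_new - eps * grad_U(q_new)
    p_new = p_new + 0.5 * eps * grad_U(q_new)   # undo the extra half step
    p_new = -p_new                              # 3. negate momentum
    # 4. Metropolis accept/reject on the joint (q, p)
    H_cur = U(q) + 0.5 * (p @ p)
    H_new = U(q_new) + 0.5 * (p_new @ p_new)
    if rng.random() < np.exp(min(0.0, H_cur - H_new)):
        return q_new, True
    return q, False

# Demo: sample a standard 2-D Gaussian, U(q) = 0.5 |q|^2.
rng = np.random.default_rng(0)
q = np.zeros(2)
samples = []
for _ in range(2000):
    q, _ = hmc_step(q, U=lambda x: 0.5 * x @ x, grad_U=lambda x: x,
                    eps=0.2, L=10, rng=rng)
    samples.append(q)
samples = np.array(samples)
```

With the target's mean 0 and unit variance, the sample moments should land close to those values after a couple thousand iterations.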

Main Theorems

Theorem

Conservation Properties of Hamiltonian Dynamics

Statement

The exact Hamiltonian flow has three crucial properties:

  1. Energy conservation: H(q(t), p(t)) = H(q(0), p(0)) for all t.
  2. Volume preservation (Liouville's theorem): The flow preserves the Lebesgue measure on phase space (q, p).
  3. Reversibility: If (q(T), -p(T)) is evolved for time T, it returns to (q(0), -p(0)).

Together, these ensure that the dynamics leave the joint distribution \pi(q, p) \propto e^{-H(q, p)} invariant.

Intuition

Energy conservation means the proposal has the same Hamiltonian value as the starting point, so acceptance probability is 1. Volume preservation means we do not need a Jacobian correction in the acceptance probability. Reversibility ensures detailed balance. These three properties are exactly what we need for a valid MCMC proposal.

Why It Matters

These properties explain why HMC works. The Hamiltonian flow is a measure-preserving, reversible map on phase space that conserves energy. It is the ideal MCMC proposal: it moves you to a distant point with the same target density, guaranteed. The only reason we need the Metropolis-Hastings correction is that the leapfrog integrator introduces discretization error, breaking exact energy conservation.

Proposition

Leapfrog Preserves Volume

Statement

The leapfrog integrator is a symplectic integrator: it preserves the symplectic 2-form on phase space. In particular, it is volume-preserving (the Jacobian of the leapfrog map has determinant 1) and time-reversible.

The local discretization error is O(\epsilon^3) per step, and the error in the Hamiltonian remains O(\epsilon^2) globally (the energy error does not accumulate over steps).

Intuition

The leapfrog integrator inherits the volume-preservation and reversibility of exact Hamiltonian dynamics, even though it does not conserve energy exactly. This is why leapfrog is used instead of, say, Euler's method: Euler's method is neither volume-preserving nor reversible, and would require a Jacobian correction that is expensive to compute.

The non-accumulation of energy error is the key miracle: after L steps, the Hamiltonian error is still O(\epsilon^2), not O(L\epsilon^2). This is because the leapfrog trajectory stays on a "shadow Hamiltonian" that is O(\epsilon^2)-close to the true Hamiltonian.
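This bounded energy error is easy to check numerically. A sketch on the 1-D harmonic oscillator U(q) = q^2/2 with M = 1 (the step counts and starting point are illustrative choices of ours): the maximum energy deviation should stay flat as the trajectory gets longer, and should shrink by roughly 4x when \epsilon is halved, consistent with O(\epsilon^2).

```python
def energy_error(eps, L):
    """Max |H - H0| along a leapfrog trajectory for U(q) = q^2 / 2, M = 1."""
    q, p = 1.0, 0.0
    H0 = 0.5 * (q * q + p * p)
    worst = 0.0
    for _ in range(L):
        p -= 0.5 * eps * q       # half momentum step (grad U(q) = q)
        q += eps * p             # full position step
        p -= 0.5 * eps * q       # half momentum step
        worst = max(worst, abs(0.5 * (q * q + p * p) - H0))
    return worst

# Error stays O(eps^2) even for a 10x longer trajectory...
print(energy_error(0.1, 100), energy_error(0.1, 1000))
# ...and halving eps cuts it by roughly 4x (second order).
print(energy_error(0.1, 100) / energy_error(0.05, 100))
```

Contrast this with a non-symplectic method like forward Euler, whose energy drifts without bound on the same problem.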

Why It Matters

Volume preservation means the Metropolis acceptance probability is simply \min(1, \exp(-\Delta H)) with no Jacobian term. Reversibility ensures detailed balance. The O(\epsilon^2) global error means we can take large steps (large \epsilon) while maintaining high acceptance rates, which is exactly what makes HMC efficient.

Proposition

HMC Satisfies Detailed Balance

Statement

The HMC transition kernel satisfies detailed balance with respect to the joint distribution \pi(q, p) \propto e^{-H(q, p)}. Therefore, \pi(q) is a stationary distribution of the HMC Markov chain (marginalizing over p).

Intuition

The momentum resampling step leaves \pi(q, p) invariant because p is drawn from its conditional \mathcal{N}(0, M). The leapfrog + Metropolis step leaves \pi(q, p) invariant because leapfrog is volume-preserving and reversible, and the Metropolis correction accounts for the energy error. Composing two \pi-invariant steps gives a \pi-invariant chain.

Why HMC Beats Random-Walk MH

The fundamental problem with random-walk MH in d dimensions:

  • To maintain the optimal acceptance rate of \sim 0.234, the step size must scale as O(1/\sqrt{d}).
  • The chain takes steps of size O(1/\sqrt{d}), so it needs O(d) steps to traverse one standard deviation of the target.
  • Total cost to get one effective sample: O(d^2) density evaluations.

HMC with properly tuned \epsilon and L:

  • The leapfrog trajectory can traverse the entire target distribution in one proposal.
  • The optimal step size scales as O(d^{-1/4}).
  • Total cost per effective sample: O(d^{5/4}) gradient evaluations.

The improvement from O(d^2) to O(d^{5/4}) is what makes HMC practical for high-dimensional Bayesian inference. The cost is requiring \nabla U(q), which is not always available.
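The practical gap between these scalings is easy to quantify. A quick illustrative calculation (constants are omitted, so only the ratios between the two columns are meaningful):

```python
# Evaluations per effective sample, up to constant factors.
for d in [10, 100, 1000, 10_000]:
    rw = d ** 2        # random-walk MH: O(d^2) density evaluations
    hmc = d ** 1.25    # HMC: O(d^{5/4}) gradient evaluations
    print(f"d={d:>6}: RW-MH ~ {rw:.1e}, HMC ~ {hmc:.1e}, ratio ~ {rw / hmc:.0f}")
```

The ratio grows as d^{3/4}: negligible in a handful of dimensions, but several hundred-fold by d = 1000.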

MCMC Sampler Comparison

The following table compares HMC against other MCMC methods across the dimensions that matter for practical Bayesian inference.

| Property | Random-walk MH | Gibbs sampling | HMC (NUTS) | Langevin MCMC |
| --- | --- | --- | --- | --- |
| Cost per step | O(1) density eval | O(d) conditional evals | O(L) gradient evals | O(1) gradient eval |
| Scaling with d | O(d^2) for one effective sample | O(d^2) (worse with correlation) | O(d^{5/4}) | O(d^{5/3}) |
| Requires gradients | No | No | Yes | Yes |
| Handles correlation | Poorly (slow mixing) | Poorly (blocked Gibbs helps) | Well (momentum carries through correlated dimensions) | Moderately |
| Tuning parameters | Proposal scale | Conditional distributions | \epsilon, L (NUTS auto-tunes both) | Step size only |
| Optimal acceptance rate | 0.234 | 1.0 (exact conditionals) | 0.65 (HMC) or 0.80 (NUTS) | 0.574 |
| Multimodal targets | Very slow | Very slow | Slow (cannot jump between modes) | Very slow |

When HMC Fails

HMC is not a universal solution. Several target distribution geometries cause problems.

Multimodal targets. HMC explores a single mode efficiently but cannot jump between well-separated modes. The Hamiltonian dynamics are continuous, so the trajectory must pass through the low-density region between modes. If this region has very low density (high energy barrier), the trajectory would need enormous kinetic energy (momentum) to cross it, which is extremely unlikely under the Gaussian momentum distribution. For multimodal posteriors, parallel tempering or mixture proposals are needed alongside HMC.

Discontinuous densities. The leapfrog integrator requires \nabla U(q), which does not exist at density discontinuities. Models with discrete parameters, hard constraints, or indicator functions cannot be sampled directly with HMC. Reparameterization (replacing hard constraints with smooth approximations) is the standard workaround.

Funnel geometries. Neal's funnel is the canonical example: v \sim \mathcal{N}(0, 3^2), x_i | v \sim \mathcal{N}(0, e^v). When v is large and negative, the x_i are tightly concentrated near zero, requiring very small step sizes. When v is large and positive, the x_i spread widely, allowing large steps. No single step size works across the entire distribution. Non-centered parameterizations (writing the model in terms of standard normal variates and transforming) resolve this for many hierarchical models.
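Both the funnel and its non-centered fix are short to write down. A sketch (function names, the 9-dimensional x, and the constant-dropping in the log density are our own choices): in the non-centered form one samples standard normal variates and transforms, so the geometry the sampler would see is an isotropic Gaussian rather than a funnel.

```python
import numpy as np

def funnel_log_density(v, x):
    """Centered parameterization: v ~ N(0, 3^2), x_i | v ~ N(0, e^v).

    Constants dropped. The x-scale varies as e^{v/2}, so no single
    HMC step size works across the whole distribution.
    """
    d = len(x)
    return -0.5 * v ** 2 / 9.0 - 0.5 * d * v - 0.5 * np.exp(-v) * np.sum(x ** 2)

def non_centered_sample(rng, d=9):
    """Non-centered: draw (v_raw, x_raw) ~ N(0, I), then transform."""
    v = 3.0 * rng.standard_normal()                 # v = 3 * v_raw
    x = np.exp(v / 2.0) * rng.standard_normal(d)    # x_i = e^{v/2} * x_raw_i
    return v, x

rng = np.random.default_rng(1)
v, x = non_centered_sample(rng)
```

In a probabilistic programming language, the non-centered version corresponds to declaring the raw standard-normal parameters and building v and x as deterministic transforms.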

Very high dimensions without structure. At d > 10{,}000, even HMC's O(d^{5/4}) scaling becomes expensive. Each leapfrog step requires a full gradient computation over all parameters. For large neural networks, variational inference (which uses gradient descent to optimize an approximate posterior) is preferred over MCMC because its cost per iteration does not grow with the number of posterior samples needed.

NUTS: No-U-Turn Sampler

Definition

NUTS (No-U-Turn Sampler)

HMC has two tuning parameters: step size \epsilon and number of leapfrog steps L. Choosing L is difficult:

  • Too small L: the trajectory does not explore enough, behaving like a random walk
  • Too large L: the trajectory doubles back on itself, wasting computation

NUTS (Hoffman & Gelman, 2014) automatically selects L by running the leapfrog trajectory until it makes a "U-turn", i.e., until the trajectory starts moving back toward the starting point. Formally, it uses a doubling scheme: double the trajectory length until the criterion p \cdot (q_{\text{forward}} - q_{\text{backward}}) < 0 is met.

NUTS also adapts \epsilon during warmup to achieve a target acceptance rate (typically 0.8). This makes HMC tuning-free.

NUTS is the default sampler in Stan and PyMC. In practice, you almost never tune HMC by hand. You use NUTS and let it adapt.
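The U-turn criterion itself is a one-liner. The sketch below checks it along a single leapfrog trajectory on a standard Gaussian; this illustrates only the termination test, not the full doubling and multinomial-sampling machinery of NUTS (the function name and starting point are ours):

```python
import numpy as np

def u_turn(q_start, q, p):
    """True when the current end is moving back toward the starting point."""
    return p @ (q - q_start) < 0.0

# Leapfrog on U(q) = 0.5 |q|^2 (grad U(q) = q) until a U-turn occurs.
eps = 0.1
q0 = np.array([1.0, 0.0])
q = q0.copy()
p = np.array([0.0, 1.0])
steps = 0
while not u_turn(q0, q, p):
    p = p - 0.5 * eps * q
    q = q + eps * p
    p = p - 0.5 * eps * q
    steps += 1
print(steps)  # the Gaussian trajectory is (nearly) circular, so roughly pi/eps
```

For a Gaussian target the trajectory is close to a circle in phase space, so the criterion fires after about half an orbit; a fixed L much larger than this would waste gradient evaluations retracing the same arc.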

Connection to Symplectic Geometry

The leapfrog integrator is a symplectic integrator, meaning it preserves the symplectic 2-form \omega = \sum_i dq_i \wedge dp_i. Volume preservation is a consequence of symplecticity (by Liouville's theorem for symplectic maps).

Symplecticity is strictly stronger than volume preservation. It ensures that the leapfrog trajectory is qualitatively correct: it stays on a "shadow Hamiltonian" \tilde{H} that is O(\epsilon^2)-close to H. Non-symplectic integrators (like Euler) can be volume-preserving but still exhibit energy drift, making them unsuitable for HMC.

This connection to symplectic geometry is not just theoretical elegance. It is the practical reason why HMC works. The mathematical structure of Hamiltonian systems guarantees that the leapfrog integrator has the exact properties needed for efficient MCMC.

Proof Ideas and Templates Used

The correctness of HMC uses:

  1. Liouville's theorem: Hamiltonian flow preserves phase space volume, so no Jacobian correction is needed.
  2. Detailed balance via reversibility: The momentum flip plus leapfrog reversibility ensures the proposal is its own inverse, satisfying detailed balance with the Metropolis correction.
  3. Backward error analysis: The leapfrog trajectory exactly solves a modified Hamiltonian \tilde{H} = H + O(\epsilon^2), explaining why the energy error is bounded.

Canonical Examples

Example

Sampling from a correlated Gaussian

Consider \pi(q) = \mathcal{N}(0, \Sigma) with \Sigma having condition number \kappa. Random-walk MH takes O(\kappa^2) steps per effective sample. HMC with mass matrix M = \Sigma^{-1} (preconditioning) takes O(\kappa^{1/2}) leapfrog steps per effective sample. For \kappa = 100, this is the difference between 10,000 steps and 10 steps.

Example

Bayesian logistic regression

With d = 100 parameters and n = 1000 data points, random-walk MH requires tens of thousands of iterations to converge. HMC (via Stan) typically converges in a few hundred iterations, each requiring L \approx 10-50 leapfrog steps. The gradient \nabla U(q) is the gradient of the negative log-posterior, computable in O(nd) time.
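The gradient in question is short to write down. For logistic regression with labels y_i \in \{-1, +1\} and a standard normal prior, U(q) = \sum_i \log(1 + e^{-y_i x_i^T q}) + \frac{1}{2}|q|^2, and its gradient costs O(nd) via two matrix-vector products. A sketch on synthetic data (the function name and data-generating choices are ours):

```python
import numpy as np

def grad_U_logistic(q, X, y):
    """Gradient of the negative log-posterior for Bayesian logistic regression.

    Model: y_i in {-1, +1}, p(y_i | x_i, q) = sigmoid(y_i * x_i @ q),
    prior q ~ N(0, I). Cost: O(n d) from the two matrix-vector products.
    """
    z = y * (X @ q)                        # (n,)
    sig = 1.0 / (1.0 + np.exp(-z))         # sigmoid(z)
    # d/dq [ sum_i log(1 + e^{-z_i}) ] = -X^T (y * (1 - sigmoid(z)))
    return -X.T @ (y * (1.0 - sig)) + q    # "+ q" from the N(0, I) prior

# Synthetic data matching the example's scale: n = 1000, d = 100.
rng = np.random.default_rng(0)
n, d = 1000, 100
X = rng.standard_normal((n, d))
q_true = rng.standard_normal(d)
y = np.where(X @ q_true + rng.standard_normal(n) > 0, 1.0, -1.0)
g = grad_U_logistic(np.zeros(d), X, y)     # one gradient evaluation
```

Plugging this function into any HMC implementation (or letting Stan's autodiff derive it) is all the model-specific work the sampler needs.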

Common Confusions

Watch Out

HMC requires gradients, not Hessians

HMC uses only first-order gradient information \nabla U(q), not the Hessian. The inverse mass matrix M^{-1} can be set to an estimate of the posterior covariance (as NUTS does during warmup), but this is a preconditioning choice, not a requirement. Each leapfrog step costs one gradient evaluation, the same as one step of gradient descent.

Watch Out

The momentum is auxiliary and gets discarded

After each HMC step, the momentum p is resampled from scratch. It is an auxiliary variable introduced solely to enable Hamiltonian dynamics. The samples of interest are the positions q only. The momentum is never part of the output; it is a computational device.

Watch Out

HMC is not gradient descent

Although HMC uses gradients, it is structurally different from gradient descent. Gradient descent deterministically moves toward a mode. HMC uses gradients to navigate the entire posterior distribution, including regions far from the mode. The random momentum resampling ensures exploration, and the Metropolis correction ensures correctness. HMC samples from the posterior; gradient descent finds the MAP estimate.

Watch Out

The acceptance rate should not be 1.0

An acceptance rate of 1.0 means \epsilon is too small and the chain is taking unnecessarily short steps. The optimal acceptance rate for HMC is around 0.65 (for standard HMC) or 0.80 (for NUTS). Lower rates mean \epsilon is too large (too much discretization error); higher rates mean \epsilon is too small (too many wasted gradient evaluations).

Summary

  • HMC = Hamiltonian dynamics + leapfrog integrator + Metropolis correction
  • Position q = parameters, momentum p = auxiliary (Gaussian)
  • Hamiltonian H(q, p) = U(q) + K(p) where U = -\log \pi(q)
  • Leapfrog is symplectic: volume-preserving, reversible, O(\epsilon^2) energy error
  • Scaling: O(d^{5/4}) vs. random-walk MH's O(d^2)
  • NUTS automatically tunes trajectory length L and step size \epsilon
  • NUTS is the default in Stan and PyMC; use it for Bayesian inference

Exercises

Exercise (Core)

Problem

Write out the three steps of one leapfrog iteration for U(q) = \frac{1}{2} q^T \Sigma^{-1} q (Gaussian target) with M = I. What is \nabla U(q)?

Exercise (Advanced)

Problem

Explain why Euler's method (q_{t+\epsilon} = q_t + \epsilon p_t, p_{t+\epsilon} = p_t - \epsilon \nabla U(q_t)) is unsuitable for HMC, even though it is simpler than leapfrog.

Exercise (Advanced)

Problem

In d = 1000 dimensions, estimate the relative cost (in gradient evaluations) of getting one effective sample from random-walk MH vs. HMC. Assume the target is a standard Gaussian.

References

Canonical:

  • Neal, "MCMC Using Hamiltonian Dynamics," Handbook of MCMC (2011)
  • Hoffman & Gelman, "The No-U-Turn Sampler" (NUTS, 2014)

Current:

  • Betancourt, "A Conceptual Introduction to Hamiltonian Monte Carlo" (2017)

  • Stan Development Team, Stan User's Guide

  • Robert & Casella, Monte Carlo Statistical Methods (2004), Chapters 3-7

  • Brooks et al., Handbook of MCMC (2011), Chapters 1-5

Next Topics

The natural next steps from HMC:

  • Riemannian HMC: adapting the mass matrix to local curvature
  • Variational inference: the optimization-based alternative to MCMC
  • Importance sampling: a non-MCMC approach to estimating expectations, useful when the target is easy to evaluate but hard to sample from directly

Last reviewed: April 2026
