Sampling MCMC
Hamiltonian Monte Carlo
HMC uses gradient information and Hamiltonian dynamics to propose large, distant moves in parameter space that are still accepted with high probability, dramatically outperforming random-walk Metropolis-Hastings in high dimensions.
Why This Matters
Metropolis-Hastings (MH) with a random walk proposal is the simplest MCMC algorithm, but it fails in high dimensions. The random walk takes tiny steps to maintain a reasonable acceptance rate, resulting in extremely slow exploration. In $d$ dimensions, the mixing time of random-walk MH scales as $O(d)$, on top of the growing cost of each density evaluation. For even moderate $d$, this is prohibitively slow.
Hamiltonian Monte Carlo (HMC) solves this by using gradient information to propose distant points in parameter space that are still likely to be accepted. HMC's mixing time scales as $O(d^{1/4})$, a dramatic improvement that makes Bayesian inference practical in hundreds or thousands of dimensions.
HMC is the engine behind Stan, one of the most widely used probabilistic programming languages, and is the default MCMC algorithm for serious Bayesian modeling.
Mental Model
Imagine a ball rolling on a surface whose height at position $q$ is the negative log-posterior $U(q) = -\log \pi(q)$. The ball naturally rolls toward regions of high posterior density (low $U$) and maintains momentum to traverse ridges between modes.
HMC simulates this physics. Instead of randomly proposing a new point, it gives the current point a random "kick" (momentum) and then follows the laws of physics (Hamilton's equations) for a fixed time. The endpoint of this trajectory is the proposal. Because the trajectory follows the posterior landscape, the proposal tends to be in a region of similar density, even if it is far from the starting point.
Formal Setup and Notation
We want to sample from a target distribution with density $\pi(q) \propto \exp(-U(q))$, where $q \in \mathbb{R}^d$ are the parameters and $U(q) = -\log \pi(q)$ is the potential energy (up to an additive constant).
HMC introduces an auxiliary momentum variable $p \in \mathbb{R}^d$ and defines the joint distribution:
$$\pi(q, p) \propto \exp(-H(q, p))$$
where the Hamiltonian is:
$$H(q, p) = U(q) + K(p)$$
The kinetic energy is typically $K(p) = \frac{1}{2} p^\top M^{-1} p$, where $M$ is a positive-definite mass matrix (often $M = I$). The marginal distribution of $q$ is the target $\pi(q)$, and marginally $p \sim \mathcal{N}(0, M)$.
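As a small sketch of this bookkeeping (the names `hamiltonian`, `sample_momentum`, and the diagonal-mass-matrix restriction are illustrative choices, not part of any particular library):

```python
import numpy as np

def hamiltonian(q, p, U, M_diag):
    """H(q, p) = U(q) + (1/2) p^T M^{-1} p for a diagonal mass matrix M."""
    return U(q) + 0.5 * np.sum(p**2 / M_diag)

def sample_momentum(rng, M_diag):
    """Draw p ~ N(0, M): scale standard normals by sqrt(M) componentwise."""
    return rng.standard_normal(M_diag.shape) * np.sqrt(M_diag)
```

A heavier mass in one coordinate gives the momentum more variance there, which is exactly the preconditioning lever discussed later.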
Core Definitions
Hamilton's Equations
The Hamiltonian dynamics are governed by:
$$\frac{dq}{dt} = \frac{\partial H}{\partial p} = M^{-1} p, \qquad \frac{dp}{dt} = -\frac{\partial H}{\partial q} = -\nabla U(q)$$
These equations describe how position and momentum evolve over time. The key property: the Hamiltonian is conserved along the trajectory ($dH/dt = 0$). A particle moving along this trajectory stays on a contour of constant $H$.
Leapfrog Integrator
Since we cannot solve Hamilton's equations analytically for general $U$, we use a numerical integrator. The leapfrog (Störmer-Verlet) integrator with step size $\varepsilon$:
$$p_{t + \varepsilon/2} = p_t - \frac{\varepsilon}{2} \nabla U(q_t)$$
$$q_{t + \varepsilon} = q_t + \varepsilon\, M^{-1} p_{t + \varepsilon/2}$$
$$p_{t + \varepsilon} = p_{t + \varepsilon/2} - \frac{\varepsilon}{2} \nabla U(q_{t + \varepsilon})$$
This is repeated $L$ times to produce the proposal. Each step requires one gradient evaluation $\nabla U(q)$.
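The half-step/full-step pattern above can be sketched directly in NumPy. This is a minimal illustration (the function name and the scalar `M_inv` argument are choices made here), assuming the caller supplies a gradient function:

```python
import numpy as np

def leapfrog(q, p, grad_U, eps, L, M_inv=1.0):
    """Run L leapfrog steps of Hamiltonian dynamics.

    grad_U: callable returning the gradient of the potential U(q).
    Consecutive momentum half-steps are fused into full steps.
    """
    q, p = np.copy(q), np.copy(p)
    p = p - 0.5 * eps * grad_U(q)      # initial half-step for momentum
    for _ in range(L - 1):
        q = q + eps * M_inv * p        # full position step
        p = p - eps * grad_U(q)        # two fused momentum half-steps
    q = q + eps * M_inv * p
    p = p - 0.5 * eps * grad_U(q)      # final half-step for momentum
    return q, p
```

Note the fusing: except at the trajectory's endpoints, the trailing half-step of one iteration and the leading half-step of the next combine into a single full momentum update, so the loop still costs one gradient evaluation per step.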
HMC Algorithm
One iteration of HMC:
- Resample momentum: Draw $p \sim \mathcal{N}(0, M)$
- Simulate dynamics: Starting from $(q, p)$, run $L$ leapfrog steps with step size $\varepsilon$ to get $(q^*, p^*)$
- Negate momentum: Set $p^* \leftarrow -p^*$ (for reversibility)
- Metropolis accept/reject: Accept $(q^*, p^*)$ with probability $\min\{1, \exp(H(q, p) - H(q^*, p^*))\}$
If the leapfrog integrator were exact (zero discretization error), the Hamiltonian would be perfectly conserved and the acceptance probability would be 1. In practice, the acceptance rate depends on $\varepsilon$ and $L$.
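Putting the four steps together, one HMC iteration might be sketched as follows (a minimal implementation with $M = I$; the `log_prob_grad` callback convention returning both $U(q)$ and $\nabla U(q)$ is a choice made here for efficiency):

```python
import numpy as np

def hmc_step(q, log_prob_grad, eps, L, rng):
    """One HMC iteration for a target with potential U(q) = -log pi(q).

    log_prob_grad: callable q -> (U(q), grad_U(q)).
    Returns the new position and whether the proposal was accepted.
    """
    p = rng.standard_normal(q.shape)          # 1. resample momentum (M = I)
    U0, g = log_prob_grad(q)
    H0 = U0 + 0.5 * (p @ p)

    q_new, p_new = np.copy(q), p - 0.5 * eps * g
    for i in range(L):                        # 2. simulate leapfrog dynamics
        q_new = q_new + eps * p_new
        U1, g = log_prob_grad(q_new)
        p_new = p_new - (eps if i < L - 1 else 0.5 * eps) * g
    # 3. momentum negation is implicit: K(p) = K(-p), so H is unchanged
    H1 = U1 + 0.5 * (p_new @ p_new)

    # 4. Metropolis correction for the leapfrog discretization error
    if rng.random() < min(1.0, np.exp(H0 - H1)):
        return q_new, True
    return q, False
```

Because the kinetic energy is symmetric in $p$, the momentum flip never changes the acceptance probability; it only matters for the detailed-balance argument, so the code need not perform it explicitly.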
Main Theorems
Conservation Properties of Hamiltonian Dynamics
Statement
The exact Hamiltonian flow has three crucial properties:
- Energy conservation: $H(q(t), p(t)) = H(q(0), p(0))$ for all $t$.
- Volume preservation (Liouville's theorem): The flow preserves the Lebesgue measure on phase space $(q, p)$.
- Reversibility: If $(q(t), -p(t))$ is evolved for time $t$, it returns to $(q(0), -p(0))$: negating the momentum retraces the trajectory.
Together, these ensure that the dynamics leave the joint distribution $\pi(q, p)$ invariant.
Intuition
Energy conservation means the proposal has the same Hamiltonian value as the starting point, so the acceptance probability is 1. Volume preservation means we do not need a Jacobian correction in the acceptance probability. Reversibility ensures detailed balance. These three properties are exactly what we need for a valid MCMC proposal.
Why It Matters
These properties explain why HMC works. The Hamiltonian flow is a measure-preserving, reversible map on phase space that conserves energy. It is the ideal MCMC proposal: it moves you to a distant point with the same target density, guaranteed. The only reason we need the Metropolis-Hastings correction is that the leapfrog integrator introduces discretization error, breaking exact energy conservation.
Leapfrog Preserves Volume
Statement
The leapfrog integrator is a symplectic integrator: it preserves the symplectic 2-form $\sum_i dq_i \wedge dp_i$ on phase space. In particular, it is volume-preserving (the Jacobian of the leapfrog map has determinant 1) and time-reversible.
The discretization error in the Hamiltonian is $O(\varepsilon^3)$ per step and $O(\varepsilon^2)$ globally (the error does not accumulate over steps).
Intuition
The leapfrog integrator inherits the volume-preservation and reversibility of exact Hamiltonian dynamics, even though it does not conserve energy exactly. This is why leapfrog is used instead of, say, Euler's method: Euler's method is neither volume-preserving nor reversible, and would require a Jacobian correction that is expensive to compute.
The non-accumulation of energy error is the key miracle: after $L$ steps, the Hamiltonian error is still $O(\varepsilon^2)$; it does not grow with $L$. This is because the leapfrog trajectory stays on a "shadow Hamiltonian" that is $O(\varepsilon^2)$-close to the true Hamiltonian.
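This contrast is easy to verify numerically. The sketch below (illustrative helper names, harmonic-oscillator target $U(q) = q^2/2$) tracks the worst energy error over many steps for leapfrog versus explicit Euler:

```python
import numpy as np

# Harmonic oscillator: U(q) = q^2/2; exact trajectories are circles in phase space.
grad_U = lambda q: q
H = lambda q, p: 0.5 * q**2 + 0.5 * p**2

def leapfrog_step(q, p, eps):
    p = p - 0.5 * eps * grad_U(q)
    q = q + eps * p
    p = p - 0.5 * eps * grad_U(q)
    return q, p

def euler_step(q, p, eps):
    # Explicit Euler: both updates use the state at the start of the step.
    return q + eps * p, p - eps * grad_U(q)

def worst_energy_error(step, n_steps, eps=0.1, q=1.0, p=0.0):
    """Largest deviation |H - H0| observed along an n_steps trajectory."""
    H0, worst = H(q, p), 0.0
    for _ in range(n_steps):
        q, p = step(q, p, eps)
        worst = max(worst, abs(H(q, p) - H0))
    return worst
```

Leapfrog's error oscillates in a band of width $O(\varepsilon^2)$ no matter how long the trajectory, while Euler's grows exponentially (its update matrix has determinant $1 + \varepsilon^2 > 1$, so it spirals outward).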
Why It Matters
Volume preservation means the Metropolis acceptance probability is simply $\min\{1, \exp(-\Delta H)\}$ with no Jacobian term. Reversibility ensures detailed balance. The global $O(\varepsilon^2)$ error means we can take large steps (large $\varepsilon$) while maintaining high acceptance rates, which is exactly what makes HMC efficient.
HMC Satisfies Detailed Balance
Statement
The HMC transition kernel satisfies detailed balance with respect to the joint distribution $\pi(q, p) \propto \exp(-H(q, p))$. Therefore, $\pi(q)$ is a stationary distribution of the HMC Markov chain (marginalizing over $p$).
Intuition
The momentum resampling step leaves $\pi(q, p)$ invariant because $p$ is drawn from its exact conditional $\mathcal{N}(0, M)$. The leapfrog + Metropolis step leaves $\pi(q, p)$ invariant because leapfrog is volume-preserving and reversible, and the Metropolis correction accounts for the energy error. Composing two $\pi$-invariant steps gives a $\pi$-invariant chain.
Why HMC Beats Random-Walk MH
The fundamental problem with random-walk MH in $d$ dimensions:
- To maintain an acceptance rate of $\approx 0.234$ (optimal), the step size must scale as $\varepsilon \propto d^{-1/2}$.
- The chain takes steps of size $O(d^{-1/2})$, so it needs $O(d)$ steps to traverse one standard deviation of the target.
- Total cost to get one effective sample: $O(d)$ density evaluations.
HMC with properly tuned $\varepsilon$ and $L$:
- The leapfrog trajectory can traverse the entire target distribution in one proposal.
- The optimal step size scales as $\varepsilon \propto d^{-1/4}$.
- Total cost per effective sample: $O(d^{1/4})$ gradient evaluations.
The improvement from $O(d)$ to $O(d^{1/4})$ is what makes HMC practical for high-dimensional Bayesian inference. The cost is requiring $\nabla U(q)$, which is not always available.
MCMC Sampler Comparison
The following table compares HMC against other MCMC methods across the dimensions that matter for practical Bayesian inference.
| Property | Random-walk MH | Gibbs sampling | HMC (NUTS) | Langevin MCMC |
|---|---|---|---|---|
| Cost per step | 1 density eval | $d$ conditional evals | $L$ gradient evals | 1 gradient eval |
| Scaling with $d$ | $O(d)$ for one effective sample | problem-dependent (worse with correlation) | $O(d^{1/4})$ | $O(d^{1/3})$ |
| Requires gradients | No | No | Yes | Yes |
| Handles correlation | Poorly (slow mixing) | Poorly (blocked Gibbs helps) | Well (momentum carries through correlated dimensions) | Moderately |
| Tuning parameters | Proposal scale | Conditional distributions | $\varepsilon$, $L$ (NUTS auto-tunes both) | Step size only |
| Optimal acceptance rate | 0.234 | 1.0 (exact conditionals) | 0.65 (HMC) or 0.80 (NUTS) | 0.574 |
| Multimodal targets | Very slow | Very slow | Slow (cannot jump between modes) | Very slow |
When HMC Fails
HMC is not a universal solution. Several target distribution geometries cause problems.
Multimodal targets. HMC explores a single mode efficiently but cannot jump between well-separated modes. The Hamiltonian dynamics are continuous, so the trajectory must pass through the low-density region between modes. If this region has very low density (high energy barrier), the trajectory would need enormous kinetic energy (momentum) to cross it, which is extremely unlikely under the Gaussian momentum distribution. For multimodal posteriors, parallel tempering or mixture proposals are needed alongside HMC.
Discontinuous densities. The leapfrog integrator requires $\nabla U(q)$, which does not exist at density discontinuities. Models with discrete parameters, hard constraints, or indicator functions cannot be sampled directly with HMC. Reparameterization (replacing hard constraints with smooth approximations) is the standard workaround.
Funnel geometries. Neal's funnel is the canonical example: $v \sim \mathcal{N}(0, 3^2)$, $x_i \mid v \sim \mathcal{N}(0, e^v)$. When $v$ is large and negative, the $x_i$ are tightly concentrated near zero, requiring very small step sizes. When $v$ is large and positive, the $x_i$ spread widely, allowing large steps. No single step size works across the entire distribution. Non-centered parameterizations (writing the model in terms of standard normal variates and transforming) resolve this for many hierarchical models.
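The funnel density and the non-centered fix can be sketched concretely. This is an illustration under the parameterization above (function names are choices made here; the log-density drops additive constants):

```python
import numpy as np

def funnel_logp(v, x):
    """Log-density of Neal's funnel, up to an additive constant:
    v ~ N(0, 3^2), x_i | v ~ N(0, exp(v))."""
    return (-0.5 * v**2 / 9.0
            - 0.5 * np.sum(x**2) * np.exp(-v)
            - 0.5 * len(x) * v)            # log-variance term of N(0, e^v)

def noncentered_sample(rng, n=9):
    """Non-centered parameterization: sample standard normals, transform after.

    A sampler working on the raw variates sees only N(0, I) geometry; the
    deterministic transform reintroduces the funnel correlation afterward.
    """
    v = 3.0 * rng.standard_normal()
    x = np.exp(v / 2.0) * rng.standard_normal(n)
    return v, x
```

In the non-centered form, HMC never has to navigate the funnel at all: the hard geometry lives entirely in the push-forward transform, not in the distribution being sampled.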
Very high dimensions without structure. When $d$ reaches the millions (e.g., neural network posteriors), even HMC's $O(d^{1/4})$ scaling becomes expensive. Each leapfrog step requires a full gradient computation over all parameters. For large neural networks, variational inference (which uses gradient descent to optimize an approximate posterior) is preferred over MCMC because its cost per iteration does not grow with the number of posterior samples needed.
NUTS: No-U-Turn Sampler
HMC has two tuning parameters: step size $\varepsilon$ and number of leapfrog steps $L$. Choosing $L$ is difficult:
- Too small $L$: the trajectory does not explore enough, behaving like a random walk
- Too large $L$: the trajectory doubles back on itself, wasting computation
NUTS (Hoffman & Gelman, 2014) automatically selects $L$ by running the leapfrog trajectory until it makes a "U-turn", i.e., until the trajectory starts moving back toward the starting point. Formally, it uses a doubling scheme: the trajectory length is repeatedly doubled until the U-turn criterion is met.
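The U-turn check itself is a single dot-product condition between the chord connecting the trajectory endpoints and the momenta at those endpoints (a sketch of the criterion from Hoffman & Gelman 2014; the function name is a choice made here):

```python
import numpy as np

def u_turn(q_minus, q_plus, p_minus, p_plus):
    """True when either endpoint of the trajectory is moving back along
    the chord q_plus - q_minus, i.e., the trajectory has begun to fold."""
    dq = q_plus - q_minus
    return bool((dq @ p_minus) < 0 or (dq @ p_plus) < 0)
```

Full NUTS combines this check with a recursive tree-doubling and a slice/multinomial sample over the visited states, but the stopping rule above is the heart of the "no U-turn" name.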
NUTS also adapts $\varepsilon$ during warmup to achieve a target acceptance rate (typically 0.8). This makes HMC effectively tuning-free.
NUTS is the default sampler in Stan and PyMC. In practice, you almost never tune HMC by hand. You use NUTS and let it adapt.
Connection to Symplectic Geometry
The leapfrog integrator is a symplectic integrator, meaning it preserves the symplectic 2-form $\sum_i dq_i \wedge dp_i$. Volume preservation is a consequence of symplecticity (by Liouville's theorem for symplectic maps).
Symplecticity is strictly stronger than volume preservation. It ensures that the leapfrog trajectory is qualitatively correct: it stays on a "shadow Hamiltonian" $\tilde{H}$ that is $O(\varepsilon^2)$-close to $H$. Non-symplectic integrators (like Euler's method) can be volume-preserving but still exhibit energy drift, making them unsuitable for HMC.
This connection to symplectic geometry is not just theoretical elegance. It is the practical reason why HMC works. The mathematical structure of Hamiltonian systems guarantees that the leapfrog integrator has the exact properties needed for efficient MCMC.
Proof Ideas and Templates Used
The correctness of HMC uses:
- Liouville's theorem: Hamiltonian flow preserves phase space volume, so no Jacobian correction is needed.
- Detailed balance via reversibility: The momentum flip plus leapfrog reversibility ensures the proposal is its own inverse, satisfying detailed balance with the Metropolis correction.
- Backward error analysis: The leapfrog trajectory exactly solves a modified Hamiltonian $\tilde{H} = H + O(\varepsilon^2)$, explaining why the energy error is bounded.
Canonical Examples
Sampling from a correlated Gaussian
Consider $\pi = \mathcal{N}(0, \Sigma)$ with $\Sigma$ having condition number $\kappa = 10^4$. Random-walk MH takes on the order of $\kappa$ steps per effective sample. HMC takes on the order of $\kappa^{1/4}$ leapfrog steps per effective sample, and a well-chosen mass matrix $M$ (preconditioning) reduces the effective $\kappa$ further. For $\kappa = 10^4$, this is the difference between 10,000 steps and 10 steps.
Bayesian logistic regression
With $d$ parameters and $n$ data points, random-walk MH requires tens of thousands of iterations to converge. HMC (via Stan) typically converges in a few hundred iterations, each requiring on the order of 10--50 leapfrog steps. The gradient $\nabla U(\beta)$ is the gradient of the negative log-posterior, computable in $O(nd)$ time.
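The $O(nd)$ gradient is just two matrix-vector products. A sketch for a Gaussian prior $\beta \sim \mathcal{N}(0, \sigma^2 I)$ (the function name and the `prior_var` default are choices made here, not Stan's API):

```python
import numpy as np

def logistic_grad(beta, X, y, prior_var=10.0):
    """Gradient of the negative log-posterior for Bayesian logistic
    regression with a N(0, prior_var * I) prior on beta.

    X: (n, d) design matrix, y: (n,) labels in {0, 1}.
    Cost: O(n d), dominated by the two matrix-vector products.
    """
    probs = 1.0 / (1.0 + np.exp(-(X @ beta)))   # predicted P(y=1)
    return X.T @ (probs - y) + beta / prior_var  # likelihood + prior terms
```

Each leapfrog step calls this once, so one HMC iteration with $L$ steps costs $O(L n d)$.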
Common Confusions
HMC requires gradients, not Hessians
HMC uses only first-order gradient information $\nabla U(q)$, not the Hessian. The mass matrix $M$ can be set to an estimate of the inverse posterior covariance (as NUTS does during warmup), but this is a preconditioning choice, not a requirement. Each leapfrog step costs one gradient evaluation, the same as one step of gradient descent.
The momentum is auxiliary and gets discarded
After each HMC step, the momentum is resampled from scratch. It is an auxiliary variable introduced solely to enable Hamiltonian dynamics. The samples of interest are the positions only. The momentum is never part of the output; it is a computational device.
HMC is not gradient descent
Although HMC uses gradients, it is structurally different from gradient descent. Gradient descent deterministically moves toward a mode. HMC uses gradients to navigate the entire posterior distribution, including regions far from the mode. The random momentum resampling ensures exploration, and the Metropolis correction ensures correctness. HMC samples from the posterior; gradient descent finds the MAP estimate.
The acceptance rate should not be 1.0
An acceptance rate of 1.0 means $\varepsilon$ is too small and the chain is taking unnecessarily short steps. The optimal acceptance rate is around 0.65 for standard HMC or 0.80 for NUTS. Lower rates mean $\varepsilon$ is too large (too much discretization error); higher rates mean $\varepsilon$ is too small (too many wasted gradient evaluations).
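Warmup adaptation exploits this monotone relationship between $\varepsilon$ and the acceptance rate. Stan uses dual averaging (Hoffman & Gelman, 2014); the sketch below is a much cruder multiplicative Robbins-Monro rule, included only to show the direction of the adjustment:

```python
import math

def adapt_step_size(eps, accept_rate, target=0.8, lr=0.05):
    """Nudge eps up when acceptance is above target, down when below.

    Illustrative only: real samplers use dual averaging with decaying
    learning rates and freeze eps after warmup.
    """
    return eps * math.exp(lr * (accept_rate - target))
```

Because acceptance decreases as $\varepsilon$ grows, this feedback loop settles near the step size that hits the target rate.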
Summary
- HMC = Hamiltonian dynamics + leapfrog integrator + Metropolis correction
- Position $q$ = parameters; momentum $p$ = auxiliary Gaussian variable
- Hamiltonian $H(q, p) = U(q) + K(p)$, where $U(q) = -\log \pi(q)$
- Leapfrog is symplectic: volume-preserving, reversible, energy error $O(\varepsilon^2)$
- Scaling: $O(d^{1/4})$ vs. random-walk MH's $O(d)$
- NUTS automatically tunes trajectory length $L$ and step size $\varepsilon$
- NUTS is the default in Stan and PyMC; use it for Bayesian inference
Exercises
Problem
Write out the three steps of one leapfrog iteration for $U(q) = q^2/2$ (standard Gaussian target) with $M = 1$. What is the resulting update map $(q_t, p_t) \mapsto (q_{t+\varepsilon}, p_{t+\varepsilon})$?
Problem
Explain why Euler's method ($q_{t+\varepsilon} = q_t + \varepsilon\, M^{-1} p_t$, $p_{t+\varepsilon} = p_t - \varepsilon \nabla U(q_t)$) is unsuitable for HMC, even though it is simpler than leapfrog.
Problem
In $d$ dimensions, estimate the relative cost (in gradient evaluations) of getting one effective sample from random-walk MH vs. HMC. Assume the target is a standard Gaussian.
References
Canonical:
- Neal, "MCMC Using Hamiltonian Dynamics," Handbook of MCMC (2011)
- Hoffman & Gelman, "The No-U-Turn Sampler" (NUTS, 2014)
Current:
- Betancourt, "A Conceptual Introduction to Hamiltonian Monte Carlo" (2017)
- Stan Development Team, Stan User's Guide
- Robert & Casella, Monte Carlo Statistical Methods (2004), Chapters 3-7
- Brooks et al., Handbook of MCMC (2011), Chapters 1-5
Next Topics
The natural next steps from HMC:
- Riemannian HMC: adapting the mass matrix to local curvature
- Variational inference: the optimization-based alternative to MCMC
- Importance sampling: a non-MCMC approach to estimating expectations, useful when the target is easy to evaluate but hard to sample from directly
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Metropolis-Hastings Algorithm (Layer 2)
- Common Probability Distributions (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)