
No-U-Turn Sampler and Neal's Funnel

NUTS removes HMC's hand-tuned trajectory length, but Neal's funnel shows why geometry still dominates. Centered hierarchical models create narrow necks, divergences, and misleading convergence unless you reparameterize.

Advanced · Tier 2 · Stable · ~60 min

Why This Matters

Hamiltonian Monte Carlo already solves one huge problem in Bayesian computation: random-walk proposals in high dimensions. But standard HMC still asks the user to choose a trajectory length. Too short and it random-walks; too long and it wastes computation by curling back over ground it already explored.

The No-U-Turn Sampler (NUTS) fixes that tuning problem. It grows the trajectory until the simulated path starts turning back on itself, then stops automatically. That is why Stan made NUTS the default sampler.

But NUTS does not make geometry disappear. Neal's funnel is the classic counterexample. A centered hierarchical model can have a narrow neck where latent variables are forced into a tiny region when the scale parameter is small. NUTS will still struggle there, and the struggle will often surface as divergences.

NUTS fixes path-length tuning, but it does not repeal bad geometry

Neal's funnel is the canonical reminder that hierarchical geometry matters more than sampler branding. NUTS automatically stops long trajectories before they double back, but the centered funnel still creates a narrow neck where divergences cluster.

[Figure: centered funnel vs. the same model in non-centered coordinates — the narrow neck (large curvature, where divergent transitions cluster) disappears in the easier coordinates.]

Canonical funnel

A standard form is $y \sim \mathcal N(0, 3^2)$ and $x_i \mid y \sim \mathcal N(0, \exp(y/2))$, so the conditional variance is $\exp(y)$.

What NUTS actually adapts

NUTS removes the need to hand-pick the leapfrog count by growing a binary tree until the trajectory begins to turn back on itself. The geometry can still be bad even when the path-length tuning is good.

Why reparameterization wins

Write $x_i = \exp(y/2)\,\tilde{x}_i$ with $\tilde{x}_i \sim \mathcal N(0,1)$ and the awkward funnel collapses into a much more isotropic base geometry.

Mental Model

Think of NUTS as "HMC with automatic path length." It is still the same basic Hamiltonian system underneath: leapfrog integration, a mass matrix, and a Metropolis accept/reject correction. The automation only decides how long to follow the trajectory before stopping.

Now think of Neal's funnel as a geometry trap. In the wide mouth of the funnel, the latent variables can spread out. In the narrow neck, they are forced close to zero. The local curvature changes violently across the space, so one global step size is always wrong somewhere. That is why divergences cluster near the neck.
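The "one global step size is always wrong somewhere" claim can be made concrete. For a quadratic potential, the leapfrog integrator is stable only when the step size times the local frequency stays below 2, and conditioned on $y$ each $x_i$ sees a quadratic potential with frequency $\exp(-y/2)$. A minimal NumPy sketch (the function name is mine):

```python
import numpy as np

def max_stable_eps(y):
    """Leapfrog stability limit for the conditional x-dynamics at a given y:
    each x_i sees a Gaussian with sd exp(y/2), i.e. frequency exp(-y/2),
    and leapfrog requires eps * frequency < 2."""
    return 2.0 * np.exp(y / 2)

print(max_stable_eps(3.0))   # mouth of the funnel: ~8.96, large steps are fine
print(max_stable_eps(-6.0))  # neck: ~0.0996, the same step size blows up
```

Any single step size tuned for the mouth violates the neck's stability limit by orders of magnitude, which is exactly where the integrator fails.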

Formal Setup

Definition

No-U-Turn Sampler (NUTS)

NUTS is an extension of HMC that adaptively determines the effective number of leapfrog steps. Starting from the current state, it recursively doubles the trajectory in forward and backward time to build a binary tree of candidate states. The tree expansion stops once the trajectory begins to double back on itself, meaning further simulation would mostly retrace the same region.

Definition

Neal's Funnel

One standard $d$-dimensional form of Neal's funnel is

$$y \sim \mathcal N(0, 3^2), \qquad x_i \mid y \sim \mathcal N(0, \exp(y/2)) \quad \text{for } i = 1, \dots, d-1.$$

The second argument of $\mathcal N(\mu, \sigma)$ here is the standard deviation, so the conditional variance of each $x_i$ is $\exp(y)$. When $y$ is large and negative, the $x_i$ variables are squeezed into a tight neck. When $y$ is large and positive, the funnel opens dramatically.
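A short NumPy sketch (names are mine) draws exact samples from this prior and shows the neck directly:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_funnel(n, d=10):
    """Draw n exact samples from Neal's funnel prior in d dimensions."""
    y = rng.normal(0.0, 3.0, size=n)                          # global log-scale
    x = rng.normal(size=(n, d - 1)) * np.exp(y / 2)[:, None]  # sd = exp(y/2)
    return y, x

y, x = sample_funnel(100_000)

# Conditional sd of each x_i is exp(y/2): tiny in the neck, enormous in the mouth.
neck_spread = x[y < -3].std()
mouth_spread = x[y > 3].std()
print(neck_spread, mouth_spread)
```

The two spreads differ by roughly two orders of magnitude, which is the geometry a single step size has to cover.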

Definition

Divergent Transition

A divergent transition is a leapfrog trajectory whose numerical integration error becomes so large that the simulated Hamiltonian flow can no longer be trusted. In practice, divergences are not random warnings. They are localized evidence that the target geometry is forcing the integrator into a regime it cannot resolve with the current parameterization and step size.
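A minimal sketch of how this check looks in code, assuming an identity mass matrix and a fixed energy-error cutoff (Stan's internal check is similar in spirit; the threshold value here is illustrative). The same step size sails through the mouth of the funnel and diverges almost immediately in the neck:

```python
import numpy as np

def neg_log_p_grad(y, x):
    """-log density of Neal's funnel (up to a constant) and its gradient."""
    quad = x @ x
    nlp = y**2 / 18 + 0.5 * np.exp(-y) * quad + 0.5 * x.size * y
    gy = y / 9 - 0.5 * np.exp(-y) * quad + 0.5 * x.size
    gx = np.exp(-y) * x
    return nlp, gy, gx

def diverges(y, x, py, px, eps, n_steps, threshold=1000.0):
    """Leapfrog integrate and flag a divergence when the Hamiltonian drifts
    more than `threshold` above its initial value (an assumed cutoff)."""
    x, px = x.astype(float).copy(), px.astype(float).copy()
    nlp, gy, gx = neg_log_p_grad(y, x)
    h0 = nlp + 0.5 * (py**2 + px @ px)
    for _ in range(n_steps):
        py -= 0.5 * eps * gy; px -= 0.5 * eps * gx   # half momentum kick
        y += eps * py; x = x + eps * px              # full position drift
        nlp, gy, gx = neg_log_p_grad(y, x)
        py -= 0.5 * eps * gy; px -= 0.5 * eps * gx   # half momentum kick
        h = nlp + 0.5 * (py**2 + px @ px)
        if not np.isfinite(h) or h - h0 > threshold:
            return True
    return False

# Same eps, two starting points: fine in the mouth, divergent in the neck.
mouth_div = diverges(2.0, np.ones(9), 0.0, np.zeros(9), eps=0.2, n_steps=10)
neck_div = diverges(-6.0, 0.01 * np.ones(9), 0.0, np.zeros(9), eps=0.2, n_steps=10)
print(mouth_div, neck_div)
```

This is why divergences localize: the energy error is driven by the local stiffness $\exp(-y)$, not by anything global about the chain.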

Main Propositions

Proposition

The NUTS stopping rule adapts path length, not target geometry

Statement

NUTS stops tree expansion once the simulated trajectory begins to reverse direction. A standard stopping check is whether either endpoint momentum points back toward the other endpoint:

$$(q^+ - q^-)^\top M^{-1} p^- < 0 \quad \text{or} \quad (q^+ - q^-)^\top M^{-1} p^+ < 0.$$

This removes manual tuning of the leapfrog count $L$, but it does not change the geometry of the target density itself.

Intuition

If the trajectory has started pointing back toward where it came from, taking more leapfrog steps mostly wastes compute. NUTS detects that automatically. But the same state space, the same curvature, and the same problematic necks are still there.
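The stopping check itself is a few lines of code. A sketch with an optional mass matrix (names are mine):

```python
import numpy as np

def u_turn(q_minus, q_plus, p_minus, p_plus, m_inv=None):
    """NUTS-style stopping check: does either endpoint momentum point back
    toward the opposite end of the trajectory?"""
    dq = q_plus - q_minus
    if m_inv is not None:                       # non-identity mass matrix
        p_minus, p_plus = m_inv @ p_minus, m_inv @ p_plus
    return (dq @ p_minus < 0) or (dq @ p_plus < 0)

q_minus, q_plus = np.zeros(2), np.array([1.0, 0.0])
outward = np.array([1.0, 0.0])      # endpoints still moving apart
reversed_p = np.array([-1.0, 0.0])  # forward endpoint has turned around
print(u_turn(q_minus, q_plus, outward, outward))     # False: keep doubling
print(u_turn(q_minus, q_plus, outward, reversed_p))  # True: stop expansion
```

Note that nothing in this check looks at curvature or density values: it is purely a path-length rule, which is the point of the proposition.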

Why It Matters

This is the clean way to think about NUTS. It is a major usability and efficiency improvement over fixed-length HMC, but it is not a cure for bad hierarchical parameterizations.

Proposition

Non-centered coordinates undo the funnel neck

Statement

Define new latent variables $\tilde{x}_i \sim \mathcal N(0,1)$ and map them to the original coordinates by

$$x_i = \exp(y/2)\,\tilde{x}_i.$$

In the non-centered coordinates $(y, \tilde{x})$, the prior factorizes into a product of standard normals. The severe neck geometry disappears from the base coordinates, making HMC and NUTS much easier to tune.

Intuition

The centered model stores scale and local variation in the same coordinates. When the scale collapses, the local variables must collapse with it. The non-centered parameterization separates those roles: the $\tilde{x}_i$ stay in a standard normal cloud, and the exponential scaling is applied only when you map back to the original model coordinates.

Proof Sketch

Substitute $x_i = \exp(y/2)\,\tilde{x}_i$ into the centered prior. The Gaussian term for $x_i \mid y$ reduces to a standard normal term in $\tilde{x}_i$, and the Jacobian factor is exactly canceled by the scale term in the Gaussian density. The resulting base prior is a product of a normal density in $y$ and standard normal densities in the $\tilde{x}_i$.
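The cancellation is easy to verify numerically. A sketch (names are mine) checks that the centered log density of $(y, x)$ equals the non-centered log density of $(y, \tilde{x})$ plus the log-Jacobian of the map $x = \exp(y/2)\,\tilde{x}$:

```python
import numpy as np

def norm_logpdf(z, sd):
    """log N(z | 0, sd), summed over elements."""
    return np.sum(-0.5 * np.log(2 * np.pi * sd**2) - z**2 / (2 * sd**2))

rng = np.random.default_rng(0)
y = rng.normal(0.0, 3.0)
x_tilde = rng.normal(size=5)
x = np.exp(y / 2) * x_tilde              # map back to centered coordinates

centered = norm_logpdf(y, 3.0) + norm_logpdf(x, np.exp(y / 2))
noncentered = norm_logpdf(y, 3.0) + norm_logpdf(x_tilde, 1.0)
log_jac = -x_tilde.size * y / 2          # log|d x_tilde / d x| = -y/2 per coord

# Change of variables: the Jacobian exactly absorbs the scale term.
print(np.isclose(centered, noncentered + log_jac))  # True
```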

Why It Matters

This is the canonical example of a modeling fix beating a sampler fix. Reparameterization often improves posterior geometry more than any amount of sampler retuning.

Failure Mode

Non-centering is not universally better. When the data strongly identify the local latent variables, the centered parameterization can mix better. The choice depends on how much the posterior is dominated by the prior hierarchy versus the likelihood.

What Divergences Really Mean

A common mistake is to read divergences as mere implementation noise. They are usually telling you something structural:

  • the posterior has regions of extremely high curvature,
  • the leapfrog step size cannot resolve those regions everywhere,
  • and the chain is failing exactly where the geometry is hardest.

So the right response is rarely "just run longer." It is usually:

  1. inspect where the divergences land,
  2. reparameterize,
  3. strengthen weak priors,
  4. and only then revisit sampler tuning.

Centered vs Non-Centered Hierarchies

The funnel appears whenever a global scale parameter governs many local variables. This is not an artificial pathology. Hierarchical normals, multilevel regressions, latent-variable models, and many Bayesian neural network priors all create the same tension.

Centered parameterization stores the local variables directly:

$$x_i \mid \sigma \sim \mathcal N(0, \sigma).$$

Non-centered parameterization stores a standard normal base variable and then rescales it:

$$\tilde{x}_i \sim \mathcal N(0,1), \qquad x_i = \sigma\,\tilde{x}_i.$$

The posterior decides which version is better. Weak-data hierarchies usually prefer the non-centered form; strong-data hierarchies can prefer the centered form.
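One way to see why the geometry differs is to look at the gradient the sampler feels. In the centered form the stiffness of the local coordinates scales like $1/\sigma^2$, so it explodes as the scale shrinks; in the non-centered form it is constant. A minimal sketch (names are mine):

```python
import numpy as np

def centered_grad(sigma, x):
    """Gradient of -log N(x | 0, sigma) wrt x: stiffness scales as 1/sigma^2."""
    return x / sigma**2

def noncentered_grad(x_tilde):
    """Gradient of -log N(x_tilde | 0, 1) wrt x_tilde: stiffness is constant."""
    return x_tilde

# Evaluate both at "half a standard deviation" in their own coordinates.
for sigma in (1.0, 0.1, 0.01):
    print(sigma, centered_grad(sigma, 0.5 * sigma), noncentered_grad(0.5))
```

As $\sigma$ drops from 1.0 to 0.01, the centered gradient at the same relative position grows from 0.5 to 50, while the non-centered gradient stays at 0.5.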

For the direct side-by-side decision rule, see Centered vs. Non-Centered Hierarchical Models.

Common Confusions

Watch Out

NUTS is not just HMC with fewer knobs

It is true that NUTS removes the leapfrog-count knob. But the mass matrix, step size adaptation, and parameterization still matter a great deal. Treating NUTS as a universal turnkey fix hides the geometry problem.

Watch Out

A good R-hat does not erase divergences

Chains can agree with each other while still missing important regions of the target geometry. Divergences are a local geometric warning; $\hat{R}$ is a global mixing diagnostic. They answer different questions.

Watch Out

Neal's funnel is not a toy problem with no real analogue

It is a distilled version of a real hierarchical pathology. Whenever a global scale controls many local latent variables, the same neck-versus-mouth geometry can appear in practice.

Exercises

ExerciseCore

Problem

Why does a centered hierarchical normal model become hard for HMC when the global scale gets very small?

ExerciseAdvanced

Problem

Why can NUTS still show many divergences even after warmup chooses a reasonable step size and mass matrix?

ExerciseResearch

Problem

Suppose a hierarchical model shows persistent divergences under NUTS. How would you decide whether the right move is non-centering, stronger priors, or a more fundamental model rewrite?


Next Topics

This page is the geometry-first sequel to HMC.

Last reviewed: April 25, 2026
