Monte Carlo Methods
Approximate expectations by sampling: the Monte Carlo estimator, its variance, the $1/\sqrt{N}$ convergence rate, and the variance-reduction tricks that make practical Bayesian inference, REINFORCE, and ELBO estimation work.
Why This Matters
Almost every quantity a Bayesian, an RL agent, or a variational autoencoder cares about is an expectation. The posterior predictive $\mathbb{E}_{p(\theta \mid \mathcal{D})}[p(y^* \mid \theta)]$, the policy gradient $\mathbb{E}_{\tau \sim \pi_\theta}[\nabla_\theta \log \pi_\theta(\tau)\, R(\tau)]$, the evidence lower bound gradient $\nabla_\phi \mathbb{E}_{q_\phi(z)}[\log p(x, z) - \log q_\phi(z)]$: all of them are integrals against high-dimensional distributions with no closed form. Monte Carlo is the move that makes them tractable. You cannot compute the integral, so you draw samples and average.
The pedagogical point is that this works for one specific reason. The sample mean is an unbiased estimator with variance that decays like $1/N$, and the central limit theorem promises $O(N^{-1/2})$-rate convergence regardless of dimension. That dimension-free rate is what separates Monte Carlo from quadrature methods, whose grid sizes blow up exponentially. Every interesting Monte Carlo trick afterwards is a variance-reduction device: importance sampling, control variates, antithetic variates, Rao-Blackwellization, stratification, and all the way out to MCMC and variational methods.
Mental Model
You want the average value of some function $f$ under a probability distribution $p$. Forget closed-form integration; just generate independent draws $x_1, \dots, x_N \sim p$ and compute the empirical average $\hat{\mu}_N = \frac{1}{N}\sum_{i=1}^N f(x_i)$. The estimate $\hat{\mu}_N$ is random because the samples are random, but it concentrates around the true expectation $\mu = \mathbb{E}_p[f(x)]$ as $N$ grows.
Two things govern its quality. The bias of $\hat{\mu}_N$ is zero when the samples come from $p$. The variance of $\hat{\mu}_N$ shrinks like $\sigma^2/N$, where $\sigma^2 = \mathrm{Var}_p[f(x)]$ is the population variance of $f$. So the error decays as $O(N^{-1/2})$, which is the universal Monte Carlo rate. To halve the error you quadruple the sample count, period.
Variance reduction is everything that bends this constant. You cannot change the rate without leaving iid sampling, so you change $\sigma^2$ instead.
Formal Setup and Notation
Let $p$ be a probability distribution on a measurable space $\mathcal{X}$ and let $f : \mathcal{X} \to \mathbb{R}$ be an integrable function. The quantity of interest is $\mu = \mathbb{E}_p[f(x)] = \int_{\mathcal{X}} f(x)\, p(x)\, dx$.
Monte Carlo Estimator
Given $N$ iid samples $x_1, \dots, x_N \sim p$, the (plain) Monte Carlo estimator of $\mu$ is
$$\hat{\mu}_N = \frac{1}{N} \sum_{i=1}^N f(x_i).$$
It is a function of the random samples, so it is itself a random variable. We want to know: how close is $\hat{\mu}_N$ to $\mu$, and how does that closeness scale with $N$?
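The estimator is a one-liner. A minimal sketch (NumPy assumed; the integrand $f(x) = x^2$ under a standard normal, whose exact expectation is 1, is an illustrative choice, not from the text):

```python
import numpy as np

rng = np.random.default_rng(0)

def mc_estimate(f, sampler, n):
    """Plain Monte Carlo: average f over n iid draws from the sampler."""
    x = sampler(n)
    return f(x).mean()

# Illustrative check: E[x^2] under N(0,1) is exactly 1.
est = mc_estimate(lambda x: x**2, lambda n: rng.standard_normal(n), 100_000)
```

With $10^5$ samples the estimate lands within a fraction of a percent of the truth, consistent with the $\sigma/\sqrt{N}$ error bar derived below.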
Unbiased Estimator
An estimator $\hat{\theta}$ of a parameter $\theta$ is unbiased if $\mathbb{E}[\hat{\theta}] = \theta$. The plain Monte Carlo estimator is unbiased: $\mathbb{E}[\hat{\mu}_N] = \frac{1}{N}\sum_{i=1}^N \mathbb{E}[f(x_i)] = \mu$ by linearity of expectation.
Main Theorems
Monte Carlo Estimator: Unbiasedness and Variance
Statement
Let $x_1, \dots, x_N \stackrel{\text{iid}}{\sim} p$ and let $\sigma^2 = \mathrm{Var}_p[f(x)] < \infty$. Then:
- The estimator is unbiased: $\mathbb{E}[\hat{\mu}_N] = \mu$.
- Its variance is $\mathrm{Var}[\hat{\mu}_N] = \sigma^2 / N$.
- Its root mean squared error is $\mathrm{RMSE}(\hat{\mu}_N) = \sigma / \sqrt{N}$.
Intuition
Linearity of expectation makes the bias claim immediate. For the variance, independence is the only thing that matters: variances of independent summands add, so $\mathrm{Var}\big[\sum_{i=1}^N f(x_i)\big] = N\sigma^2$, and the $1/N$ prefactor enters squared, giving $N\sigma^2 / N^2 = \sigma^2 / N$.
The square root in the RMSE is the headline rate. It is dimension-free because nothing in the proof references the dimension of $\mathcal{X}$. That is why Monte Carlo beats grid-based quadrature in high dimensions even though it is much weaker per sample.
Proof Sketch
By linearity, $\mathbb{E}[\hat{\mu}_N] = \frac{1}{N}\sum_{i=1}^N \mathbb{E}[f(x_i)] = \mu$.
For the variance, independence gives $\mathrm{Var}[\hat{\mu}_N] = \frac{1}{N^2}\sum_{i=1}^N \mathrm{Var}[f(x_i)] = \frac{\sigma^2}{N}$, and since the estimator is unbiased, $\mathrm{RMSE} = \sqrt{\mathrm{Var}[\hat{\mu}_N]} = \sigma/\sqrt{N}$.
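The $\sigma^2/N$ formula can be checked empirically by repeating the whole estimator many times and measuring the spread of the repeated estimates (a sketch with NumPy; $f(x) = x^2$ under a standard normal, for which $\sigma^2 = \mathrm{Var}[x^2] = 2$, is an illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(1)
N, reps = 1_000, 2_000

# Repeat the N-sample estimator `reps` times and look at its spread.
estimates = np.array([(rng.standard_normal(N) ** 2).mean() for _ in range(reps)])

empirical_var = estimates.var()
predicted_var = 2.0 / N  # sigma^2 / N with sigma^2 = Var[x^2] = 2
```

The empirical variance of the estimator matches the $\sigma^2/N$ prediction to within sampling noise.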
Why It Matters
Every variance-reduction trick in Monte Carlo aims at the constant $\sigma^2$, not the rate. You cannot beat $O(N^{-1/2})$ with iid samples (this is the content of the central limit theorem); you can only reduce the variance of the per-sample estimator. That reframes the whole subject: control variates, antithetic variates, importance sampling, stratification, and Rao-Blackwellization are all tactics for choosing a different estimator of the same expectation, with the same unbiasedness, but smaller variance.
Failure Mode
Heavy-tailed integrands break this. If $f$ has infinite variance under $p$ (for example $f(x) = 1/x$ under a distribution with mass near zero), the variance bound is vacuous and the empirical average can have fluctuations that do not shrink. Robustness requires checking that $\mathbb{E}_p[f(x)^2] < \infty$, not just $\mathbb{E}_p[|f(x)|] < \infty$.
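A quick heavy-tail demonstration (illustrative distribution, not from the text): a Pareto with shape $\alpha = 1.5$ has a finite mean of $\alpha/(\alpha - 1) = 3$ but infinite variance, and individual draws hundreds or thousands of times the mean routinely dominate the running average.

```python
import numpy as np

rng = np.random.default_rng(4)
alpha, n = 1.5, 1_000_000

# Pareto(x_m=1, alpha=1.5): mean = alpha/(alpha-1) = 3 is finite,
# but the variance is infinite (alpha < 2).
x = 1.0 + rng.pareto(alpha, size=n)  # numpy's pareto is the Lomax form; shift by 1

running_mean = np.cumsum(x) / np.arange(1, n + 1)
biggest = x.max()  # single draws far beyond the mean of 3 dominate
```

The law of large numbers still applies (the mean is finite), but convergence is far slower than $O(N^{-1/2})$, and any plug-in standard error computed from these samples is meaningless.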
Asymptotic Distribution of the Monte Carlo Estimator
Statement
Under the same conditions, the law of large numbers gives $\hat{\mu}_N \to \mu$ almost surely, and the central limit theorem gives
$$\sqrt{N}\,(\hat{\mu}_N - \mu) \xrightarrow{d} \mathcal{N}(0, \sigma^2).$$
In words: the rescaled error is asymptotically Gaussian with variance $\sigma^2$, independent of dimension.
Intuition
The CLT is the source of the $O(N^{-1/2})$ rate. It tells you the shape of the Monte Carlo error in addition to its size: roughly Gaussian, mean zero, scale $\sigma/\sqrt{N}$. This is what justifies the standard confidence interval $\hat{\mu}_N \pm 1.96\,\hat{\sigma}/\sqrt{N}$ that appears in every Monte Carlo report.
Importantly, the CLT gives a confidence interval that is automatic in dimension. It does not care whether $\mathcal{X}$ is $\mathbb{R}$ or $\mathbb{R}^{1000}$. Grid-based numerical integration cannot make this claim.
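The interval is computed from the same samples that produce the estimate. A sketch (NumPy; the Exp(1) mean, exactly 1, is an illustrative target):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.exponential(size=50_000)  # illustrative target: E[x] = 1 under Exp(1)

mu_hat = x.mean()
se = x.std(ddof=1) / np.sqrt(len(x))  # plug-in standard error sigma_hat / sqrt(N)
ci = (mu_hat - 1.96 * se, mu_hat + 1.96 * se)  # asymptotic 95% interval
```

Nothing here referenced the dimension of the sample space; the same three lines work for a 1000-dimensional integrand.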
Why It Matters
The CLT lets you report Monte Carlo answers with calibrated uncertainty. The same CLT is also the starting point for the diffusion-limit analyses of MCMC scaling (Roberts-Gelman-Gilks 1997 for random-walk MH and Beskos-Pillai-Roberts-Sanz-Serna-Stuart 2013 for HMC), where the optimal acceptance rates and step sizes are the ones that match the diffusion's CLT.
Failure Mode
The CLT requires a finite second moment. If $\mathbb{E}_p[f(x)^2] = \infty$, $\hat{\mu}_N$ can converge much more slowly (or to a stable distribution other than the Gaussian, with infinite-variance fluctuations). Importance sampling with heavy-tailed weights is the canonical setting where this matters: the nominal $O(N^{-1/2})$ rate hides infinite-variance estimators, and reported standard errors become meaningless.
Why $O(N^{-1/2})$ Is Dimension-Free
The $\sigma/\sqrt{N}$ result does not contain a dimension. Compare to deterministic quadrature: the trapezoidal rule on a $d$-dimensional grid with $N$ total nodes has error $O(N^{-2/d})$, which collapses past $d \approx 4$. That is the famous curse of dimensionality for quadrature.
Monte Carlo trades that curse for a different one. The constant $\sigma$ can depend badly on dimension, especially when $p$ is concentrated on a thin manifold of $\mathbb{R}^d$. Importance sampling in high dimensions is the canonical counterexample: even though the rate stays $O(N^{-1/2})$, the variance can scale exponentially in $d$ when proposal and target diverge, so the practical sample size has to grow exponentially anyway. The lesson is twofold: the rate is dimension-free, the constant is not, and the entire craft of Monte Carlo is keeping that constant small.
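A quick illustration of the dimension-free rate (NumPy sketch; the 10-dimensional unit ball is an illustrative integrand, not from the text): 200,000 uniform samples in $[-1,1]^{10}$ estimate the ball's volume to a few percent, while a grid with the same node budget would have fewer than 4 points per axis.

```python
import numpy as np
from math import pi

rng = np.random.default_rng(3)
d, N = 10, 200_000

x = rng.uniform(-1.0, 1.0, size=(N, d))
inside = (x ** 2).sum(axis=1) <= 1.0

vol_est = inside.mean() * 2.0 ** d  # fraction of the cube inside the ball
vol_true = pi ** 5 / 120            # closed form for the unit 10-ball, ~2.55

# Same budget spent on a grid: only N**(1/d) ~ 3.4 points per axis.
points_per_axis = N ** (1.0 / d)
```

The relative error here is governed by the hit probability (about 0.25%), which is exactly the "constant is not dimension-free" caveat: in higher dimensions the ball occupies a vanishing fraction of the cube and the constant blows up.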
Variance Reduction: Tactics for Shrinking $\sigma^2$
Importance Sampling
Sample from a proposal $q$ instead of $p$, then reweight. Define the importance weight $w(x) = p(x)/q(x)$, requiring $q(x) > 0$ whenever $p(x) f(x) \neq 0$. The estimator
$$\hat{\mu}_N^{\mathrm{IS}} = \frac{1}{N} \sum_{i=1}^N w(x_i)\, f(x_i), \qquad x_i \sim q,$$
is unbiased for $\mu$. Its variance equals $\frac{1}{N}\,\mathrm{Var}_q[w(x) f(x)]$, which can be much smaller than $\sigma^2/N$ when $q$ is well chosen, but much larger when it is not. See Importance Sampling for the full treatment, including effective sample size and weight degeneracy.
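A sketch of the classic tail-probability use case (illustrative, not from the text): estimating $P(X > 4)$ for $X \sim \mathcal{N}(0,1)$. Plain Monte Carlo almost never sees a hit, but a proposal shifted into the tail works well.

```python
import numpy as np
from math import erfc, sqrt

rng = np.random.default_rng(5)
N = 100_000

# Target p = N(0,1); quantity of interest mu = P(X > 4) ~ 3.2e-5.
# Proposal q = N(4,1) puts its mass where the integrand lives.
x = rng.normal(4.0, 1.0, size=N)
log_w = -0.5 * x**2 + 0.5 * (x - 4.0) ** 2  # log p(x) - log q(x); normalizers cancel
est = np.mean((x > 4.0) * np.exp(log_w))

truth = 0.5 * erfc(4.0 / sqrt(2.0))  # exact Gaussian tail probability
```

The importance-sampling estimate is accurate to about a percent with $10^5$ draws; a plain Monte Carlo run of the same size would typically see only a handful of hits.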
Control Variates
Let $h$ be a function with known expectation $\eta = \mathbb{E}_p[h(x)]$. The control-variate estimator
$$\hat{\mu}_N^{\mathrm{CV}} = \frac{1}{N} \sum_{i=1}^N \big( f(x_i) - \beta\,(h(x_i) - \eta) \big)$$
is unbiased for $\mu$ for any $\beta$. The variance-minimizing choice is $\beta^* = \mathrm{Cov}[f, h]/\mathrm{Var}[h]$, achieving variance $(1 - \rho^2)\,\sigma^2/N$, where $\rho$ is the correlation between $f(x)$ and $h(x)$.
Strongly correlated controls give arbitrarily large variance reduction. This is why control variates underpin practical policy-gradient estimators (REINFORCE with a learned baseline) and ELBO gradient estimators (the score-function gradient with a baseline).
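A minimal control-variate sketch (illustrative integrand, not from the text): $f(U) = e^U$ for $U \sim \mathrm{Uniform}(0,1)$ with control $h(U) = U$, whose mean $1/2$ is known exactly; the two are correlated at $\rho \approx 0.99$, so the variance drops by well over an order of magnitude.

```python
import numpy as np

rng = np.random.default_rng(6)
N = 50_000

u = rng.uniform(size=N)
f = np.exp(u)  # target: E[e^U] = e - 1
h = u          # control variate with known mean eta = 1/2

beta = np.cov(f, h)[0, 1] / h.var(ddof=1)  # estimated optimal coefficient
plain = f.mean()
cv = (f - beta * (h - 0.5)).mean()

var_ratio = (f - beta * (h - 0.5)).var() / f.var()  # ~ 1 - rho^2
```

Estimating $\beta^*$ from the same samples introduces a negligible $O(1/N)$ bias; in practice (and in learned baselines) this is ignored.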
Antithetic Variates
For any monotone function $f$ and a uniform $U \sim \mathrm{Uniform}(0, 1)$, the pair $\big(f(U), f(1 - U)\big)$ is negatively correlated. Averaging $\frac{1}{2}\big(f(U) + f(1 - U)\big)$ over uniform draws produces an unbiased estimator with variance no larger than the iid version, often strictly smaller.
Generalizes to any setting where you can pair a sample with a deterministic "opposite" that has the same distribution but is anti-correlated in $f$.
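The same illustrative integrand $e^U$ gives a quick antithetic sketch: pairing $U$ with $1 - U$ makes $e^U$ and $e^{1-U}$ strongly anti-correlated, so the paired average is far tighter than an iid pair at the same cost.

```python
import numpy as np

rng = np.random.default_rng(7)
N = 50_000

u = rng.uniform(size=N)

# iid pairs vs antithetic pairs for E[e^U] = e - 1.
iid_pairs = 0.5 * (np.exp(rng.uniform(size=N)) + np.exp(rng.uniform(size=N)))
anti_pairs = 0.5 * (np.exp(u) + np.exp(1.0 - u))

est = anti_pairs.mean()
var_ratio = anti_pairs.var() / iid_pairs.var()  # well below 1
```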
Rao-Blackwellization
Rao-Blackwellization replaces $f(x_i)$ by its conditional expectation $\mathbb{E}[f(x) \mid t(x_i)]$ for some sufficient statistic or partial structure $t$. By the Rao-Blackwell theorem (an application of the law of total variance), the new estimator has equal mean and weakly smaller variance. In practice this is the trick that makes mixture-model Gibbs samplers efficient: integrate out conjugate parts analytically, sample only the parts you have to.
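A sketch with a two-component Gaussian mixture (all numbers are illustrative assumptions): to estimate $\mathbb{E}[x]$, replace each raw sample by its known conditional mean given the component label, which strips out the within-component noise entirely.

```python
import numpy as np

rng = np.random.default_rng(8)
N = 20_000

means, probs, sd = np.array([-1.0, 3.0]), np.array([0.3, 0.7]), 2.0

z = rng.choice(2, size=N, p=probs)  # component labels
x = rng.normal(means[z], sd)        # mixture samples; true mean = 1.8

naive = x.mean()      # average the raw samples
rb = means[z].mean()  # Rao-Blackwellized: E[x | z] is known analytically

var_ratio = means[z].var() / x.var()  # conditioning removes the sd^2 term
```

By the law of total variance the per-sample variance drops from $\mathrm{sd}^2 + \mathrm{Var}[\mu_z]$ to $\mathrm{Var}[\mu_z]$ alone; both estimators target the same mean of 1.8.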
Where Monte Carlo Lives in ML
The foundations of three ML subjects collapse to Monte Carlo estimation once you look closely.
Bayesian inference. Posterior expectations $\mathbb{E}_{p(\theta \mid \mathcal{D})}[g(\theta)]$ are exactly what MCMC approximates. The Monte Carlo average over chain samples is the same $\hat{\mu}_N$ as above, except the samples are no longer iid. The CLT now requires geometric ergodicity of the chain instead of independence, and the asymptotic variance picks up an autocorrelation factor (the integrated autocorrelation time).
REINFORCE / policy gradients. The policy gradient is a Monte Carlo expectation. Evaluating it requires sampling trajectories. Variance reduction here is not optional; raw REINFORCE has so much variance that learning is impractical. Control variates ("baselines") and generalized advantage estimation are the standard fixes.
ELBO gradients. For variational autoencoders, the ELBO gradient is again a Monte Carlo expectation over the variational distribution. The reparameterization trick, which writes $z = g_\phi(\epsilon)$ with $\epsilon$ sampled from a fixed noise distribution, is what gives a low-variance gradient estimator; without it, the score-function form is too noisy to train.
The unifying lesson: anywhere expectations are intractable, an unbiased-and-low-variance Monte Carlo estimator is what makes the gradient useful for stochastic optimization.
Common Confusions
More samples do not always make Monte Carlo good
The $O(N^{-1/2})$ rate is asymptotic and assumes finite variance. If $\sigma^2$ is huge or infinite, $N$ may need to be astronomically large for the estimator to be useful. Diagnostic: check that the running estimate of $\sigma^2$ stabilizes as $N$ grows. If it does not, the population variance is probably not finite and confidence intervals are unreliable.
Unbiased and consistent are not the same
The plain Monte Carlo estimator is both unbiased ($\mathbb{E}[\hat{\mu}_N] = \mu$ for every $N$) and consistent ($\hat{\mu}_N \to \mu$ as $N \to \infty$). But some Monte Carlo estimators are biased and consistent (self-normalized importance sampling), and some are unbiased but inconsistent (degenerate proposal distributions). Bayesian inference workflows rely heavily on biased-but-consistent self-normalized importance sampling, so the distinction matters.
Variance reduction does not change the rate
Control variates and antithetic variates do not turn $O(N^{-1/2})$ into $O(N^{-1})$. They reduce the constant $\sigma$ in $\sigma/\sqrt{N}$. The rate is a consequence of the CLT and is fixed for iid sampling. The only way to beat $O(N^{-1/2})$ is to leave iid sampling: quasi-Monte Carlo (low-discrepancy sequences) achieves $O(N^{-1} (\log N)^d)$ for smooth integrands in moderate dimensions, and multi-level Monte Carlo can also achieve faster rates under structural assumptions.
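A quasi-Monte Carlo sketch (assumes SciPy ≥ 1.7 for `scipy.stats.qmc`; the smooth integrand $f(u, v) = uv$ with exact mean $1/4$ is an illustrative choice): a scrambled Sobol' sequence typically lands much closer than iid sampling at the same budget.

```python
import numpy as np
from scipy.stats import qmc

rng = np.random.default_rng(12)
m = 12  # 2^m = 4096 points

sobol = qmc.Sobol(d=2, scramble=True, seed=12)
pts = sobol.random_base2(m=m)  # low-discrepancy points in [0,1)^2
qmc_est = (pts[:, 0] * pts[:, 1]).mean()

iid = rng.uniform(size=(2 ** m, 2))
mc_est = (iid[:, 0] * iid[:, 1]).mean()
# exact answer: E[UV] = 1/4
```

The QMC points are not iid, so the plain CLT does not apply to them; scrambling restores unbiasedness and allows error estimation across independent scrambles.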
Monte Carlo is not the same as MCMC
Plain Monte Carlo requires that you can already sample from $p$. When you cannot — the typical case in Bayesian posterior inference, where $p$ is known only up to a normalizing constant — you need MCMC. MCMC is Monte Carlo with correlated samples generated by a Markov chain whose stationary distribution is $p$. The variance analysis carries over but with an autocorrelation correction.
Canonical Examples
Estimating $\pi$ by hit-or-miss
Let $(u, v) \sim \mathrm{Uniform}([0, 1]^2)$ and define $f(u, v) = \mathbf{1}\{u^2 + v^2 \le 1\}$. Then $\mu = \mathbb{E}[f] = \pi/4$, so $4\hat{\mu}_N$ converges to $\pi$.
The estimator variance is $\mu(1 - \mu)/N$ with $\mu = \pi/4 \approx 0.785$, giving RMSE $4\sqrt{\mu(1-\mu)/N} \approx 1.64/\sqrt{N}$ for $4\hat{\mu}_N$. To get $\pi$ to 3 decimal places (RMSE $\approx 5 \times 10^{-4}$) requires $N \approx 10^7$. This is the canonical example because it shows the rate concretely; it is also a terrible way to compute $\pi$ in practice.
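The hit-or-miss estimator and its predicted RMSE, as a runnable sketch (NumPy):

```python
import numpy as np

rng = np.random.default_rng(9)
N = 1_000_000

u, v = rng.uniform(size=N), rng.uniform(size=N)
hits = u**2 + v**2 <= 1.0  # indicator of landing inside the quarter circle

pi_hat = 4.0 * hits.mean()
mu = np.pi / 4
rmse_pred = 4.0 * np.sqrt(mu * (1 - mu) / N)  # ~1.6e-3 at N = 10^6
```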
Posterior expectation under a Gaussian prior and likelihood
Suppose $\theta \sim \mathcal{N}(0, 1)$ and $y \mid \theta \sim \mathcal{N}(\theta, 1)$ with one observation $y = 2$. The posterior is $\theta \mid y \sim \mathcal{N}(1, 1/2)$ (a closed form here). To estimate $\mathbb{E}[\theta^3 \mid y]$, draw $\theta_i \sim \mathcal{N}(1, 1/2)$ for $i = 1, \dots, N$ and average $\theta_i^3$.
The exact answer is $m^3 + 3 m s^2 = 1 + 1.5 = 2.5$ (third raw moment of $\mathcal{N}(1, 1/2)$). The Monte Carlo answer with $N = 10^4$ gets within about $\pm 0.08$ with 95% confidence. Doubling the precision requires $4N$ samples.
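The same computation as a runnable sketch (NumPy; the $\mathcal{N}(1, 1/2)$ posterior and the $N = 10^5$ budget are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(10)
N = 100_000

# Posterior N(m=1, s^2=1/2) from prior N(0,1), likelihood N(theta,1), one obs y=2.
m, s2 = 1.0, 0.5
theta = rng.normal(m, np.sqrt(s2), size=N)

est = (theta**3).mean()
exact = m**3 + 3 * m * s2  # Gaussian third raw moment: m^3 + 3 m s^2 = 2.5
```

This is the whole Bayesian-computation pattern in miniature: sampling from the posterior plus an average replaces an integral that only happens to have a closed form in this toy case.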
REINFORCE for a contextual bandit
A logistic policy $\pi_\theta(a \mid s)$ in a contextual bandit with reward $r(s, a)$. The policy-gradient estimator is
$$\hat{g} = \frac{1}{N} \sum_{i=1}^N \nabla_\theta \log \pi_\theta(a_i \mid s_i)\,\big(r_i - b(s_i)\big),$$
where $b(s)$ is a control variate (a baseline, often an estimate of the expected reward). Subtracting $b(s)$ does not change the gradient in expectation (since $\mathbb{E}_{a \sim \pi_\theta(\cdot \mid s)}[\nabla_\theta \log \pi_\theta(a \mid s)] = 0$) but reduces variance by a factor of $1 - \rho^2$, where $\rho$ is the correlation between the reward and the baseline term.
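A stripped-down sketch (an illustrative two-action bandit with a single state, so the logistic policy reduces to a Bernoulli over actions; all numbers are assumptions, not from the text): the baseline leaves the gradient estimate unbiased while collapsing its variance.

```python
import numpy as np

rng = np.random.default_rng(11)
N = 100_000

theta = 0.0                      # policy logit; pi(a=1) = sigmoid(theta) = 0.5
p1 = 1.0 / (1.0 + np.exp(-theta))
rewards = np.array([1.0, 2.0])   # deterministic reward per action

a = (rng.uniform(size=N) < p1).astype(int)
r = rewards[a]
score = a - p1                   # d/dtheta log pi(a) for a sigmoid policy

grad_raw = score * r                # REINFORCE samples; mean = true gradient
baseline = r.mean()                 # constant baseline ~ E[r]
grad_base = score * (r - baseline)  # same mean, far smaller variance

true_grad = p1 * (1 - p1) * (rewards[1] - rewards[0])  # = 0.25
```

Both estimators average to the true gradient of 0.25; subtracting the baseline removes the reward offset that the score multiplies, which in this toy case eliminates almost all of the variance.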
Exercises
Problem
Let $x_1, \dots, x_N \stackrel{\text{iid}}{\sim} p$ with mean $\mu$ and variance $\sigma^2$. Show that the estimator $s^2 = \frac{1}{N-1} \sum_{i=1}^N (x_i - \hat{\mu}_N)^2$ is an unbiased estimator of $\sigma^2$, and explain why the $N - 1$ rather than $N$ correction is needed.
Problem
A control variate $h$ has known mean $\eta = \mathbb{E}_p[h(x)]$ and correlation $\rho$ with $f(x)$. Derive the optimal coefficient $\beta^*$ and the resulting variance reduction factor $1 - \rho^2$.
Problem
You want to estimate $\mu = \mathbb{E}[e^{\lambda x}]$ where $x \sim \mathcal{N}(0, 1)$. The exact answer is $e^{\lambda^2/2}$. Show that the relative variance of the plain Monte Carlo estimator scales like $e^{\lambda^2}/N$, and conclude that for $\lambda = 3$ you need $N \approx 10^8$ samples to match a 1% relative error.
References
Canonical:
- Robert & Casella, Monte Carlo Statistical Methods (2nd ed., 2004), Chapters 1-3. The standard graduate-level reference for plain Monte Carlo, importance sampling, and variance reduction.
- Owen, Monte Carlo Theory, Methods and Examples (2013, online at statweb.stanford.edu/~owen/mc/), Chapters 1-9. Free, comprehensive, with a clear treatment of estimator efficiency.
- Liu, Monte Carlo Strategies in Scientific Computing (2001), Chapters 1-4.
Statistical foundations:
- Casella & Berger, Statistical Inference (2nd ed., 2002), Chapter 5 (sampling distributions, unbiased estimation), Chapter 7 (point estimation, Rao-Blackwell).
- Lehmann & Casella, Theory of Point Estimation (2nd ed., 1998), Chapters 1-2.
ML applications:
- Murphy, Probabilistic Machine Learning: Advanced Topics (2023), Chapter 11 (Monte Carlo inference).
- Bishop, Pattern Recognition and Machine Learning (2006), Chapter 11 (sampling methods).
- Sutton & Barto, Reinforcement Learning (2nd ed., 2018), Chapter 13 (policy-gradient methods, REINFORCE, baselines as control variates).
- Kingma & Welling, "Auto-Encoding Variational Bayes" (2013, arXiv:1312.6114). Source paper for the reparameterization trick as a variance-reduction device for ELBO gradient estimators.
Next Topics
The natural next steps from plain Monte Carlo:
- Importance Sampling: when you can't sample from $p$ but can evaluate it, sample from a proposal $q$ and reweight. The cornerstone of off-policy RL and variational diagnostics.
- Rejection Sampling: produce exact iid samples from $p$ when an envelope is available. The pedagogical bridge to MCMC.
- Markov Chain Monte Carlo: relax iid sampling. Build a Markov chain whose stationary distribution is the target.
- Metropolis-Hastings: the foundational MCMC construction for sampling unnormalized posteriors.
Last reviewed: May 6, 2026