Monte Carlo Methods
Approximate expectations by sampling: the Monte Carlo estimator, its variance, the $1/\sqrt{N}$ convergence rate, and the variance-reduction tricks that make practical Bayesian inference, REINFORCE, and ELBO estimation work.
Why This Matters
Almost every quantity a Bayesian, an RL agent, or a variational autoencoder cares about is an expectation. The posterior predictive $\mathbb{E}_{p(\theta \mid \mathcal{D})}[p(y^* \mid \theta)]$, the policy gradient $\mathbb{E}_{\tau \sim \pi_\theta}[\nabla_\theta \log \pi_\theta(\tau)\, R(\tau)]$, the evidence lower bound gradient $\nabla_\phi \mathbb{E}_{q_\phi(z)}[\log p(x, z) - \log q_\phi(z)]$: all of them are integrals against high-dimensional distributions with no closed form. Monte Carlo is the move that makes them tractable. You cannot compute the integral, so you draw samples and average.
The pedagogical point is that this works for one specific reason. The sample mean is an unbiased estimator with variance that decays like $1/N$, and the central limit theorem promises $O(N^{-1/2})$-rate convergence regardless of dimension. That dimension-free rate is what separates Monte Carlo from quadrature methods, whose grid sizes blow up exponentially. Every interesting Monte Carlo trick afterwards is a variance-reduction device: importance sampling, control variates, antithetic variates, Rao-Blackwellization, stratification, and all the way out to MCMC and variational methods.
Mental Model
You want the average value of some function $f$ under a probability distribution $p$. Forget closed-form integration; just generate independent draws $x_1, \dots, x_N \sim p$ and compute the empirical average $\hat{\mu}_N = \frac{1}{N}\sum_{i=1}^N f(x_i)$. The estimate $\hat{\mu}_N$ is random because the samples are random, but it concentrates around the true expectation $\mu = \mathbb{E}_p[f(x)]$ as $N$ grows.
Two things govern its quality. The bias of $\hat{\mu}_N$ is zero when the samples come from $p$. The variance of $\hat{\mu}_N$ shrinks like $\sigma^2/N$, where $\sigma^2 = \mathrm{Var}_p[f(x)]$ is the population variance of $f$. So the error decays as $O(N^{-1/2})$, which is the universal Monte Carlo rate. To halve the error you quadruple the sample count, period.
Variance reduction is everything that bends this constant. You cannot change the rate without leaving iid sampling, so you change $\sigma^2$ instead.
Formal Setup and Notation
Let $p$ be a probability distribution on a measurable space $\mathcal{X}$ and let $f : \mathcal{X} \to \mathbb{R}$ be an integrable function. The quantity of interest is $\mu = \mathbb{E}_p[f(x)] = \int_{\mathcal{X}} f(x)\, p(x)\, dx$.
Monte Carlo Estimator
Given $N$ iid samples $x_1, \dots, x_N \sim p$, the (plain) Monte Carlo estimator of $\mu$ is
$$\hat{\mu}_N = \frac{1}{N} \sum_{i=1}^N f(x_i).$$
It is a function of the random samples, so it is itself a random variable. We want to know: how close is $\hat{\mu}_N$ to $\mu$, and how does that closeness scale with $N$?
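The estimator is a one-liner. A minimal sketch (NumPy assumed; the integrand $f(x) = x^2$ under a standard normal, whose exact expectation is 1, is an illustrative choice, not from the text):

```python
import numpy as np

rng = np.random.default_rng(0)

def mc_estimate(f, sampler, n):
    """Plain Monte Carlo: average f over n iid draws from the sampler."""
    x = sampler(n)
    return f(x).mean()

# Illustrative check: E[x^2] under N(0,1) is exactly 1.
est = mc_estimate(lambda x: x**2, lambda n: rng.standard_normal(n), 100_000)
```

With $10^5$ samples the estimate lands within a fraction of a percent of the truth, consistent with the $\sigma/\sqrt{N}$ error bar derived below.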
Unbiased Estimator
An estimator $\hat{\theta}$ of a parameter $\theta$ is unbiased if $\mathbb{E}[\hat{\theta}] = \theta$. The plain Monte Carlo estimator is unbiased: $\mathbb{E}[\hat{\mu}_N] = \frac{1}{N}\sum_{i=1}^N \mathbb{E}[f(x_i)] = \mu$ by linearity of expectation.
Main Theorems
Monte Carlo Estimator: Unbiasedness and Variance
Statement
Let $x_1, \dots, x_N \stackrel{\text{iid}}{\sim} p$ and let $\sigma^2 = \mathrm{Var}_p[f(x)] < \infty$. Then:
- The estimator is unbiased: $\mathbb{E}[\hat{\mu}_N] = \mu$.
- Its variance is $\mathrm{Var}[\hat{\mu}_N] = \sigma^2 / N$.
- Its root mean squared error is $\mathrm{RMSE}(\hat{\mu}_N) = \sigma / \sqrt{N}$.
Intuition
Linearity of expectation makes the bias claim immediate. For the variance, independence is the only thing that matters: variances of independent summands add, so $\mathrm{Var}\big[\sum_{i=1}^N f(x_i)\big] = N\sigma^2$, and the $1/N$ prefactor enters squared, giving $N\sigma^2 / N^2 = \sigma^2 / N$.
The square root in the RMSE is the headline rate. It is dimension-free because nothing in the proof references the dimension of $\mathcal{X}$. That is why Monte Carlo beats grid-based quadrature in high dimensions even though it is much weaker per sample.
Proof Sketch
By linearity, $\mathbb{E}[\hat{\mu}_N] = \frac{1}{N}\sum_{i=1}^N \mathbb{E}[f(x_i)] = \mu$.
For the variance, independence gives $\mathrm{Var}[\hat{\mu}_N] = \frac{1}{N^2}\sum_{i=1}^N \mathrm{Var}[f(x_i)] = \frac{\sigma^2}{N}$, and since the estimator is unbiased, $\mathrm{RMSE} = \sqrt{\mathrm{Var}[\hat{\mu}_N]} = \sigma/\sqrt{N}$.
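The $\sigma^2/N$ formula can be checked empirically by repeating the whole estimator many times and measuring the spread of the repeated estimates (a sketch with NumPy; $f(x) = x^2$ under a standard normal, for which $\sigma^2 = \mathrm{Var}[x^2] = 2$, is an illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(1)
N, reps = 1_000, 2_000

# Repeat the N-sample estimator `reps` times and look at its spread.
estimates = np.array([(rng.standard_normal(N) ** 2).mean() for _ in range(reps)])

empirical_var = estimates.var()
predicted_var = 2.0 / N  # sigma^2 / N with sigma^2 = Var[x^2] = 2
```

The empirical variance of the estimator matches the $\sigma^2/N$ prediction to within sampling noise.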
Why It Matters
Every variance-reduction trick in Monte Carlo aims at the constant $\sigma^2$, not the rate. You cannot beat $O(N^{-1/2})$ with iid samples (this is the content of the central limit theorem); you can only reduce the variance of the per-sample estimator. That reframes the whole subject: control variates, antithetic variates, importance sampling, stratification, and Rao-Blackwellization are all tactics for choosing a different estimator of the same expectation, with the same unbiasedness, but smaller variance.
Failure Mode
Heavy-tailed integrands break this. If $f$ has infinite variance under $p$ (for example $f(x) = 1/x$ under a distribution with mass near zero), the variance bound is vacuous and the empirical average can have fluctuations that do not shrink. Robustness requires checking that $\mathbb{E}_p[f(x)^2] < \infty$, not just $\mathbb{E}_p[|f(x)|] < \infty$.
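A quick heavy-tail demonstration (illustrative distribution, not from the text): a Pareto with shape $\alpha = 1.5$ has a finite mean of $\alpha/(\alpha - 1) = 3$ but infinite variance, and individual draws hundreds or thousands of times the mean routinely dominate the running average.

```python
import numpy as np

rng = np.random.default_rng(4)
alpha, n = 1.5, 1_000_000

# Pareto(x_m=1, alpha=1.5): mean = alpha/(alpha-1) = 3 is finite,
# but the variance is infinite (alpha < 2).
x = 1.0 + rng.pareto(alpha, size=n)  # numpy's pareto is the Lomax form; shift by 1

running_mean = np.cumsum(x) / np.arange(1, n + 1)
biggest = x.max()  # single draws far beyond the mean of 3 dominate
```

The law of large numbers still applies (the mean is finite), but convergence is far slower than $O(N^{-1/2})$, and any plug-in standard error computed from these samples is meaningless.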
Asymptotic Distribution of the Monte Carlo Estimator
Statement
Under the same conditions, the law of large numbers gives $\hat{\mu}_N \to \mu$ almost surely, and the central limit theorem gives
$$\sqrt{N}\,(\hat{\mu}_N - \mu) \xrightarrow{d} \mathcal{N}(0, \sigma^2).$$
In words: the rescaled error is asymptotically Gaussian with variance $\sigma^2$, independent of dimension.
Intuition
The CLT is the source of the $O(N^{-1/2})$ rate. It tells you the shape of the Monte Carlo error in addition to its size: roughly Gaussian, mean zero, scale $\sigma/\sqrt{N}$. This is what justifies the standard confidence interval $\hat{\mu}_N \pm 1.96\,\hat{\sigma}/\sqrt{N}$ that appears in every Monte Carlo report.
Importantly, the CLT gives a confidence interval that is automatic in dimension. It does not care whether $\mathcal{X}$ is $\mathbb{R}$ or $\mathbb{R}^{1000}$. Grid-based numerical integration cannot make this claim.
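The interval is computed from the same samples that produce the estimate. A sketch (NumPy; the Exp(1) mean, exactly 1, is an illustrative target):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.exponential(size=50_000)  # illustrative target: E[x] = 1 under Exp(1)

mu_hat = x.mean()
se = x.std(ddof=1) / np.sqrt(len(x))  # plug-in standard error sigma_hat / sqrt(N)
ci = (mu_hat - 1.96 * se, mu_hat + 1.96 * se)  # asymptotic 95% interval
```

Nothing here referenced the dimension of the sample space; the same three lines work for a 1000-dimensional integrand.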
Why It Matters
The CLT lets you report Monte Carlo answers with calibrated uncertainty. The same CLT is also the starting point for the diffusion-limit analyses of MCMC scaling (Roberts-Gelman-Gilks 1997 for random-walk MH and Beskos-Pillai-Roberts-Sanz-Serna-Stuart 2013 for HMC), where the optimal acceptance rates and step sizes are the ones that match the diffusion's CLT.
Failure Mode
The CLT requires a finite second moment. If $\mathbb{E}_p[f(x)^2] = \infty$, $\hat{\mu}_N$ can converge much more slowly (or to a stable distribution other than the Gaussian, with infinite-variance fluctuations). Importance sampling with heavy-tailed weights is the canonical setting where this matters: the nominal $O(N^{-1/2})$ rate hides infinite-variance estimators, and reported standard errors become meaningless.
Why $O(N^{-1/2})$ Is Dimension-Free
The $\sigma/\sqrt{N}$ result does not contain a dimension. Compare to deterministic quadrature: the trapezoidal rule on a $d$-dimensional grid with $N$ total nodes has error $O(N^{-2/d})$, which collapses past $d \approx 4$. That is the famous curse of dimensionality for quadrature.
Monte Carlo trades that curse for a different one. The constant $\sigma$ can depend badly on dimension, especially when $p$ is concentrated on a thin manifold of $\mathbb{R}^d$. Importance sampling in high dimensions is the canonical counterexample: even though the rate stays $O(N^{-1/2})$, the variance can scale exponentially in $d$ when proposal and target diverge, so the practical sample size has to grow exponentially anyway. The lesson is twofold: the rate is dimension-free, the constant is not, and the entire craft of Monte Carlo is keeping that constant small.
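A quick illustration of the dimension-free rate (NumPy sketch; the 10-dimensional unit ball is an illustrative integrand, not from the text): 200,000 uniform samples in $[-1,1]^{10}$ estimate the ball's volume to a few percent, while a grid with the same node budget would have fewer than 4 points per axis.

```python
import numpy as np
from math import pi

rng = np.random.default_rng(3)
d, N = 10, 200_000

x = rng.uniform(-1.0, 1.0, size=(N, d))
inside = (x ** 2).sum(axis=1) <= 1.0

vol_est = inside.mean() * 2.0 ** d  # fraction of the cube inside the ball
vol_true = pi ** 5 / 120            # closed form for the unit 10-ball, ~2.55

# Same budget spent on a grid: only N**(1/d) ~ 3.4 points per axis.
points_per_axis = N ** (1.0 / d)
```

The relative error here is governed by the hit probability (about 0.25%), which is exactly the "constant is not dimension-free" caveat: in higher dimensions the ball occupies a vanishing fraction of the cube and the constant blows up.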
Variance Reduction: Tactics for Shrinking $\sigma^2$
Importance Sampling
Sample from a proposal $q$ instead of $p$, then reweight. Define the importance weight $w(x) = p(x)/q(x)$, requiring $q(x) > 0$ whenever $p(x) f(x) \neq 0$. The estimator
$$\hat{\mu}_N^{\mathrm{IS}} = \frac{1}{N} \sum_{i=1}^N w(x_i)\, f(x_i), \qquad x_i \sim q,$$
is unbiased for $\mu$. Its variance equals $\frac{1}{N}\,\mathrm{Var}_q[w(x) f(x)]$, which can be much smaller than $\sigma^2/N$ when $q$ is well chosen, but much larger when it is not. See Importance Sampling for the full treatment, including effective sample size and weight degeneracy.
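A sketch of the classic tail-probability use case (illustrative, not from the text): estimating $P(X > 4)$ for $X \sim \mathcal{N}(0,1)$. Plain Monte Carlo almost never sees a hit, but a proposal shifted into the tail works well.

```python
import numpy as np
from math import erfc, sqrt

rng = np.random.default_rng(5)
N = 100_000

# Target p = N(0,1); quantity of interest mu = P(X > 4) ~ 3.2e-5.
# Proposal q = N(4,1) puts its mass where the integrand lives.
x = rng.normal(4.0, 1.0, size=N)
log_w = -0.5 * x**2 + 0.5 * (x - 4.0) ** 2  # log p(x) - log q(x); normalizers cancel
est = np.mean((x > 4.0) * np.exp(log_w))

truth = 0.5 * erfc(4.0 / sqrt(2.0))  # exact Gaussian tail probability
```

The importance-sampling estimate is accurate to about a percent with $10^5$ draws; a plain Monte Carlo run of the same size would typically see only a handful of hits.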
Control Variates
Let $h$ be a function with known expectation $\eta = \mathbb{E}_p[h(x)]$. The control-variate estimator
$$\hat{\mu}_N^{\mathrm{CV}} = \frac{1}{N} \sum_{i=1}^N \big( f(x_i) - \beta\,(h(x_i) - \eta) \big)$$
is unbiased for $\mu$ for any $\beta$. The variance-minimizing choice is $\beta^* = \mathrm{Cov}[f, h]/\mathrm{Var}[h]$, achieving variance $(1 - \rho^2)\,\sigma^2/N$, where $\rho$ is the correlation between $f(x)$ and $h(x)$.
Strongly correlated controls give arbitrarily large variance reduction. This is why control variates underpin practical policy-gradient estimators (REINFORCE with a learned baseline) and ELBO gradient estimators (the score-function gradient with a baseline).
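A minimal control-variate sketch (illustrative integrand, not from the text): $f(U) = e^U$ for $U \sim \mathrm{Uniform}(0,1)$ with control $h(U) = U$, whose mean $1/2$ is known exactly; the two are correlated at $\rho \approx 0.99$, so the variance drops by well over an order of magnitude.

```python
import numpy as np

rng = np.random.default_rng(6)
N = 50_000

u = rng.uniform(size=N)
f = np.exp(u)  # target: E[e^U] = e - 1
h = u          # control variate with known mean eta = 1/2

beta = np.cov(f, h)[0, 1] / h.var(ddof=1)  # estimated optimal coefficient
plain = f.mean()
cv = (f - beta * (h - 0.5)).mean()

var_ratio = (f - beta * (h - 0.5)).var() / f.var()  # ~ 1 - rho^2
```

Estimating $\beta^*$ from the same samples introduces a negligible $O(1/N)$ bias; in practice (and in learned baselines) this is ignored.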
Antithetic Variates
For any monotone function $f$ and a uniform $U \sim \mathrm{Uniform}(0, 1)$, the pair $\big(f(U), f(1 - U)\big)$ is negatively correlated. Averaging $\frac{1}{2}\big(f(U) + f(1 - U)\big)$ over uniform draws produces an unbiased estimator with variance no larger than the iid version, often strictly smaller.
Generalizes to any setting where you can pair a sample with a deterministic "opposite" that has the same distribution but is anti-correlated in $f$.
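The same illustrative integrand $e^U$ gives a quick antithetic sketch: pairing $U$ with $1 - U$ makes $e^U$ and $e^{1-U}$ strongly anti-correlated, so the paired average is far tighter than an iid pair at the same cost.

```python
import numpy as np

rng = np.random.default_rng(7)
N = 50_000

u = rng.uniform(size=N)

# iid pairs vs antithetic pairs for E[e^U] = e - 1.
iid_pairs = 0.5 * (np.exp(rng.uniform(size=N)) + np.exp(rng.uniform(size=N)))
anti_pairs = 0.5 * (np.exp(u) + np.exp(1.0 - u))

est = anti_pairs.mean()
var_ratio = anti_pairs.var() / iid_pairs.var()  # well below 1
```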
Rao-Blackwellization
Rao-Blackwellization replaces $f(x_i)$ by its conditional expectation $\mathbb{E}[f(x) \mid t(x_i)]$ for some sufficient statistic or partial structure $t$. By the Rao-Blackwell theorem (an application of the law of total variance), the new estimator has equal mean and weakly smaller variance. In practice this is the trick that makes mixture-model Gibbs samplers efficient: integrate out conjugate parts analytically, sample only the parts you have to.
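A sketch with a two-component Gaussian mixture (all numbers are illustrative assumptions): to estimate $\mathbb{E}[x]$, replace each raw sample by its known conditional mean given the component label, which strips out the within-component noise entirely.

```python
import numpy as np

rng = np.random.default_rng(8)
N = 20_000

means, probs, sd = np.array([-1.0, 3.0]), np.array([0.3, 0.7]), 2.0

z = rng.choice(2, size=N, p=probs)  # component labels
x = rng.normal(means[z], sd)        # mixture samples; true mean = 1.8

naive = x.mean()      # average the raw samples
rb = means[z].mean()  # Rao-Blackwellized: E[x | z] is known analytically

var_ratio = means[z].var() / x.var()  # conditioning removes the sd^2 term
```

By the law of total variance the per-sample variance drops from $\mathrm{sd}^2 + \mathrm{Var}[\mu_z]$ to $\mathrm{Var}[\mu_z]$ alone; both estimators target the same mean of 1.8.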
Where Monte Carlo Lives in ML
The foundations of three ML subjects collapse to Monte Carlo estimation once you look closely.
Bayesian inference. Posterior expectations $\mathbb{E}_{p(\theta \mid \mathcal{D})}[g(\theta)]$ are exactly what MCMC approximates. The Monte Carlo average over chain samples is the same $\hat{\mu}_N$ as above, except the samples are no longer iid. The CLT now requires geometric ergodicity of the chain instead of independence, and the asymptotic variance picks up an autocorrelation factor (the integrated autocorrelation time).
REINFORCE / policy gradients. The policy gradient is a Monte Carlo expectation. Evaluating it requires sampling trajectories. Variance reduction here is not optional; raw REINFORCE has so much variance that learning is impractical. Control variates ("baselines") and generalized advantage estimation are the standard fixes.
ELBO gradients. For variational autoencoders, the ELBO gradient is again a Monte Carlo expectation over the variational distribution. The reparameterization trick, which writes $z = g_\phi(\epsilon)$ with $\epsilon$ sampled from a fixed noise distribution, is what gives a low-variance gradient estimator; without it, the score-function form is too noisy to train.
The unifying lesson: anywhere expectations are intractable, an unbiased-and-low-variance Monte Carlo estimator is what makes the gradient useful for stochastic optimization.
Common Confusions
More samples do not always make Monte Carlo good
The $O(N^{-1/2})$ rate is asymptotic and assumes finite variance. If $\sigma^2$ is huge or infinite, $N$ may need to be astronomically large for the estimator to be useful. Diagnostic: check that the running estimate of $\sigma^2$ stabilizes as $N$ grows. If it does not, the population variance is probably not finite and confidence intervals are unreliable.
Unbiased and consistent are not the same
The plain Monte Carlo estimator is both unbiased ($\mathbb{E}[\hat{\mu}_N] = \mu$ for every $N$) and consistent ($\hat{\mu}_N \to \mu$ as $N \to \infty$). But some Monte Carlo estimators are biased and consistent (self-normalized importance sampling), and some are unbiased but inconsistent (degenerate proposal distributions). Bayesian inference workflows rely heavily on biased-but-consistent self-normalized importance sampling, so the distinction matters.
Variance reduction does not change the rate
Control variates and antithetic variates do not turn $O(N^{-1/2})$ into $O(N^{-1})$. They reduce the constant $\sigma$ in $\sigma/\sqrt{N}$. The rate is a consequence of the CLT and is fixed for iid sampling. The only way to beat $O(N^{-1/2})$ is to leave iid sampling: quasi-Monte Carlo (low-discrepancy sequences) achieves $O(N^{-1} (\log N)^d)$ for smooth integrands in moderate dimensions, and multi-level Monte Carlo can also achieve faster rates under structural assumptions.
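A quasi-Monte Carlo sketch (assumes SciPy ≥ 1.7 for `scipy.stats.qmc`; the smooth integrand $f(u, v) = uv$ with exact mean $1/4$ is an illustrative choice): a scrambled Sobol' sequence typically lands much closer than iid sampling at the same budget.

```python
import numpy as np
from scipy.stats import qmc

rng = np.random.default_rng(12)
m = 12  # 2^m = 4096 points

sobol = qmc.Sobol(d=2, scramble=True, seed=12)
pts = sobol.random_base2(m=m)  # low-discrepancy points in [0,1)^2
qmc_est = (pts[:, 0] * pts[:, 1]).mean()

iid = rng.uniform(size=(2 ** m, 2))
mc_est = (iid[:, 0] * iid[:, 1]).mean()
# exact answer: E[UV] = 1/4
```

The QMC points are not iid, so the plain CLT does not apply to them; scrambling restores unbiasedness and allows error estimation across independent scrambles.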
Monte Carlo is not the same as MCMC
Plain Monte Carlo requires that you can already sample from $p$. When you cannot — the typical case in Bayesian posterior inference, where $p$ is known only up to a normalizing constant — you need MCMC. MCMC is Monte Carlo with correlated samples generated by a Markov chain whose stationary distribution is $p$. The variance analysis carries over but with an autocorrelation correction.
Canonical Examples
Estimating $\pi$ by hit-or-miss
Let $(u, v) \sim \mathrm{Uniform}([0, 1]^2)$ and define $f(u, v) = \mathbf{1}\{u^2 + v^2 \le 1\}$. Then $\mu = \mathbb{E}[f] = \pi/4$, so $4\hat{\mu}_N$ converges to $\pi$.
The estimator variance is $\mu(1 - \mu)/N$ with $\mu = \pi/4 \approx 0.785$, giving RMSE $4\sqrt{\mu(1-\mu)/N} \approx 1.64/\sqrt{N}$ for $4\hat{\mu}_N$. To get $\pi$ to 3 decimal places (RMSE $\approx 5 \times 10^{-4}$) requires $N \approx 10^7$. This is the canonical example because it shows the rate concretely; it is also a terrible way to compute $\pi$ in practice.
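The hit-or-miss estimator and its predicted RMSE, as a runnable sketch (NumPy):

```python
import numpy as np

rng = np.random.default_rng(9)
N = 1_000_000

u, v = rng.uniform(size=N), rng.uniform(size=N)
hits = u**2 + v**2 <= 1.0  # indicator of landing inside the quarter circle

pi_hat = 4.0 * hits.mean()
mu = np.pi / 4
rmse_pred = 4.0 * np.sqrt(mu * (1 - mu) / N)  # ~1.6e-3 at N = 10^6
```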
Posterior expectation under a Gaussian prior and likelihood
Suppose $\theta \sim \mathcal{N}(0, 1)$ and $y \mid \theta \sim \mathcal{N}(\theta, 1)$ with one observation $y = 2$. The posterior is $\theta \mid y \sim \mathcal{N}(1, 1/2)$ (a closed form here). To estimate $\mathbb{E}[\theta^3 \mid y]$, draw $\theta_i \sim \mathcal{N}(1, 1/2)$ for $i = 1, \dots, N$ and average $\theta_i^3$.
The exact answer is $m^3 + 3 m s^2 = 1 + 1.5 = 2.5$ (third raw moment of $\mathcal{N}(1, 1/2)$). The Monte Carlo answer with $N = 10^4$ gets within about $\pm 0.08$ with 95% confidence. Doubling the precision requires $4N$ samples.
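The same computation as a runnable sketch (NumPy; the $\mathcal{N}(1, 1/2)$ posterior and the $N = 10^5$ budget are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(10)
N = 100_000

# Posterior N(m=1, s^2=1/2) from prior N(0,1), likelihood N(theta,1), one obs y=2.
m, s2 = 1.0, 0.5
theta = rng.normal(m, np.sqrt(s2), size=N)

est = (theta**3).mean()
exact = m**3 + 3 * m * s2  # Gaussian third raw moment: m^3 + 3 m s^2 = 2.5
```

This is the whole Bayesian-computation pattern in miniature: sampling from the posterior plus an average replaces an integral that only happens to have a closed form in this toy case.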
REINFORCE for a contextual bandit
A logistic policy $\pi_\theta(a \mid s)$ in a contextual bandit with reward $r(s, a)$. The policy-gradient estimator is
$$\hat{g} = \frac{1}{N} \sum_{i=1}^N \nabla_\theta \log \pi_\theta(a_i \mid s_i)\,\big(r_i - b(s_i)\big),$$
where $b(s)$ is a control variate (a baseline, often an estimate of the expected reward). Subtracting $b(s)$ does not change the gradient in expectation (since $\mathbb{E}_{a \sim \pi_\theta(\cdot \mid s)}[\nabla_\theta \log \pi_\theta(a \mid s)] = 0$) but reduces variance by a factor of $1 - \rho^2$, where $\rho$ is the correlation between the reward and the baseline term.
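A stripped-down sketch (an illustrative two-action bandit with a single state, so the logistic policy reduces to a Bernoulli over actions; all numbers are assumptions, not from the text): the baseline leaves the gradient estimate unbiased while collapsing its variance.

```python
import numpy as np

rng = np.random.default_rng(11)
N = 100_000

theta = 0.0                      # policy logit; pi(a=1) = sigmoid(theta) = 0.5
p1 = 1.0 / (1.0 + np.exp(-theta))
rewards = np.array([1.0, 2.0])   # deterministic reward per action

a = (rng.uniform(size=N) < p1).astype(int)
r = rewards[a]
score = a - p1                   # d/dtheta log pi(a) for a sigmoid policy

grad_raw = score * r                # REINFORCE samples; mean = true gradient
baseline = r.mean()                 # constant baseline ~ E[r]
grad_base = score * (r - baseline)  # same mean, far smaller variance

true_grad = p1 * (1 - p1) * (rewards[1] - rewards[0])  # = 0.25
```

Both estimators average to the true gradient of 0.25; subtracting the baseline removes the reward offset that the score multiplies, which in this toy case eliminates almost all of the variance.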
Exercises
Problem
Let $x_1, \dots, x_N \stackrel{\text{iid}}{\sim} p$ with mean $\mu$ and variance $\sigma^2$. Show that the estimator $s^2 = \frac{1}{N-1} \sum_{i=1}^N (x_i - \hat{\mu}_N)^2$ is an unbiased estimator of $\sigma^2$, and explain why the $N - 1$ rather than $N$ correction is needed.
Problem
A control variate $h$ has known mean $\eta = \mathbb{E}_p[h(x)]$ and correlation $\rho$ with $f(x)$. Derive the optimal coefficient $\beta^*$ and the resulting variance reduction factor $1 - \rho^2$.
Problem
You want to estimate $\mu = \mathbb{E}[e^{\lambda x}]$ where $x \sim \mathcal{N}(0, 1)$. The exact answer is $e^{\lambda^2/2}$. Show that the relative variance of the plain Monte Carlo estimator scales like $e^{\lambda^2}/N$, and conclude that for $\lambda = 3$ you need $N \approx 10^8$ samples to match a 1% relative error.
References
Canonical:
- Robert & Casella, Monte Carlo Statistical Methods (2nd ed., 2004), Chapters 1-3. The standard graduate-level reference for plain Monte Carlo, importance sampling, and variance reduction.
- Owen, Monte Carlo Theory, Methods and Examples (2013, online at statweb.stanford.edu/~owen/mc/), Chapters 1-9. Free, comprehensive, with a clear treatment of estimator efficiency.
- Liu, Monte Carlo Strategies in Scientific Computing (2001), Chapters 1-4.
Statistical foundations:
- Casella & Berger, Statistical Inference (2nd ed., 2002), Chapter 5 (sampling distributions, unbiased estimation), Chapter 7 (point estimation, Rao-Blackwell).
- Lehmann & Casella, Theory of Point Estimation (2nd ed., 1998), Chapters 1-2.
ML applications:
- Murphy, Probabilistic Machine Learning: Advanced Topics (2023), Chapter 11 (Monte Carlo inference).
- Bishop, Pattern Recognition and Machine Learning (2006), Chapter 11 (sampling methods).
- Sutton & Barto, Reinforcement Learning (2nd ed., 2018), Chapter 13 (policy-gradient methods, REINFORCE, baselines as control variates).
- Kingma & Welling, "Auto-Encoding Variational Bayes" (2013, arXiv:1312.6114). Source paper for the reparameterization trick as a variance-reduction device for ELBO gradient estimators.
Next Topics
The natural next steps from plain Monte Carlo:
- Importance Sampling: when you can't sample from $p$ but can evaluate it, sample from a proposal $q$ and reweight. The cornerstone of off-policy RL and variational diagnostics.
- Rejection Sampling: produce exact iid samples from $p$ when an envelope is available. The pedagogical bridge to MCMC.
- Markov Chain Monte Carlo: relax iid sampling. Build a Markov chain whose stationary distribution is the target.
- Metropolis-Hastings: the foundational MCMC construction for sampling unnormalized posteriors.
Last reviewed: May 6, 2026