
Statistical Estimation

Central Limit Theorem

The CLT: the sample mean is approximately Gaussian for large n, regardless of the original distribution. Berry-Esseen rate, multivariate CLT, and why CLT explains asymptotic normality of MLE, confidence intervals, and the ubiquity of the Gaussian.


Why This Matters

[Figure: density of the sample mean $\bar{X}_n$ for $n = 1$ (uniform), $n = 2$ (triangular), $n = 5$, and $n = 12$, with the Gaussian limit $\mathcal{N}(0.5, 1/360)$ overlaid. As $n$ grows, the distribution tightens and approaches Gaussian regardless of the original shape.]

The central limit theorem explains why the Gaussian distribution appears everywhere. It is not because nature is inherently Gaussian --- it is because many quantities of interest are averages (or behave like averages), and the CLT says that averages are approximately Gaussian regardless of the distribution of the individual terms.

In machine learning and statistics, the CLT is the foundation for:

  • Asymptotic normality of MLE: the maximum likelihood estimator is approximately Gaussian for large $n$, with variance given by the inverse Fisher information. This is a direct consequence of the CLT.
  • Confidence intervals: the classic $\bar{X} \pm z_{\alpha/2} \cdot s/\sqrt{n}$ interval relies on the CLT to justify using Gaussian quantiles.
  • Hypothesis testing: test statistics (t-tests, z-tests, Wald tests) are approximately Gaussian by the CLT, which determines rejection thresholds.
  • Monte Carlo error bars: the CLT tells you that Monte Carlo estimates have approximately Gaussian errors, enabling standard error calculations.
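The Monte Carlo bullet is easy to make concrete. A minimal standard-library sketch (the helper `monte_carlo_mean` and the toy integrand $\mathbb{E}[U^2] = 1/3$ are illustrative choices, not from the text):

```python
import math
import random
import statistics

def monte_carlo_mean(draw, n, z=1.96, seed=0):
    """Estimate E[draw()] with a CLT-based 95% confidence interval.

    The CLT justifies the error bar: the average of n i.i.d. samples is
    approximately Gaussian, so mean +/- z * s / sqrt(n) covers the truth
    about 95% of the time for large n.
    """
    random.seed(seed)
    xs = [draw() for _ in range(n)]
    mean = statistics.fmean(xs)
    se = statistics.stdev(xs) / math.sqrt(n)  # standard error s / sqrt(n)
    return mean, (mean - z * se, mean + z * se)

# Toy integrand: E[U^2] for U ~ Uniform(0, 1); the exact value is 1/3.
est, (lo, hi) = monte_carlo_mean(lambda: random.random() ** 2, n=100_000)
```

The interval width shrinks like $1/\sqrt{n}$: quadrupling the sample count halves the error bar.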

Mental Model

The LLN tells you that the sample mean converges. The CLT tells you how: the fluctuations around the true mean are Gaussian, with standard deviation $\sigma/\sqrt{n}$. If you zoom in on the convergence by multiplying by $\sqrt{n}$, the magnified fluctuations have a specific, universal shape --- the bell curve.

The CLT is a universality result: no matter what distribution the $X_i$ come from (as long as they have finite variance), the standardized average looks the same. This universality is why the Gaussian is ubiquitous.

Formal Setup

Definition

Convergence in Distribution

A sequence $X_n$ converges in distribution to $X$ if for every continuous bounded function $g$:

$$\lim_{n \to \infty} \mathbb{E}[g(X_n)] = \mathbb{E}[g(X)]$$

Equivalently: $F_{X_n}(t) \to F_X(t)$ at every continuity point $t$ of $F_X$.

Convergence in distribution is the weakest mode of convergence. It says the distributions of $X_n$ approach the distribution of $X$, but the random variables $X_n$ and $X$ need not live on the same probability space or be "close" in any pathwise sense.

Main Theorems

Theorem

Classical Central Limit Theorem

Statement

If $X_1, X_2, \ldots$ are i.i.d. with mean $\mu$ and variance $\sigma^2 \in (0, \infty)$, then:

$$\frac{\sqrt{n}(\bar{X}_n - \mu)}{\sigma} \xrightarrow{d} \mathcal{N}(0, 1)$$

Equivalently: $\sqrt{n}(\bar{X}_n - \mu) \xrightarrow{d} \mathcal{N}(0, \sigma^2)$.

In practical terms: for large $n$, $\bar{X}_n \approx \mathcal{N}(\mu, \sigma^2/n)$.
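As a sanity check, one can simulate the standardized statistic for a skewed distribution and confirm it behaves like a standard normal. A standard-library sketch, using $\operatorname{Exp}(1)$ draws (so $\mu = \sigma = 1$):

```python
import math
import random
import statistics

random.seed(42)

def standardized_mean(n, mu=1.0, sigma=1.0):
    """Draw n Exp(1) samples and return Z_n = sqrt(n) * (X-bar - mu) / sigma."""
    xbar = statistics.fmean(random.expovariate(1.0) for _ in range(n))
    return math.sqrt(n) * (xbar - mu) / sigma

# Replicates of Z_n; by the CLT these should look like N(0, 1),
# even though the individual Exp(1) draws are highly skewed.
zs = [standardized_mean(n=200) for _ in range(5_000)]
mean_z, sd_z = statistics.fmean(zs), statistics.stdev(zs)
frac_in = sum(abs(z) < 1.96 for z in zs) / len(zs)
```

For $n = 200$ the empirical mean is near 0, the standard deviation near 1, and roughly 95% of replicates fall within $\pm 1.96$.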

Intuition

Each $X_i$ contributes a random "kick" to the sum. For large $n$, the aggregate effect of many small independent kicks is Gaussian --- this is the essence of the CLT. The kicks can come from any distribution (Bernoulli, Exponential, Poisson, anything with finite variance), and the result is always the same bell curve. The only things that matter are the mean (which centers the curve) and the variance (which sets the width).

The $\sqrt{n}$ scaling is crucial: the sum $\sum X_i$ has mean $n\mu$ and standard deviation $\sigma\sqrt{n}$, so the standardized version $(\sum X_i - n\mu)/(\sigma\sqrt{n})$ remains $O(1)$. This is the "zoom" that reveals the Gaussian structure.

Proof Sketch

(Via characteristic functions.) The characteristic function of $X_i$ is $\varphi(t) = \mathbb{E}[e^{itX_i}]$. By Taylor expansion near $t = 0$:

$$\varphi(t) = 1 + it\mu - \frac{t^2(\sigma^2 + \mu^2)}{2} + o(t^2)$$

The characteristic function of $Z_n = \sqrt{n}(\bar{X}_n - \mu)/\sigma$ is:

$$\varphi_{Z_n}(t) = \left[\varphi\!\left(\frac{t}{\sigma\sqrt{n}}\right) e^{-it\mu/(\sigma\sqrt{n})}\right]^n$$

Let $\psi(t) = \varphi(t/(\sigma\sqrt{n}))\, e^{-it\mu/(\sigma\sqrt{n})}$ be the characteristic function of $(X_i - \mu)/(\sigma\sqrt{n})$. Then:

$$\psi(t) = 1 - \frac{t^2}{2n} + o(t^2/n)$$

and $\psi(t)^n \to e^{-t^2/2}$ as $n \to \infty$. This is the characteristic function of $\mathcal{N}(0, 1)$. By Lévy's continuity theorem, $Z_n \xrightarrow{d} \mathcal{N}(0, 1)$.
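The limit $\psi(t)^n \to e^{-t^2/2}$ can be checked numerically for a distribution with a closed-form characteristic function. A sketch for $\operatorname{Exp}(1)$, whose characteristic function is $\varphi(t) = 1/(1 - it)$ (here $\mu = \sigma = 1$):

```python
import cmath

def psi_pow_n(t, n):
    """psi(t)^n for Exp(1), using its characteristic function 1/(1 - it).

    With mu = sigma = 1: psi(t) = phi(t / sqrt(n)) * exp(-i t / sqrt(n)).
    """
    s = t / n ** 0.5
    psi = cmath.exp(-1j * s) / (1 - 1j * s)
    return psi ** n

t = 1.5
limit = cmath.exp(-t ** 2 / 2)  # characteristic function of N(0, 1) at t
errs = [abs(psi_pow_n(t, n) - limit) for n in (10, 100, 1000, 10_000)]
```

The errors shrink steadily as $n$ grows (at rate $O(1/\sqrt{n})$ at fixed $t$, driven by the third moment), matching the proof sketch.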

Why It Matters

The CLT is the single most used theorem in statistics. Every time you compute a confidence interval, perform a hypothesis test, or report a standard error, you are (implicitly or explicitly) invoking the CLT.

In ML theory, the CLT appears in:

  • Asymptotic normality of MLE: under regularity conditions, $\sqrt{n}(\hat{\theta}_{\text{MLE}} - \theta_0) \xrightarrow{d} \mathcal{N}(0, I(\theta_0)^{-1})$, where $I(\theta_0)$ is the Fisher information
  • Asymptotic theory of M-estimators: any estimator defined as the minimizer of an empirical objective is asymptotically normal under smoothness conditions
  • Bootstrap validity: the bootstrap works because the CLT ensures the sampling distribution is approximately Gaussian

Failure Mode

The CLT requires finite variance ($\sigma^2 < \infty$). For heavy-tailed distributions with infinite variance (e.g., stable distributions with index $\alpha < 2$), the appropriately normalized sum converges to a non-Gaussian stable distribution. The CLT also gives no information about how large $n$ needs to be for the approximation to be accurate. For highly skewed or heavy-tailed distributions, the Gaussian approximation can be poor even for moderately large $n$. The Berry-Esseen bound addresses this.
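This failure is visible in simulation. For standard Cauchy samples (infinite variance, undefined mean), $\bar{X}_n$ has the same standard Cauchy distribution for every $n$, so averaging never concentrates. A standard-library sketch using the inverse-CDF transform:

```python
import math
import random
import statistics

random.seed(0)

def cauchy():
    """Standard Cauchy draw via the inverse CDF: tan(pi * (U - 1/2))."""
    return math.tan(math.pi * (random.random() - 0.5))

def iqr_of_sample_means(n, reps=2_000):
    """Interquartile range of X-bar_n across many replicates."""
    means = sorted(statistics.fmean(cauchy() for _ in range(n))
                   for _ in range(reps))
    return means[3 * reps // 4] - means[reps // 4]

# For a finite-variance distribution, the IQR would shrink like 1/sqrt(n).
# For Cauchy, it stays near 2 (the IQR of a standard Cauchy) at every n.
spread_small, spread_large = iqr_of_sample_means(10), iqr_of_sample_means(1_000)
```

Increasing $n$ by a factor of 100 leaves the spread of the sample mean essentially unchanged, in stark contrast to the $\sigma/\sqrt{n}$ behavior the CLT would predict.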

Theorem

Berry-Esseen Theorem

Statement

If $X_1, \ldots, X_n$ are i.i.d. with $\mathbb{E}[X_i] = 0$, $\operatorname{Var}(X_i) = \sigma^2$, and $\mathbb{E}[|X_i|^3] = \rho < \infty$, then:

$$\sup_t \left| \Pr\!\left[\frac{\bar{X}_n}{\sigma/\sqrt{n}} \leq t\right] - \Phi(t) \right| \leq \frac{C\rho}{\sigma^3 \sqrt{n}}$$

where $\Phi$ is the standard normal CDF and $C$ is a universal constant ($C \leq 0.4748$, due to Shevtsova 2011).

The rate of convergence in the CLT is $O(1/\sqrt{n})$.

Intuition

Berry-Esseen quantifies what the CLT leaves vague: the Gaussian approximation is accurate to within $O(1/\sqrt{n})$ in the CDF. The constant depends on the ratio $\rho/\sigma^3$, which measures how "non-Gaussian" the original distribution is. For symmetric distributions, the constant is smaller. For highly skewed distributions, the constant is larger, and you need more samples for the Gaussian approximation to kick in.

Proof Sketch

The proof uses smoothing techniques for characteristic functions. The key steps are:

  1. Bound the difference of CDFs using the smoothing inequality: $|F(t) - G(t)| \leq \frac{1}{\pi}\int_{-T}^{T} \frac{|\varphi_F(s) - \varphi_G(s)|}{|s|}\,ds + O(1/T)$

  2. Use the Taylor expansion of the characteristic function to third order to bound $|\varphi_{Z_n}(t) - e^{-t^2/2}|$

  3. Optimize over $T$ to get the $O(1/\sqrt{n})$ bound

Why It Matters

Berry-Esseen tells you when the CLT approximation is good enough to trust. For the Bernoulli distribution with $p = 1/2$: $\sigma = 1/2$ and $\rho = \mathbb{E}|X_i - 1/2|^3 = 1/8$, so $\rho/\sigma^3 = 1$ and the bound gives an error of at most $0.475/\sqrt{n}$. For $n = 100$, the error is at most about 0.047, roughly a 5% error in the CDF. For $n = 10{,}000$, it is at most about 0.005.

The $O(1/\sqrt{n})$ rate is tight: it cannot be improved in general. Driving the worst-case CDF error below 1% can therefore require $n$ in the thousands, depending on $\rho/\sigma^3$. In practice, the CLT approximation is often better than the worst case, especially for symmetric distributions.
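For the fair coin the worst-case CDF error is exactly computable, so the Berry-Esseen bound can be checked directly. A standard-library sketch (for $p = 1/2$: $\sigma = 1/2$ and $\rho = \mathbb{E}|X_i - 1/2|^3 = 1/8$, so $\rho/\sigma^3 = 1$):

```python
import math

def phi_cdf(z):
    """Standard normal CDF via erf."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def max_cdf_error(n):
    """sup_t | Pr[Z_n <= t] - Phi(t) | for S_n ~ Bin(n, 1/2).

    Z_n = (S_n - n/2) / (sqrt(n)/2) is a step function in t, so the
    supremum is attained at a jump point; check both sides of each jump.
    """
    cdf, err = 0.0, 0.0
    for k in range(n + 1):
        p_left = cdf                           # Pr[S_n <= k - 1]
        cdf += math.comb(n, k) * 0.5 ** n      # Pr[S_n <= k]
        z = (k - n / 2) / (math.sqrt(n) / 2)
        err = max(err, abs(cdf - phi_cdf(z)), abs(p_left - phi_cdf(z)))
    return err

n = 100
bound = 0.4748 * 1.0 / math.sqrt(n)   # C * (rho / sigma^3) / sqrt(n)
actual = max_cdf_error(n)
```

For $n = 100$ the exact worst-case error comes out near 0.04, just under the bound of about 0.047 --- for this symmetric distribution the bound is fairly tight.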

Failure Mode

Berry-Esseen requires a finite third moment. If $\mathbb{E}[|X|^3] = \infty$ (but $\operatorname{Var}(X) < \infty$), the CLT still holds but the rate of convergence can be slower than $O(1/\sqrt{n})$. Also, the Berry-Esseen bound is a uniform bound over all $t$; for tail probabilities (large $|t|$), the relative error of the Gaussian approximation can be much worse.

Multivariate CLT

Definition

Multivariate Central Limit Theorem

If $X_1, \ldots, X_n \in \mathbb{R}^d$ are i.i.d. with mean $\mu \in \mathbb{R}^d$ and covariance matrix $\Sigma \in \mathbb{R}^{d \times d}$ (positive definite), then:

$$\sqrt{n}(\bar{X}_n - \mu) \xrightarrow{d} \mathcal{N}(0, \Sigma)$$

This says: the joint distribution of the standardized sample mean vector converges to a multivariate Gaussian with the same covariance structure. The Cramér-Wold theorem provides the key tool for proving multivariate convergence: it suffices to show that every one-dimensional projection converges.

Application: The asymptotic distribution of the MLE $\hat{\theta}_{\text{MLE}} \in \mathbb{R}^d$ is $\mathcal{N}(\theta_0, I(\theta_0)^{-1}/n)$, which is a direct consequence of the multivariate CLT applied to the score function $\nabla_\theta \log p(X_i \mid \theta)$.
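A simulation sketch of the multivariate statement, using an arbitrary non-Gaussian pair (one Bernoulli coordinate plus a correlated uniform-shifted coordinate; the mean and $\Sigma$ of this toy choice are known in closed form):

```python
import math
import random
import statistics

random.seed(1)

def draw_pair():
    """Non-Gaussian 2-d variable: B ~ Bern(1/2), U ~ Uniform(-1/2, 1/2).

    X = (B, B + U) has mean (1/2, 1/2) and covariance
    Sigma = [[1/4, 1/4], [1/4, 1/4 + 1/12]].
    """
    b = float(random.random() < 0.5)
    return b, b + random.random() - 0.5

def scaled_mean(n):
    """One replicate of sqrt(n) * (X-bar_n - mu)."""
    pairs = [draw_pair() for _ in range(n)]
    m1 = statistics.fmean(p[0] for p in pairs)
    m2 = statistics.fmean(p[1] for p in pairs)
    return math.sqrt(n) * (m1 - 0.5), math.sqrt(n) * (m2 - 0.5)

# Empirical covariance of the scaled mean across replicates should match Sigma.
reps = [scaled_mean(n=200) for _ in range(4_000)]
var1 = statistics.fmean(u * u for u, v in reps)
var2 = statistics.fmean(v * v for u, v in reps)
cov12 = statistics.fmean(u * v for u, v in reps)
```

The joint covariance structure, not just the marginals, is reproduced in the Gaussian limit --- exactly what the Cramér-Wold projection argument guarantees.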

Canonical Examples

Example

Sample mean of Exponential variables

Let $X_i \sim \operatorname{Exp}(1)$ with $\mu = 1$ and $\sigma^2 = 1$. By the CLT, $\sqrt{n}(\bar{X}_n - 1) \xrightarrow{d} \mathcal{N}(0, 1)$.

For $n = 30$: $\bar{X}_{30} \approx \mathcal{N}(1, 1/30)$. A 95% confidence interval for $\mu$ is $\bar{X}_{30} \pm 1.96/\sqrt{30} \approx \bar{X}_{30} \pm 0.358$.

The Exponential distribution is skewed (skewness = 2), so the Gaussian approximation is not perfect for $n = 30$. Berry-Esseen with $\rho/\sigma^3 = \mathbb{E}|X_i - 1|^3 \approx 2.41$ gives an error bound of $\sim 0.21$; the approximation is rough. By $n = 100$, the bound drops to $\sim 0.11$, and by $n = 1000$, to $\sim 0.036$.
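The practical effect of this skew shows up in interval coverage. A simulation sketch (assuming the known variance $\sigma = 1$, as in the example) that estimates how often $\bar{X}_{30} \pm 1.96/\sqrt{30}$ actually contains $\mu = 1$:

```python
import math
import random
import statistics

random.seed(7)

def covers(n, mu=1.0, sigma=1.0, z=1.96):
    """Does the CLT interval X-bar +/- z * sigma / sqrt(n) contain mu?"""
    xbar = statistics.fmean(random.expovariate(1.0) for _ in range(n))
    half = z * sigma / math.sqrt(n)
    return xbar - half <= mu <= xbar + half

trials = 20_000
coverage = sum(covers(n=30) for _ in range(trials)) / trials
```

Coverage typically lands close to, but slightly off, the nominal 95% at $n = 30$ --- consistent with the Berry-Esseen warning that the Gaussian approximation is imperfect for small samples from skewed distributions.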

Example

CLT for Bernoulli: the normal approximation to the Binomial

If $X_i \sim \operatorname{Bern}(p)$, then $S_n = \sum_i X_i \sim \operatorname{Bin}(n, p)$. The CLT gives:

$$\frac{S_n - np}{\sqrt{np(1-p)}} \xrightarrow{d} \mathcal{N}(0, 1)$$

This is the normal approximation to the Binomial. The rule of thumb $np \geq 5$ and $n(1-p) \geq 5$ ensures the approximation is reasonable.

For $p = 0.5$ and $n = 100$: $S_{100} \approx \mathcal{N}(50, 25)$. The probability $\Pr[S_{100} \geq 60] \approx \Pr[Z \geq 2] \approx 0.023$. The exact Binomial probability is $\Pr[S_{100} \geq 60] \approx 0.0284$; a continuity correction ($\Pr[Z \geq 1.9] \approx 0.0287$) closes most of the gap. The CLT approximation is close but not exact for tail probabilities.
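These tail probabilities are cheap to verify exactly, since the Binomial CDF is a finite sum. A standard-library sketch:

```python
import math

def binom_sf(k, n, p=0.5):
    """Exact Pr[S_n >= k] for S_n ~ Bin(n, p)."""
    return sum(math.comb(n, j) * p ** j * (1 - p) ** (n - j)
               for j in range(k, n + 1))

def normal_sf(z):
    """Pr[Z >= z] for Z ~ N(0, 1), via erf."""
    return 0.5 * (1.0 - math.erf(z / math.sqrt(2.0)))

n, p = 100, 0.5                          # mean 50, standard deviation 5
exact = binom_sf(60, n, p)               # exact Pr[S_100 >= 60]
plain = normal_sf((60 - 50) / 5)         # CLT, no continuity correction
corrected = normal_sf((59.5 - 50) / 5)   # CLT with continuity correction
```

The continuity correction (treating the integer event $\{S_n \geq 60\}$ as the interval $[59.5, \infty)$) recovers most of the discrepancy between the plain CLT value and the exact one.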

Common Confusions

Watch Out

The CLT does NOT say individual observations are Gaussian

The CLT says the average $\bar{X}_n$ is approximately Gaussian for large $n$. Individual observations $X_i$ retain their original distribution --- they could be Bernoulli, Exponential, Poisson, or anything else. The Gaussian emerges only through averaging. A single coin flip is not Gaussian; the average of a million coin flips is.

Watch Out

The CLT is an asymptotic result, not a finite-sample guarantee

The CLT says $Z_n \xrightarrow{d} \mathcal{N}(0,1)$ as $n \to \infty$. It does not say the approximation is good for any specific $n$. For $n = 5$ with a highly skewed distribution, the Gaussian approximation can be terrible. Berry-Esseen gives a finite-sample bound, but even that bound can be loose. In practice, use the CLT only when $n$ is "large enough," which depends on the distribution.

Watch Out

Convergence in distribution does not imply convergence in probability

The CLT gives convergence in distribution: $Z_n \xrightarrow{d} Z$ where $Z \sim \mathcal{N}(0,1)$. This does not mean $Z_n - Z \to 0$ in any sense. The random variables $Z_n$ are determined by the data; $Z$ is a "fresh" Gaussian that has nothing to do with the data. Convergence in distribution is a statement about the shape of the distribution, not about the values of the random variables.

Watch Out

Finite variance is required, not just finite mean

The LLN requires only a finite mean. The CLT requires a finite variance. For distributions with finite mean but infinite variance (e.g., stable distributions with index $1 < \alpha < 2$), the sample mean still converges (by the LLN) but the CLT does not hold in its standard form. The normalized sum converges to a non-Gaussian stable distribution instead.

Summary

  • CLT: $\sqrt{n}(\bar{X}_n - \mu)/\sigma \xrightarrow{d} \mathcal{N}(0, 1)$ for i.i.d. variables with finite variance
  • The CLT gives the rate of LLN convergence: fluctuations are $O(\sigma/\sqrt{n})$
  • Berry-Esseen: the CDF error is $O(1/\sqrt{n})$, with constant depending on the third moment
  • Multivariate CLT: $\sqrt{n}(\bar{X}_n - \mu) \xrightarrow{d} \mathcal{N}(0, \Sigma)$
  • CLT explains: asymptotic normality of MLE, confidence intervals, hypothesis tests, and why the Gaussian is everywhere
  • CLT does not say individual observations are Gaussian --- only the average is
  • Finite variance is required; infinite variance gives non-Gaussian limits

Exercises

ExerciseCore

Problem

A factory produces bolts whose lengths have mean $\mu = 10$ cm and standard deviation $\sigma = 0.5$ cm (the distribution of lengths is unknown). You measure $n = 100$ bolts and compute $\bar{X}_{100}$. Using the CLT, find an interval that contains $\mu$ with approximately 95% probability.

ExerciseAdvanced

Problem

Let $X_1, \ldots, X_n \sim \operatorname{Bern}(p)$ with $p$ unknown. The MLE is $\hat{p} = \bar{X}_n$. Use the CLT to derive the asymptotic distribution of $\hat{p}$ and construct an asymptotic 95% confidence interval for $p$. Why must you use $\hat{p}$ in the variance estimate, and what does the delta method give for $g(p) = \log(p/(1-p))$?

ExerciseResearch

Problem

The CLT assumes finite variance. What happens to the normalized sum $S_n / n^{1/\alpha}$ when $X_i$ has a symmetric stable distribution with index $\alpha \in (0, 2)$? Why does this matter for financial data, and how does it affect standard ML algorithms that assume sub-Gaussian tails?


References

Canonical:

  • Durrett, Probability: Theory and Examples (5th ed., 2019), Section 3.4
  • Billingsley, Probability and Measure (3rd ed., 1995), Sections 27-28
  • Feller, An Introduction to Probability Theory and Its Applications Vol. 2 (1971), Chapter XVI

Current:

  • Vershynin, High-Dimensional Probability (2018), Section 2.7
  • van der Vaart, Asymptotic Statistics (1998), Chapters 2-3
  • Casella & Berger, Statistical Inference (2002), Chapters 5-10

Next Topics

Building on the central limit theorem:

  • Maximum likelihood estimation: asymptotic normality of the MLE as a CLT consequence
  • Hypothesis testing for ML: test statistics and p-values via the Gaussian approximation
  • Concentration inequalities: non-asymptotic, finite-sample alternatives to the CLT

Last reviewed: April 2026
