Statistical Estimation
Central Limit Theorem
The CLT: the sample mean is approximately Gaussian for large n, regardless of the original distribution. Berry-Esseen rate, multivariate CLT, and why CLT explains asymptotic normality of MLE, confidence intervals, and the ubiquity of the Gaussian.
Why This Matters
The central limit theorem explains why the Gaussian distribution appears everywhere. It is not because nature is inherently Gaussian --- it is because many quantities of interest are averages (or behave like averages), and the CLT says that averages are approximately Gaussian regardless of the distribution of the individual terms.
In machine learning and statistics, the CLT is the foundation for:
- Asymptotic normality of MLE: the maximum likelihood estimator is approximately Gaussian for large $n$, with variance given by the inverse Fisher information. This is a direct consequence of the CLT.
- Confidence intervals: the classic interval $\bar{X}_n \pm 1.96\, s/\sqrt{n}$ relies on the CLT to justify using Gaussian quantiles.
- Hypothesis testing: test statistics (t-tests, z-tests, Wald tests) are approximately Gaussian by the CLT, which determines rejection thresholds.
- Monte Carlo error bars: the CLT tells you that Monte Carlo estimates have approximately Gaussian errors, enabling standard error calculations.
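For the Monte Carlo point in particular, the recipe is concrete enough to sketch. The function below is illustrative (names and parameters are not from any particular library; standard library only): it reports an estimate of $\mathbb{E}[f(X)]$ together with the CLT-based standard error $s/\sqrt{n}$.

```python
import random
import math

def mc_estimate(f, sampler, n=100_000, seed=0):
    """Monte Carlo estimate of E[f(X)] with its CLT-based standard error."""
    rng = random.Random(seed)
    vals = [f(sampler(rng)) for _ in range(n)]
    mean = sum(vals) / n
    s2 = sum((v - mean) ** 2 for v in vals) / (n - 1)   # sample variance
    return mean, math.sqrt(s2 / n)                      # standard error s / sqrt(n)

# Estimate E[X^2] for X ~ Uniform(0, 1); the true value is 1/3.
est, se = mc_estimate(lambda x: x * x, lambda rng: rng.random())
# A CLT-based 95% interval is est +/- 1.96 * se.
```

The justification for the $\pm 1.96\,\text{se}$ interval is exactly the CLT: the Monte Carlo error is an average of i.i.d. terms, hence approximately Gaussian.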
Mental Model
The LLN tells you that the sample mean converges. The CLT tells you how: the fluctuations around the true mean are Gaussian, with standard deviation $\sigma/\sqrt{n}$. If you zoom in on the convergence by multiplying by $\sqrt{n}$, the magnified fluctuations have a specific, universal shape --- the bell curve.
The CLT is a universality result: no matter what distribution the $X_i$ come from (as long as they have finite variance), the standardized average looks the same. This universality is why the Gaussian is ubiquitous.
Formal Setup
Convergence in Distribution
A sequence $X_n$ converges in distribution to $X$ (written $X_n \xrightarrow{d} X$) if for every continuous bounded function $f$:

$$\mathbb{E}[f(X_n)] \to \mathbb{E}[f(X)]$$

Equivalently: $F_{X_n}(x) \to F_X(x)$ at every continuity point $x$ of $F_X$.
Convergence in distribution is the weakest mode of convergence. It says the distributions of $X_n$ approach the distribution of $X$, but the random variables $X_n$ and $X$ need not live on the same probability space or be "close" in any pathwise sense.
Main Theorems
Classical Central Limit Theorem
Statement
If $X_1, X_2, \dots$ are i.i.d. with mean $\mu$ and variance $\sigma^2 < \infty$, then:

$$\sqrt{n}\,(\bar{X}_n - \mu) \xrightarrow{d} \mathcal{N}(0, \sigma^2)$$

Equivalently: $\dfrac{\bar{X}_n - \mu}{\sigma/\sqrt{n}} \xrightarrow{d} \mathcal{N}(0, 1)$.

In practical terms: for large $n$, $\bar{X}_n \approx \mathcal{N}(\mu, \sigma^2/n)$.
Intuition
Each $X_i$ contributes a random "kick" to the sum. For large $n$, the aggregate effect of many small independent kicks is Gaussian --- this is the essence of the CLT. The kicks can come from any distribution (Bernoulli, Exponential, Poisson, anything with finite variance), and the result is always the same bell curve. The only things that matter are the mean (which centers the curve) and the variance (which sets the width).
The $\sqrt{n}$ scaling is crucial: the centered sum $\sum_{i=1}^n (X_i - \mu)$ has standard deviation $\sigma\sqrt{n}$, so dividing by $\sqrt{n}$ keeps the standardized version at scale $O(1)$. This is the "zoom" that reveals the Gaussian structure.
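This zoom is easy to check by simulation. The sketch below (standard library only; the sample sizes are arbitrary choices) standardizes means of skewed Exponential(1) draws and verifies that roughly 95% of them land in $[-1.96, 1.96]$, as the Gaussian limit predicts.

```python
import random
import math

def standardized_means(n=200, reps=2000, seed=1):
    """sqrt(n)(Xbar - mu)/sigma for repeated samples of n Exponential(1) draws
    (mu = sigma = 1 for Exponential(1))."""
    rng = random.Random(seed)
    out = []
    for _ in range(reps):
        xbar = sum(rng.expovariate(1.0) for _ in range(n)) / n
        out.append(math.sqrt(n) * (xbar - 1.0))
    return out

zs = standardized_means()
frac = sum(abs(z) <= 1.96 for z in zs) / len(zs)
# By the CLT, frac should be close to 0.95 even though each X_i is skewed.
```

Swapping `expovariate` for any other finite-variance sampler leaves the result essentially unchanged --- that is the universality.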
Proof Sketch
(Via characteristic functions): Assume without loss of generality $\mu = 0$ (replace $X_i$ by $X_i - \mu$). Let $\phi(t) = \mathbb{E}[e^{itX_i}]$ be the characteristic function of $X_i$. By Taylor expansion near $t = 0$:

$$\phi(t) = 1 - \frac{\sigma^2 t^2}{2} + o(t^2)$$

The characteristic function of $S_n = \frac{1}{\sqrt{n}} \sum_{i=1}^n X_i$ is:

$$\phi_{S_n}(t) = \left[\phi\!\left(\frac{t}{\sqrt{n}}\right)\right]^n = \left[1 - \frac{\sigma^2 t^2}{2n} + o\!\left(\frac{1}{n}\right)\right]^n \to e^{-\sigma^2 t^2/2}$$

as $n \to \infty$. This is the characteristic function of $\mathcal{N}(0, \sigma^2)$. By Lévy's continuity theorem, $S_n \xrightarrow{d} \mathcal{N}(0, \sigma^2)$.
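As a numerical sanity check on this argument, one can watch $[\phi(t/\sqrt{n})]^n$ approach $e^{-\sigma^2 t^2/2}$ for a concrete distribution. Below, a Rademacher variable ($X_i = \pm 1$ with equal probability, so $\mu = 0$, $\sigma^2 = 1$, and $\phi(t) = \cos t$); a small illustrative sketch:

```python
import math

def cf_error(t, n):
    """|phi(t/sqrt(n))^n - exp(-t^2/2)| for Rademacher X_i, where phi(t) = cos t."""
    return abs(math.cos(t / math.sqrt(n)) ** n - math.exp(-t * t / 2.0))

# The gap shrinks as n grows, as the proof sketch predicts.
errs = [cf_error(1.0, n) for n in (10, 100, 1000)]
```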
Why It Matters
The CLT is the single most used theorem in statistics. Every time you compute a confidence interval, perform a hypothesis test, or report a standard error, you are (implicitly or explicitly) invoking the CLT.
In ML theory, the CLT appears in:
- Asymptotic normality of MLE: $\sqrt{n}\,(\hat{\theta}_n - \theta) \xrightarrow{d} \mathcal{N}(0, I(\theta)^{-1})$ under regularity conditions, where $I(\theta)$ is the Fisher information
- Asymptotic theory of M-estimators: any estimator defined as the minimizer of an empirical objective is asymptotically normal under smoothness conditions
- Bootstrap validity: the bootstrap works because the CLT ensures the sampling distribution is approximately Gaussian
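The first of these is easy to verify by simulation. For Bernoulli($\theta$) the MLE is the sample mean and the Fisher information is $I(\theta) = 1/(\theta(1-\theta))$, so $\sqrt{n}\,(\hat{\theta}_n - \theta)$ should have variance near $\theta(1-\theta)$. A minimal sketch (standard library only; the parameter values are arbitrary):

```python
import random
import math

def mle_variance_demo(theta=0.3, n=400, reps=3000, seed=4):
    """For Bernoulli(theta), the MLE is the sample mean; by the CLT,
    sqrt(n)(theta_hat - theta) should have variance ~ 1/I(theta) = theta(1-theta)."""
    rng = random.Random(seed)
    zs = []
    for _ in range(reps):
        theta_hat = sum(rng.random() < theta for _ in range(n)) / n
        zs.append(math.sqrt(n) * (theta_hat - theta))
    return sum(z * z for z in zs) / reps

var_hat = mle_variance_demo()
# theta(1 - theta) = 0.21 for theta = 0.3
```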
Failure Mode
The CLT requires finite variance ($\sigma^2 < \infty$). For heavy-tailed distributions with infinite variance (e.g., stable distributions with index $\alpha < 2$), the normalized sum converges to a non-Gaussian stable distribution. The CLT also gives no information about how large $n$ needs to be for the approximation to be accurate. For highly skewed or heavy-tailed distributions, the Gaussian approximation can be poor even for moderately large $n$. The Berry-Esseen bound addresses this.
Berry-Esseen Theorem
Statement
If $X_1, \dots, X_n$ are i.i.d. with $\mathbb{E}[X_i] = \mu$, $\operatorname{Var}(X_i) = \sigma^2 > 0$, and $\rho = \mathbb{E}\,|X_i - \mu|^3 < \infty$, then:

$$\sup_{x \in \mathbb{R}} \left| \mathbb{P}\!\left( \frac{\sqrt{n}\,(\bar{X}_n - \mu)}{\sigma} \le x \right) - \Phi(x) \right| \le \frac{C\,\rho}{\sigma^3 \sqrt{n}}$$

where $\Phi$ is the standard normal CDF and $C$ is a universal constant ($C \le 0.4748$, due to Shevtsova 2011).
The rate of convergence in the CLT is $O(n^{-1/2})$.
Intuition
Berry-Esseen quantifies what the CLT leaves vague: the Gaussian approximation is accurate to within $O(n^{-1/2})$ in the CDF. The constant depends on the ratio $\rho/\sigma^3$, which measures how "non-Gaussian" the original distribution is. For symmetric distributions, the constant is smaller. For highly skewed distributions, the constant is larger, and you need more samples for the Gaussian approximation to kick in.
Proof Sketch
The proof uses smoothing techniques for characteristic functions. The key steps are:
- Bound the difference of CDFs using Esseen's smoothing inequality: for any $T > 0$,
  $$\sup_x |F_n(x) - \Phi(x)| \le \frac{1}{\pi} \int_{-T}^{T} \left| \frac{\phi_n(t) - e^{-t^2/2}}{t} \right| dt + \frac{24}{\pi T \sqrt{2\pi}}$$
- Use the Taylor expansion of the characteristic function to third order to bound $|\phi_n(t) - e^{-t^2/2}|$ for $|t| \le T$
- Optimize over $T$ (taking $T$ proportional to $\sqrt{n}\,\sigma^3/\rho$) to get the $O(\rho/(\sigma^3\sqrt{n}))$ bound
Why It Matters
Berry-Esseen tells you when the CLT approximation is good enough to trust. For the Bernoulli distribution with $p = 1/2$: $\rho/\sigma^3 = 1$, and the bound gives an error of at most $0.4748/\sqrt{n}$. For $n = 25$, the error is at most 0.095 --- about a 10% error in the CDF. For $n = 2500$, the error is at most 0.0095.
The $O(n^{-1/2})$ rate is tight: it cannot be improved in general. This means that for CLT-based confidence intervals to be guaranteed accurate to 1% in the worst case (with $\rho/\sigma^3 = 1$), you need $n \gtrsim 2{,}300$. In practice, the CLT approximation is often better than the worst case, especially for symmetric distributions.
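The worst-case numbers can be compared with the true CDF error, which is computable exactly in the Bernoulli case. A sketch (standard library only) that scans both sides of every jump of the Binomial step CDF:

```python
import math

def phi_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def max_cdf_error(n, p=0.5):
    """sup_x |P((S_n - np)/sd <= x) - Phi(x)| for S_n ~ Binomial(n, p)."""
    sd = math.sqrt(n * p * (1.0 - p))
    cdf, worst = 0.0, 0.0
    for k in range(n + 1):
        x = (k - n * p) / sd
        worst = max(worst, abs(cdf - phi_cdf(x)))          # just below the jump
        cdf += math.comb(n, k) * p**k * (1.0 - p)**(n - k)
        worst = max(worst, abs(cdf - phi_cdf(x)))          # at the jump
    return worst

n = 25
actual = max_cdf_error(n)        # true worst-case CDF error, ~0.079
bound = 0.4748 / math.sqrt(n)    # Berry-Esseen bound, ~0.095 (rho/sigma^3 = 1)
```

The exact error at $n = 25$ is already below the Berry-Esseen guarantee, illustrating that the bound is conservative in practice.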
Failure Mode
Berry-Esseen requires a finite third moment. If $\mathbb{E}\,|X_i|^3 = \infty$ (but $\sigma^2 < \infty$), the CLT still holds but the rate of convergence can be slower than $O(n^{-1/2})$. Also, the Berry-Esseen bound is a uniform bound over all $x$; for tail probabilities (large $|x|$), the relative error of the Gaussian approximation can be much worse.
Multivariate CLT
Multivariate Central Limit Theorem
If $X_1, X_2, \dots \in \mathbb{R}^d$ are i.i.d. with mean $\mu \in \mathbb{R}^d$ and covariance matrix $\Sigma$ (positive definite), then:

$$\sqrt{n}\,(\bar{X}_n - \mu) \xrightarrow{d} \mathcal{N}(0, \Sigma)$$
This says: the joint distribution of the standardized sample mean vector converges to a multivariate Gaussian with the same covariance structure. The Cramér-Wold theorem provides the key tool for proving multivariate convergence: it suffices to show that the one-dimensional projections $\sqrt{n}\,a^\top(\bar{X}_n - \mu)$ converge for every fixed vector $a$, which reduces the problem to the univariate CLT.
Application: The asymptotic distribution of the MLE is $\sqrt{n}\,(\hat{\theta}_n - \theta) \xrightarrow{d} \mathcal{N}(0, I(\theta)^{-1})$, which is a direct consequence of the multivariate CLT applied to the score function $\nabla_\theta \log p(X_i; \theta)$.
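A small simulation makes the multivariate statement concrete. Below (illustrative parameters, standard library only), a 2-d example with $X = (U,\, U+V)$ for independent Exponential(1) variables $U, V$, so $\mu = (1, 2)$ and $\Sigma = \begin{pmatrix} 1 & 1 \\ 1 & 2 \end{pmatrix}$; the empirical covariance of the standardized means should approach $\Sigma$.

```python
import random
import math

def standardized_mean_cov(n=300, reps=2000, seed=2):
    """Empirical covariance of sqrt(n)(Xbar - mu) for X = (U, U + V),
    U, V independent Exponential(1): mu = (1, 2), Sigma = [[1, 1], [1, 2]]."""
    rng = random.Random(seed)
    zs = []
    for _ in range(reps):
        s1 = s2 = 0.0
        for _ in range(n):
            u, v = rng.expovariate(1.0), rng.expovariate(1.0)
            s1 += u
            s2 += u + v
        zs.append((math.sqrt(n) * (s1 / n - 1.0), math.sqrt(n) * (s2 / n - 2.0)))
    c11 = sum(a * a for a, _ in zs) / reps
    c12 = sum(a * b for a, b in zs) / reps
    c22 = sum(b * b for _, b in zs) / reps
    return c11, c12, c22

c11, c12, c22 = standardized_mean_cov()
# should approach Sigma = [[1, 1], [1, 2]]
```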
Canonical Examples
Sample mean of Exponential variables
Let $X_1, \dots, X_n \sim \text{Exponential}(\lambda)$ with $\mathbb{E}[X_i] = 1/\lambda$ and $\operatorname{Var}(X_i) = 1/\lambda^2$. By the CLT, $\bar{X}_n \approx \mathcal{N}\!\left(\frac{1}{\lambda}, \frac{1}{n\lambda^2}\right)$.
For $\lambda = 1$ and $n = 100$: $\bar{X}_n \approx \mathcal{N}(1, 0.01)$. A 95% confidence interval for $1/\lambda$ is $\bar{X}_n \pm 1.96 \times 0.1$.
The Exponential distribution is skewed (skewness = 2), so the Gaussian approximation is not perfect at moderate $n$. Berry-Esseen with $\rho/\sigma^3 = 12/e - 2 \approx 2.41$ gives an error bound of about $1.15/\sqrt{n}$: at $n = 25$ the bound is $0.23$ --- the approximation is rough. By $n = 100$, the bound drops to $0.11$, and by $n = 2500$, to $0.023$.
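For Exponential(1), the Berry-Esseen ingredients are available in closed form: $\sigma = 1$, and a short integration gives $\rho = \mathbb{E}\,|X - 1|^3 = 12/e - 2 \approx 2.41$. A two-line check of the resulting bounds:

```python
import math

# Exponential(1): sigma = 1; third absolute central moment rho = 12/e - 2
rho = 12 / math.e - 2                      # ~2.4146
bounds = {n: 0.4748 * rho / math.sqrt(n)   # Berry-Esseen: C * rho / (sigma^3 sqrt(n))
          for n in (25, 100, 2500)}
# bounds ~ {25: 0.229, 100: 0.115, 2500: 0.0229}
```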
CLT for Bernoulli: the normal approximation to the Binomial
If $X_i \sim \text{Bernoulli}(p)$, then $S_n = \sum_{i=1}^n X_i \sim \text{Binomial}(n, p)$. The CLT gives:

$$\frac{S_n - np}{\sqrt{n p (1-p)}} \xrightarrow{d} \mathcal{N}(0, 1)$$

This is the normal approximation to the Binomial. The rule of thumb $np \ge 10$ and $n(1-p) \ge 10$ ensures the approximation is reasonable.
For $n = 100$ and $p = 0.5$: $S_n \approx \mathcal{N}(50, 25)$. The probability $\mathbb{P}(S_n \ge 60) \approx \mathbb{P}(Z \ge 2) = 0.0228$. The exact Binomial probability is about $0.0284$. The CLT approximation is close but not exact for tail probabilities.
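Exact Binomial tails are cheap to compute, so the quality of the normal approximation is easy to audit. A sketch for $\mathbb{P}(S_n \ge 60)$ with $n = 100$, $p = 0.5$ (values chosen for illustration; standard library only):

```python
import math

def binom_tail(n, p, k):
    """Exact P(S_n >= k) for S_n ~ Binomial(n, p)."""
    return sum(math.comb(n, j) * p**j * (1.0 - p)**(n - j) for j in range(k, n + 1))

def clt_tail(n, p, k):
    """CLT approximation: P(S_n >= k) ~ P(Z >= (k - np)/sqrt(np(1-p)))."""
    z = (k - n * p) / math.sqrt(n * p * (1.0 - p))
    return 0.5 * (1.0 - math.erf(z / math.sqrt(2.0)))

exact = binom_tail(100, 0.5, 60)    # ~0.0284
approx = clt_tail(100, 0.5, 60)     # ~0.0228, i.e. P(Z >= 2)
# A continuity correction (evaluate at k - 0.5) closes most of the gap.
```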
Common Confusions
The CLT does NOT say individual observations are Gaussian
The CLT says the average is approximately Gaussian for large $n$. Individual observations retain their original distribution --- they could be Bernoulli, Exponential, Poisson, or anything else. The Gaussian emerges only through averaging. A single coin flip is not Gaussian; the average of a million coin flips is.
The CLT is an asymptotic result, not a finite-sample guarantee
The CLT says $\sqrt{n}\,(\bar{X}_n - \mu) \xrightarrow{d} \mathcal{N}(0, \sigma^2)$ as $n \to \infty$. It does not say the approximation is good for any specific $n$. For a highly skewed distribution, the Gaussian approximation can be terrible even at moderate $n$. Berry-Esseen gives a finite-sample bound, but even that bound can be loose. In practice, use the CLT only when $n$ is "large enough," which depends on the distribution.
Convergence in distribution does not imply convergence in probability
The CLT gives convergence in distribution: $\sqrt{n}\,(\bar{X}_n - \mu) \xrightarrow{d} Z$ where $Z \sim \mathcal{N}(0, \sigma^2)$. This does not mean $\sqrt{n}\,(\bar{X}_n - \mu) \to Z$ in probability or almost surely. The random variables $\sqrt{n}\,(\bar{X}_n - \mu)$ are determined by the data; $Z$ is a "fresh" Gaussian that has nothing to do with the data. Convergence in distribution is a statement about the shape of the distribution, not about the values of the random variables.
Finite variance is required, not just finite mean
The LLN requires only a finite mean. The CLT requires a finite variance. For distributions with finite mean but infinite variance (e.g., stable distributions with index $1 < \alpha < 2$), the sample mean still converges (by the LLN) but the CLT does not hold in its standard form. The sum, normalized by $n^{1/\alpha}$ rather than $\sqrt{n}$, converges to a non-Gaussian stable distribution instead.
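The failure is starkest at $\alpha = 1$ (the Cauchy distribution, which has no mean at all): the average of $n$ standard Cauchy draws is itself standard Cauchy, so averaging buys nothing. A simulation sketch (standard library only; parameters are illustrative):

```python
import random
import math

def cauchy_mean_iqr(n, reps=2000, seed=3):
    """Interquartile range of sample means of n standard Cauchy draws."""
    rng = random.Random(seed)
    means = []
    for _ in range(reps):
        # standard Cauchy via inverse CDF: tan(pi * (U - 1/2))
        xs = (math.tan(math.pi * (rng.random() - 0.5)) for _ in range(n))
        means.append(sum(xs) / n)
    means.sort()
    return means[3 * reps // 4] - means[reps // 4]

# The spread of the sample mean does NOT shrink with n,
# unlike the 1/sqrt(n) scaling the CLT would give.
spread_10 = cauchy_mean_iqr(10)
spread_1000 = cauchy_mean_iqr(1000)
```

Both spreads hover near the Cauchy interquartile range of 2, regardless of $n$.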
Summary
- CLT: $\sqrt{n}\,(\bar{X}_n - \mu) \xrightarrow{d} \mathcal{N}(0, \sigma^2)$ for i.i.d. variables with finite variance
- The CLT gives the rate of LLN convergence: fluctuations are of order $\sigma/\sqrt{n}$
- Berry-Esseen: the CDF error is at most $C\rho/(\sigma^3\sqrt{n})$ --- an $O(n^{-1/2})$ rate with constant depending on the third moment
- Multivariate CLT: $\sqrt{n}\,(\bar{X}_n - \mu) \xrightarrow{d} \mathcal{N}(0, \Sigma)$
- CLT explains: asymptotic normality of MLE, confidence intervals, hypothesis tests, and why the Gaussian is everywhere
- CLT does not say individual observations are Gaussian --- only the average is
- Finite variance is required; infinite variance gives non-Gaussian limits
Exercises
Problem
A factory produces bolts whose lengths have mean $\mu$ cm and standard deviation $\sigma$ cm (the distribution of lengths is unknown). You measure $n$ bolts and compute the sample mean $\bar{X}_n$. Using the CLT, find an interval that contains $\bar{X}_n$ with approximately 95% probability.
Problem
Let $X_1, \dots, X_n \sim \text{Bernoulli}(p)$ with $p$ unknown. The MLE is $\hat{p} = \bar{X}_n$. Use the CLT to derive the asymptotic distribution of $\hat{p}$ and construct an asymptotic 95% confidence interval for $p$. Why must you use $\hat{p}(1-\hat{p})$ in the variance estimate, and what does the delta method give for the log-odds $\log\frac{p}{1-p}$?
Problem
The CLT assumes finite variance. What happens to the normalized sum $\sqrt{n}\,(\bar{X}_n - \mu)$ when $X_i$ has a symmetric stable distribution with index $\alpha \in (1, 2)$? Why does this matter for financial data, and how does it affect standard ML algorithms that assume sub-Gaussian tails?
References
Canonical:
- Durrett, Probability: Theory and Examples (5th ed., 2019), Section 3.4
- Billingsley, Probability and Measure (3rd ed., 1995), Sections 27-28
- Feller, An Introduction to Probability Theory and Its Applications Vol. 2 (1971), Chapter XVI
Current:
- Vershynin, High-Dimensional Probability (2018), Section 2.7
- van der Vaart, Asymptotic Statistics (1998), Chapters 2-3
- Casella & Berger, Statistical Inference (2002), Chapters 5-10
Next Topics
Building on the central limit theorem:
- Maximum likelihood estimation: asymptotic normality of the MLE as a CLT consequence
- Hypothesis testing for ML: test statistics and p-values via the Gaussian approximation
- Concentration inequalities: non-asymptotic, finite-sample alternatives to the CLT
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Law of Large Numbers (Layer 0B)
- Common Probability Distributions (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
Builds on This
- Asymptotic Statistics (Layer 0B)
- Cramér-Wold Theorem (Layer 1)