
Statistical Estimation

Asymptotic Statistics

The large-sample toolbox: delta method, Slutsky's theorem, asymptotic normality of MLE, local asymptotic normality, and Fisher efficiency. These results justify nearly every confidence interval and hypothesis test used in practice.

Advanced · Tier 3 · Stable · ~65 min

Why This Matters

Almost every confidence interval, standard error, and $p$-value computed in practice relies on asymptotic theory. When you report $\hat\theta \pm 1.96 \cdot \text{SE}$, you are implicitly invoking the asymptotic normality of your estimator. When you use the likelihood ratio test, you are relying on its asymptotic $\chi^2$ distribution.

Asymptotic statistics is not just about letting $n \to \infty$. It provides finite-sample approximations that are often remarkably accurate, and it tells you the fundamental limits of estimation via Fisher information and efficiency.

Mental Model

The central limit theorem says that sample means are approximately normal for large $n$. Asymptotic statistics extends this in two directions. First, the delta method lets you transfer normality through smooth transformations: if $\bar{X}$ is approximately normal, then $g(\bar{X})$ is too, with a variance you can compute. Second, maximum likelihood estimators inherit CLT-like normality under regularity conditions, achieving the best possible variance (the Cramér-Rao bound) in the limit.

Formal Setup and Notation

We write $X_n \xrightarrow{d} X$ for convergence in distribution, $X_n \xrightarrow{p} c$ for convergence in probability, and $X_n = O_p(a_n)$ to mean $X_n / a_n$ is bounded in probability.

Main Theorems

Theorem

Delta Method

Statement

If $\sqrt{n}(X_n - \mu) \xrightarrow{d} N(0, \sigma^2)$ and $g$ is differentiable at $\mu$ with $g'(\mu) \neq 0$, then:

$$\sqrt{n}(g(X_n) - g(\mu)) \xrightarrow{d} N(0, \sigma^2 [g'(\mu)]^2)$$

Intuition

A smooth function of an approximately normal random variable is itself approximately normal. The variance gets multiplied by the squared derivative because $g$ locally looks like a linear function with slope $g'(\mu)$.

Proof Sketch

Taylor expand $g(X_n) \approx g(\mu) + g'(\mu)(X_n - \mu)$. Then $\sqrt{n}(g(X_n) - g(\mu)) \approx g'(\mu) \cdot \sqrt{n}(X_n - \mu)$. The remainder term is $o_p(1)$ because $X_n \xrightarrow{p} \mu$ and $g''$ is locally bounded. Apply Slutsky's theorem to conclude.

Why It Matters

The delta method is the workhorse of applied statistics. Need the variance of $\log(\hat{p})$? Of $1/\bar{X}$? Of $\hat{p}_1 / \hat{p}_2$? The delta method gives you the answer immediately without simulation. It is also the basis for constructing confidence intervals on transformed parameters.
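
As a concrete check, the exercise below computes a delta-method standard error for the log-odds $\log(\hat{p}/(1-\hat{p}))$ and compares it against brute-force simulation. This is a minimal NumPy sketch; the sample size, success probability, and seed are illustrative choices, not values from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 2000, 0.3  # illustrative sample size and true probability

# Delta method for the log-odds g(p) = log(p / (1 - p)):
# g'(p) = 1 / (p (1 - p)), so
# Var(g(p_hat)) ≈ [g'(p)]^2 * p(1 - p)/n = 1 / (n p (1 - p)).
se_delta = np.sqrt(1.0 / (n * p * (1 - p)))

# Monte Carlo check: simulate many samples and compute the empirical sd.
p_hat = rng.binomial(n, p, size=20000) / n
se_mc = np.log(p_hat / (1 - p_hat)).std()

print(se_delta, se_mc)  # the two should agree closely
```

The analytic answer takes one line; the simulation exists only to confirm it.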

Failure Mode

When $g'(\mu) = 0$, the first-order delta method gives a degenerate limit. You need the second-order delta method: if $g'(\mu) = 0$ and $g''(\mu) \neq 0$, then $n(g(X_n) - g(\mu)) \xrightarrow{d} \frac{1}{2}\sigma^2 g''(\mu) \chi^2_1$. The rate changes from $\sqrt{n}$ to $n$ and the limit is no longer normal.
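
This failure mode is easy to see numerically with $g(x) = x^2$ at $\mu = 0$: since $\bar{X} \sim N(0, 1/n)$ exactly for standard normal data, $n\bar{X}^2$ is exactly $\chi^2_1$. A small sketch under those assumptions (sizes and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
n, reps = 200, 20000

# g(x) = x^2 at mu = 0: g'(0) = 0, so the first-order delta method degenerates.
# Second-order prediction: n * (Xbar^2 - 0) -> (1/2) * sigma^2 * g''(0) * chi2_1
# = chi2_1 here (sigma^2 = 1, g''(0) = 2).
xbar = rng.standard_normal((reps, n)).mean(axis=1)
scaled = n * xbar**2

print(scaled.mean(), scaled.var())  # chi2_1 has mean 1 and variance 2
```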

Theorem

Slutsky's Theorem

Statement

If $X_n \xrightarrow{d} X$ and $Y_n \xrightarrow{p} c$ (a constant), then:

$$X_n + Y_n \xrightarrow{d} X + c$$

$$X_n \cdot Y_n \xrightarrow{d} c \cdot X$$

$$X_n / Y_n \xrightarrow{d} X / c \quad \text{(when } c \neq 0\text{)}$$

Intuition

A sequence converging in probability to a constant behaves like a constant in the limit. You can add, multiply, or divide by it without disrupting distributional convergence. The key is that $Y_n$ must converge to a constant, not a random variable.

Proof Sketch

Write $(X_n, Y_n)$ on a product space. Since $Y_n \xrightarrow{p} c$, the pair $(X_n, Y_n) \xrightarrow{d} (X, c)$ by a standard coupling argument. The continuous mapping theorem then gives $f(X_n, Y_n) \xrightarrow{d} f(X, c)$ for continuous $f$.

Why It Matters

Slutsky's theorem is the glue that holds asymptotic arguments together. Every time you replace $\sigma$ with $\hat\sigma$ in a $z$-statistic and claim the limit is still standard normal, you are using Slutsky.
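
The $\hat\sigma$-for-$\sigma$ substitution can be checked by simulation. Below, skewed data with unknown variance are studentized with the sample standard deviation; by the CLT plus Slutsky the statistic is still approximately $N(0,1)$, so about 95% of replications should fall inside $\pm 1.96$. The distribution, sample size, and seed are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(2)
n, reps = 200, 10000

# Skewed data with unknown variance: exponential with mean 2.
x = rng.exponential(scale=2.0, size=(reps, n))

# Studentized mean: the CLT gives normality with the true sigma;
# Slutsky lets us plug in sigma_hat ->p sigma without changing the N(0,1) limit.
t = np.sqrt(n) * (x.mean(axis=1) - 2.0) / x.std(axis=1, ddof=1)

coverage = np.mean(np.abs(t) < 1.96)
print(coverage)  # should be close to 0.95
```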

Failure Mode

Slutsky fails if $Y_n$ converges in distribution to a non-degenerate random variable. In that case, you need joint convergence $(X_n, Y_n) \xrightarrow{d} (X, Y)$ and the continuous mapping theorem.

Theorem

Asymptotic Normality of MLE

Statement

Under regularity conditions, the maximum likelihood estimator $\hat\theta_n$ satisfies:

$$\sqrt{n}(\hat\theta_n - \theta_0) \xrightarrow{d} N(0, I(\theta_0)^{-1})$$

where $I(\theta_0) = -\mathbb{E}[\nabla^2 \log f(X; \theta_0)]$ is the Fisher information matrix.

Intuition

The MLE is approximately normal, centered at the truth, with variance equal to the inverse Fisher information divided by $n$. This is the best variance any regular estimator can achieve (the Cramér-Rao bound), so the MLE is asymptotically efficient.

Proof Sketch

Expand the score equation $\nabla \ell_n(\hat\theta_n) = 0$ around $\theta_0$: $0 = \nabla \ell_n(\theta_0) + \nabla^2 \ell_n(\tilde\theta)(\hat\theta_n - \theta_0)$. By the CLT, $\frac{1}{\sqrt{n}}\nabla \ell_n(\theta_0) \xrightarrow{d} N(0, I(\theta_0))$. By the LLN, $\frac{1}{n}\nabla^2 \ell_n(\tilde\theta) \xrightarrow{p} -I(\theta_0)$. Combine via Slutsky to get the result.

Why It Matters

This theorem justifies the standard practice of reporting MLE point estimates with standard errors computed from the observed Fisher information. It is the theoretical backbone of likelihood-based inference.
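
To see the variance prediction in action, the sketch below uses the Poisson model, where the MLE is $\hat\lambda = \bar{X}$ and $I(\lambda) = 1/\lambda$, so the theorem predicts $\text{sd}(\hat\lambda) \approx \sqrt{\lambda/n}$. Parameter values and the seed are illustrative choices, not from the text.

```python
import numpy as np

rng = np.random.default_rng(3)
n, lam, reps = 400, 3.0, 10000

# Poisson MLE: lam_hat = Xbar, with I(lam) = 1/lam, so the
# asymptotic-normality theorem predicts sd(lam_hat) ≈ sqrt(lam / n).
lam_hat = rng.poisson(lam, size=(reps, n)).mean(axis=1)

se_theory = np.sqrt(lam / n)
se_empirical = lam_hat.std()
print(se_theory, se_empirical)  # should agree closely
```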

Failure Mode

Regularity conditions fail at boundary parameters (e.g., variance $= 0$), in non-identifiable models (e.g., mixture models with an unknown number of components), and in models where the support depends on the parameter (e.g., $\text{Uniform}(0, \theta)$). In these cases the MLE may have non-normal limits or converge at non-$\sqrt{n}$ rates.

Proposition

Local Asymptotic Normality

Statement

Under regularity conditions, the log-likelihood ratio admits the expansion:

$$\log \frac{L_n(\theta_0 + h/\sqrt{n})}{L_n(\theta_0)} = h^\top \Delta_n - \frac{1}{2} h^\top I(\theta_0) h + o_p(1)$$

where $\Delta_n = \frac{1}{\sqrt{n}} \sum_{i=1}^n \nabla \log f(X_i; \theta_0) \xrightarrow{d} N(0, I(\theta_0))$.

Intuition

At the local scale $h/\sqrt{n}$ around the truth, the statistical experiment looks like a Gaussian shift experiment. The sufficient statistic is $\Delta_n$ (the normalized score), and the Fisher information determines the signal strength. This is the deepest structural result in parametric statistics.

Proof Sketch

Taylor expand the log-likelihood ratio to second order. The first-order term gives $h^\top \Delta_n$ and the second-order term gives $-\frac{1}{2}h^\top I(\theta_0) h$ after applying the law of large numbers to the Hessian. Higher-order terms vanish in probability.
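
For the $N(\theta, 1)$ model the expansion holds with zero remainder: the log-likelihood is exactly quadratic and $I(\theta) = 1$, so $\log L_n(\theta_0 + h/\sqrt{n})/L_n(\theta_0) = h\Delta_n - h^2/2$ identically. A sketch verifying this (values of $n$, $\theta_0$, $h$, and the seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(4)
n, theta0, h = 100, 0.5, 1.3

x = rng.normal(theta0, 1.0, size=n)

def loglik(theta):
    # N(theta, 1) log-likelihood, additive constants dropped
    return -0.5 * np.sum((x - theta) ** 2)

# LAN expansion: log LR = h * Delta_n - h^2 * I / 2 + o_p(1), with I = 1 here.
log_lr = loglik(theta0 + h / np.sqrt(n)) - loglik(theta0)
delta_n = np.sum(x - theta0) / np.sqrt(n)  # normalized score
lan_approx = h * delta_n - 0.5 * h**2      # remainder is exactly 0 for this model

print(log_lr, lan_approx)
```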

Why It Matters

LAN shows that, asymptotically, all regular parametric problems reduce to Gaussian location problems. This unifies the theory of efficient estimation and optimal testing. It also implies the asymptotic minimax lower bound: no regular estimator can beat the Fisher information bound.

Failure Mode

LAN fails for non-regular models where the Fisher information is zero or infinite, and for semiparametric or nonparametric models without finite-dimensional sufficient statistics.

Core Definitions

Definition

Contiguity

A sequence of probability measures $Q_n$ is contiguous with respect to $P_n$ if $P_n(A_n) \to 0$ implies $Q_n(A_n) \to 0$ for every sequence of measurable sets $A_n$. In the LAN framework, $P_{\theta_0 + h/\sqrt{n}}$ and $P_{\theta_0}$ are mutually contiguous. This means tests that are consistent under $\theta_0$ remain well-behaved under local alternatives.

Definition

Asymptotic Relative Efficiency

The asymptotic relative efficiency of estimator $T_n$ relative to $S_n$ is:

$$\text{ARE}(T_n, S_n) = \frac{\text{Var}_{\text{asy}}(S_n)}{\text{Var}_{\text{asy}}(T_n)}$$

If $\text{ARE} > 1$, then $T_n$ is more efficient. The MLE attains the Cramér-Rao bound asymptotically, so no regular estimator has ARE greater than 1 relative to it.
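
The textbook example is median versus mean for normal data, where $\text{ARE}(\text{median}, \text{mean}) = 2/\pi \approx 0.637$. The sketch below estimates the ratio by simulation; sample size, replication count, and seed are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(5)
n, reps = 401, 10000

x = rng.standard_normal((reps, n))
var_mean = x.mean(axis=1).var()
var_median = np.median(x, axis=1).var()

# For normal data, ARE(median, mean) = 2/pi ≈ 0.637: the median needs
# roughly 57% more observations to match the mean's precision.
print(var_mean / var_median)
```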

Canonical Examples

Example

Delta method: variance-stabilizing transform for Poisson

If $X_1, \ldots, X_n \sim \text{Poisson}(\lambda)$, then $\sqrt{n}(\bar{X} - \lambda) \xrightarrow{d} N(0, \lambda)$. The variance depends on $\lambda$. Apply the delta method with $g(x) = \sqrt{x}$: $\sqrt{n}(\sqrt{\bar{X}} - \sqrt{\lambda}) \xrightarrow{d} N(0, 1/4)$. The variance $1/4$ is now constant, independent of $\lambda$. This is the variance-stabilizing transformation.
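
A quick simulation confirms the stabilization: $n \cdot \text{Var}(\sqrt{\bar{X}})$ should sit near $1/4$ across very different values of $\lambda$. This is a sketch with arbitrary sizes, rates, and seed.

```python
import numpy as np

rng = np.random.default_rng(6)
n, reps = 400, 10000

# After the square-root transform, n * Var(sqrt(Xbar)) ≈ 1/4 for every lambda.
stabilized = {}
for lam in [1.0, 5.0, 25.0]:
    xbar = rng.poisson(lam, size=(reps, n)).mean(axis=1)
    stabilized[lam] = n * np.sqrt(xbar).var()

print(stabilized)  # all three values should be near 0.25
```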

Example

MLE for exponential rate

If $X_1, \ldots, X_n \sim \text{Exp}(\lambda)$, the MLE is $\hat\lambda = 1/\bar{X}$. The Fisher information is $I(\lambda) = 1/\lambda^2$. By asymptotic normality, $\sqrt{n}(\hat\lambda - \lambda) \xrightarrow{d} N(0, \lambda^2)$. A 95% confidence interval is $\hat\lambda \pm 1.96\,\hat\lambda/\sqrt{n}$.
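
The coverage of this Wald interval can be checked directly. In the sketch below (illustrative $n$, $\lambda$, and seed), the fraction of simulated intervals containing the true rate should land near the nominal 95%.

```python
import numpy as np

rng = np.random.default_rng(7)
n, lam, reps = 300, 2.0, 10000

x = rng.exponential(scale=1.0 / lam, size=(reps, n))
lam_hat = 1.0 / x.mean(axis=1)

# Wald interval from the asymptotic variance lam^2 / n, with lam_hat plugged in.
half_width = 1.96 * lam_hat / np.sqrt(n)
coverage = np.mean((lam_hat - half_width <= lam) & (lam <= lam_hat + half_width))
print(coverage)  # should be close to the nominal 0.95
```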

Common Confusions

Watch Out

Asymptotic normality does not mean the estimator is normal

The MLE is approximately normal for large $n$. For small $n$, the actual distribution can be heavily skewed. Bootstrap or exact methods may be needed for small samples. The asymptotic approximation is a tool, not a fact about the estimator's distribution.

Watch Out

Efficiency is an asymptotic concept

An estimator can be asymptotically efficient yet perform poorly in small samples. Conversely, a biased estimator like James-Stein can dominate the MLE in finite samples while being asymptotically equivalent.

Summary

  • Delta method: $\text{Var}(g(X_n)) \approx [g'(\mu)]^2 \text{Var}(X_n)$
  • Slutsky: convergence in probability to a constant acts like a constant
  • MLE is asymptotically normal with variance $I(\theta_0)^{-1}/n$
  • LAN: locally, all regular parametric problems look Gaussian
  • Fisher information sets the fundamental efficiency limit

Exercises

ExerciseCore

Problem

Let $X_1, \ldots, X_n \sim \text{Bernoulli}(p)$ and let $\hat{p} = \bar{X}$. Use the delta method to find the asymptotic distribution of $\log(\hat{p}/(1-\hat{p}))$, the log-odds.

ExerciseAdvanced

Problem

Explain why the asymptotic normality of the MLE fails for $X_1, \ldots, X_n \sim \text{Uniform}(0, \theta)$. What is the actual rate of convergence of $\hat\theta_{\text{MLE}} = X_{(n)}$?

ExerciseResearch

Problem

In the LAN expansion, the experiment at local scale $h/\sqrt{n}$ looks like observing $\Delta_n \sim N(I(\theta_0) h, I(\theta_0))$. Show that the optimal estimator of $h$ in this Gaussian experiment is $\hat{h} = I(\theta_0)^{-1} \Delta_n$ and that its risk equals $I(\theta_0)^{-1}$, recovering the Cramér-Rao bound.

References

Canonical:

  • van der Vaart, Asymptotic Statistics (1998), Chapters 2-7
  • Lehmann & Casella, Theory of Point Estimation (1998), Chapter 6

Current:

  • Wasserman, All of Statistics (2004), Chapters 9-10

  • Keener, Theoretical Statistics (2010), Chapters 7-9

Next Topics

Natural extensions from asymptotic statistics:

  • Semiparametric efficiency: efficiency theory when nuisance parameters are infinite-dimensional
  • Bootstrap methods: resampling as an alternative to asymptotic approximation
  • Higher-order asymptotics: Edgeworth expansions and Bartlett corrections

Last reviewed: April 2026

Prerequisites

Foundations this topic depends on.