Statistical Estimation
Maximum Likelihood Estimation
MLE: find the parameter that maximizes the likelihood of observed data. Consistency, asymptotic normality, Fisher information, Cramér-Rao efficiency, and when MLE fails.
Why This Matters
Maximum likelihood estimation is the most widely used method for fitting parametric models to data. When you train a logistic regression by minimizing cross-entropy loss, you are doing MLE. When you fit a Gaussian mixture model, you are doing MLE (via EM). When you train a language model by minimizing perplexity, you are doing MLE.
Understanding MLE theory answers three fundamental questions:
- Does it work? Consistency: as $n \to \infty$, the MLE converges to the true parameter.
- How accurate is it? Asymptotic normality: the MLE is approximately Gaussian with variance $\frac{1}{n} I(\theta_0)^{-1}$, where $I(\theta)$ is the Fisher information.
- Can you do better? The Cramér-Rao bound says no regular unbiased estimator can have smaller variance than $\frac{1}{n} I(\theta)^{-1}$. MLE achieves this bound asymptotically: it is efficient.
Mental Model
You observe data $X_1, \dots, X_n$ drawn from a distribution $p_\theta$ with unknown parameter $\theta$. The MLE asks: which parameter makes the observed data most likely?
Formally, you maximize the likelihood function $L_n(\theta) = \prod_{i=1}^n p_\theta(X_i)$, or equivalently minimize the negative log-likelihood $-\ell_n(\theta) = -\sum_{i=1}^n \log p_\theta(X_i)$.
The intuition: among all possible parameter values, pick the one under which your data looks the least surprising. The log-likelihood turns products into sums, making optimization tractable and connecting MLE to information theory (the negative average log-likelihood is the KL divergence from the empirical distribution to the model, up to a constant).
Formal Setup and Notation
Let $X_1, \dots, X_n$ be i.i.d. from $p_{\theta_0}$, where $\theta_0 \in \Theta \subseteq \mathbb{R}^d$. The parametric family $\{p_\theta : \theta \in \Theta\}$ is the statistical model.
Likelihood Function
The likelihood function is:
$$L_n(\theta) = \prod_{i=1}^n p_\theta(X_i)$$
It is a function of $\theta$, not of the data (the data are fixed). Despite looking like a joint density, $L_n$ is not a probability distribution over $\theta$; it is not normalized and in general does not integrate to 1.
Log-Likelihood
The log-likelihood is:
$$\ell_n(\theta) = \log L_n(\theta) = \sum_{i=1}^n \log p_\theta(X_i)$$
The log transform converts products to sums, which is essential for both computation (numerical stability) and theory (sums of i.i.d. terms are amenable to the law of large numbers and CLT).
Maximum Likelihood Estimator
The maximum likelihood estimator is:
$$\hat\theta_n = \arg\max_{\theta \in \Theta} \ell_n(\theta)$$
Equivalently, it minimizes the negative log-likelihood. When the model is well-specified ($p^* = p_{\theta_0}$ for some $\theta_0 \in \Theta$), MLE minimizes the empirical KL divergence from the data distribution to the model.
Core Definitions
Score Function
The score function is the gradient of the log-likelihood with respect to $\theta$:
$$s(\theta; x) = \nabla_\theta \log p_\theta(x)$$
Key property: $\mathbb{E}_\theta[s(\theta; X)] = 0$. The score has mean zero at the true parameter. This follows from differentiating the identity $\int p_\theta(x)\,dx = 1$ with respect to $\theta$.
The MLE satisfies $\sum_{i=1}^n s(\hat\theta_n; X_i) = 0$ (the score equation), which is the first-order optimality condition for the log-likelihood.
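As a quick numerical sanity check (a toy example of our own, not from the text: exponential data with rate 2), the score equation can be solved in closed form, and the total score at the MLE is numerically zero:

```python
import random

# Exponential model p_lambda(x) = lambda * exp(-lambda * x).
# The score equation sum_i (1/lambda - x_i) = 0 gives lambda_hat = 1 / xbar.
random.seed(0)
data = [random.expovariate(2.0) for _ in range(10_000)]  # true lambda = 2

xbar = sum(data) / len(data)
lam_hat = 1.0 / xbar  # MLE from solving the score equation

# Score of one observation: d/dlambda log p = 1/lambda - x.
total_score = sum(1.0 / lam_hat - x for x in data)

print(f"MLE lambda_hat = {lam_hat:.3f}")      # close to 2
print(f"score at MLE   = {total_score:.2e}")  # numerically ~ 0
```

The first-order condition holds exactly at the closed-form solution; the residual is pure floating-point noise.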
Fisher Information
The Fisher information at $\theta$ is:
$$I(\theta) = \mathbb{E}_\theta\!\left[s(\theta; X)\, s(\theta; X)^\top\right]$$
Under regularity conditions, this equals:
$$I(\theta) = -\mathbb{E}_\theta\!\left[\nabla_\theta^2 \log p_\theta(X)\right]$$
In the scalar case ($d = 1$): $I(\theta) = \mathrm{Var}_\theta\big(s(\theta; X)\big) = -\mathbb{E}_\theta\big[\partial_\theta^2 \log p_\theta(X)\big]$.
The Fisher information measures how much information each observation carries about $\theta$. High Fisher information means the likelihood is sharply peaked, so $\theta$ is easy to estimate. Low Fisher information means the likelihood is flat, so $\theta$ is hard to pin down.
Observed Fisher Information
The observed Fisher information is:
$$J_n(\theta) = -\sum_{i=1}^n \nabla_\theta^2 \log p_\theta(X_i)$$
This is the data-dependent (random) version of the Fisher information. By the law of large numbers, $\frac{1}{n} J_n(\theta) \xrightarrow{p} I(\theta)$ as $n \to \infty$. In practice, you use $J_n(\hat\theta_n)$ to estimate $n I(\theta_0)$ for confidence intervals.
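A small simulation (our own setup: a Gaussian with mean 0 and variance parameter v, for which the Fisher information is $I(v) = 1/(2v^2)$) illustrates the convergence of the averaged observed information to the Fisher information:

```python
import random

# For X ~ N(0, v), log p = -0.5*log(2*pi*v) - x^2/(2v), so the per-observation
# observed information is -d^2/dv^2 log p = x^2 / v^3 - 1 / (2 v^2).
# Its average over the sample converges to I(v) = 1 / (2 v^2) by the LLN.
random.seed(1)
v_true = 2.0
n = 100_000
data = [random.gauss(0.0, v_true ** 0.5) for _ in range(n)]

obs_info = sum(x * x / v_true**3 - 1.0 / (2 * v_true**2) for x in data) / n
fisher = 1.0 / (2 * v_true**2)

print(f"averaged observed information = {obs_info:.4f}")  # close to 0.125
print(f"Fisher information I(v)       = {fisher:.4f}")    # exactly 0.125
```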
Main Theorems
Regularity Conditions (Cramér's Conditions)
The asymptotic results below (consistency, asymptotic normality, efficiency) require a standard set of regularity conditions on the parametric family $\{p_\theta : \theta \in \Theta\}$. The canonical list (van der Vaart 1998, Ch. 5; Lehmann and Casella 1998, Ch. 6):
- Identifiability: $\theta \ne \theta' \implies p_\theta \ne p_{\theta'}$.
- Common support: the support of $p_\theta$ does not depend on $\theta$.
- Interior: the true parameter $\theta_0$ lies in the interior of the parameter space $\Theta$.
- Smoothness: $\log p_\theta(x)$ is three times continuously differentiable in $\theta$, with third derivatives dominated by an integrable function.
- Interchange: differentiation under the integral sign is valid (so score and Fisher information identities hold).
- Fisher information: $I(\theta)$ is finite and positive definite at $\theta_0$.
When any of these fails, the asymptotic distribution of the MLE can change or cease to exist. Boundary cases, non-identifiable models, and families with parameter-dependent support are the most common failure scenarios.
Consistency of MLE
Statement
Under regularity conditions (identifiability, compactness of $\Theta$, and continuity of $\log p_\theta(x)$ in $\theta$), the MLE is consistent:
$$\hat\theta_n \xrightarrow{p} \theta_0$$
as $n \to \infty$.
Intuition
The normalized log-likelihood $\frac{1}{n}\ell_n(\theta)$ converges to $\mathbb{E}_{\theta_0}[\log p_\theta(X)]$ by the law of large numbers. This expected value is maximized uniquely at $\theta_0$ by the information inequality (Gibbs' inequality): for any $\theta \ne \theta_0$,
$$\mathbb{E}_{\theta_0}[\log p_\theta(X)] < \mathbb{E}_{\theta_0}[\log p_{\theta_0}(X)]$$
because $\mathrm{KL}(p_{\theta_0} \,\|\, p_\theta) > 0$. So the limiting objective has a unique maximum at $\theta_0$, and the maximizer of the empirical version converges to it.
Proof Sketch
Step 1: By the law of large numbers, $\frac{1}{n}\ell_n(\theta) \xrightarrow{p} M(\theta) := \mathbb{E}_{\theta_0}[\log p_\theta(X)]$ for each $\theta$.
Step 2: Under compactness and continuity, this convergence is uniform in $\theta$: $\sup_{\theta \in \Theta} \left|\frac{1}{n}\ell_n(\theta) - M(\theta)\right| \xrightarrow{p} 0$.
Step 3: The limit $M(\theta)$ has a unique maximizer at $\theta_0$ (by Gibbs' inequality / KL positivity).
Step 4: By a standard argument (maximizers of uniformly converging functions converge to the maximizer of the limit), $\hat\theta_n \xrightarrow{p} \theta_0$.
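Consistency is easy to see in a simulation (our own choice of model: Bernoulli with true $p = 0.3$, where the MLE is the sample proportion):

```python
import random

# The Bernoulli MLE p_hat = sample proportion approaches the true p as n grows.
random.seed(2)
p_true = 0.3

errors = {}
for n in [100, 10_000, 1_000_000]:
    phat = sum(random.random() < p_true for _ in range(n)) / n
    errors[n] = abs(phat - p_true)
    print(f"n = {n:>9,}  p_hat = {phat:.4f}  error = {errors[n]:.4f}")
```

The error shrinks at the expected $O(1/\sqrt{n})$ rate.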
Why It Matters
Consistency is the minimal requirement for any estimator: with enough data, you recover the truth. MLE achieves this under mild conditions. Consistency requires the model to be identifiable: different parameters must give different distributions. If the model is overparameterized (as in neural networks), the MLE is not unique, and consistency of the parameter vector fails (though the fitted distribution may still converge).
Failure Mode
Consistency can fail if: (1) the model is not identifiable (e.g., mixture models with label switching), (2) the parameter space is not compact and the MLE drifts to the boundary (e.g., the variance estimate in a mixture with a component collapsing to a point), or (3) the number of parameters grows with $n$ (the "Neyman-Scott problem": with $n$ means and one shared variance, the MLE of the variance is inconsistent).
Asymptotic Normality of MLE
Statement
Under regularity conditions, the MLE is asymptotically normal:
$$\sqrt{n}\,(\hat\theta_n - \theta_0) \xrightarrow{d} N\!\left(0,\; I(\theta_0)^{-1}\right)$$
Equivalently, for large $n$:
$$\hat\theta_n \approx N\!\left(\theta_0,\; \tfrac{1}{n} I(\theta_0)^{-1}\right)$$
The variance $\frac{1}{n} I(\theta_0)^{-1}$ is the Cramér-Rao lower bound, achieved by the MLE in the limit.
Intuition
The log-likelihood is a sum of i.i.d. terms. Near $\theta_0$ (scalar case for simplicity), a Taylor expansion of the score gives:
$$\ell_n'(\hat\theta_n) \approx \ell_n'(\theta_0) + \ell_n''(\theta_0)(\hat\theta_n - \theta_0)$$
Setting $\ell_n'(\hat\theta_n) = 0$ and solving: $\hat\theta_n - \theta_0 \approx -\ell_n'(\theta_0) / \ell_n''(\theta_0)$.
The numerator $\frac{1}{\sqrt{n}}\ell_n'(\theta_0)$ is a normalized sum of i.i.d. mean-zero terms with variance $I(\theta_0)$, so by the CLT: $\frac{1}{\sqrt{n}}\ell_n'(\theta_0) \xrightarrow{d} N(0, I(\theta_0))$.
The denominator satisfies $\frac{1}{n}\ell_n''(\theta_0) \xrightarrow{p} -I(\theta_0)$ by the LLN.
Combining: $\sqrt{n}(\hat\theta_n - \theta_0) \xrightarrow{d} N\!\left(0,\; I(\theta_0)^{-1}\right)$.
Proof Sketch
Step 1 (Taylor expand). Expand the score at $\hat\theta_n$ around $\theta_0$: $0 = \ell_n'(\hat\theta_n) = \ell_n'(\theta_0) + \ell_n''(\tilde\theta_n)(\hat\theta_n - \theta_0)$ for some $\tilde\theta_n$ between $\hat\theta_n$ and $\theta_0$.
Step 2 (Solve). Rearrange: $\sqrt{n}\,(\hat\theta_n - \theta_0) = -\dfrac{\frac{1}{\sqrt{n}}\ell_n'(\theta_0)}{\frac{1}{n}\ell_n''(\tilde\theta_n)}$.
Step 3 (CLT + LLN). By the CLT: $\frac{1}{\sqrt{n}}\ell_n'(\theta_0) \xrightarrow{d} N(0, I(\theta_0))$. By the LLN + consistency: $\frac{1}{n}\ell_n''(\tilde\theta_n) \xrightarrow{p} -I(\theta_0)$.
Step 4 (Slutsky). By Slutsky's theorem: $\sqrt{n}\,(\hat\theta_n - \theta_0) \xrightarrow{d} N(0, I(\theta_0)^{-1})$.
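The limiting variance can be checked by Monte Carlo (our own setup: Bernoulli with $p = 0.3$, where $I(p)^{-1} = p(1-p) = 0.21$):

```python
import math
import random

# Draw many replications of sqrt(n) * (p_hat - p) and check that its
# empirical variance is near I(p)^{-1} = p(1-p).
random.seed(3)
p, n, reps = 0.3, 500, 2000

scaled = []
for _ in range(reps):
    phat = sum(random.random() < p for _ in range(n)) / n
    scaled.append(math.sqrt(n) * (phat - p))

emp_mean = sum(scaled) / reps
emp_var = sum((z - emp_mean) ** 2 for z in scaled) / reps
print(f"empirical variance of sqrt(n)(p_hat - p): {emp_var:.3f}")  # near 0.21
```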
Why It Matters
Asymptotic normality is the basis for:
- Confidence intervals: $\hat\theta_n \pm z_{\alpha/2}\,\big(n I(\hat\theta_n)\big)^{-1/2}$ in the scalar case
- Hypothesis tests: Wald test, likelihood ratio test, score test
- Model comparison: AIC, BIC (which penalize the number of parameters using the asymptotic distribution of the log-likelihood)
It also shows that MLE is asymptotically efficient: its variance matches the Cramér-Rao lower bound.
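For instance, a 95% Wald interval built from the asymptotic normal approximation has roughly nominal coverage. A simulation sketch (our own parameter choices, Bernoulli model):

```python
import math
import random

# Wald interval: p_hat +/- 1.96 * sqrt(p_hat * (1 - p_hat) / n).
# Its empirical coverage should be close to the nominal 95%.
random.seed(4)
p, n, reps = 0.3, 200, 2000

covered = 0
for _ in range(reps):
    phat = sum(random.random() < p for _ in range(n)) / n
    half = 1.96 * math.sqrt(phat * (1 - phat) / n)
    covered += phat - half <= p <= phat + half

coverage = covered / reps
print(f"empirical coverage of the 95% Wald interval: {coverage:.3f}")
```

Wald intervals are known to undercover slightly for small $n$ or extreme $p$; at these settings the coverage is close to 0.95.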
Failure Mode
Asymptotic normality fails when: (1) $\theta_0$ is on the boundary of $\Theta$ (e.g., estimating a variance that could be zero), (2) the Fisher information is zero or degenerate (flat likelihood), (3) the model is misspecified (the asymptotic distribution changes; the "sandwich estimator" is needed), or (4) the sample size is too small for the approximation to be accurate (especially in high dimensions).
Cramér-Rao Lower Bound
Statement
Let $\hat\theta$ be any unbiased estimator of $\theta$ (i.e., $\mathbb{E}_\theta[\hat\theta] = \theta$ for all $\theta$). Then:
$$\mathrm{Var}_\theta(\hat\theta) \ge \frac{1}{n I(\theta)}$$
In the multivariate case ($d > 1$), the covariance matrix satisfies $\mathrm{Cov}_\theta(\hat\theta) \succeq \frac{1}{n} I(\theta)^{-1}$ in the Loewner order.
An estimator that achieves equality is called efficient.
Intuition
The Cramér-Rao bound quantifies the fundamental limit of estimation. No matter how clever your estimator, you cannot beat variance $\frac{1}{n I(\theta)}$ while remaining unbiased. The Fisher information is the "price tag": more informative data (higher $I(\theta)$) allows more precise estimation.
The bound comes from the Cauchy-Schwarz inequality applied to the inner product between the score and the estimator in the space of square-integrable functions. Since the total score $S_n(\theta) = \sum_{i=1}^n s(\theta; X_i)$ has mean zero and variance $n I(\theta)$, and $\mathrm{Cov}_\theta(\hat\theta, S_n(\theta)) = 1$ (by differentiating the unbiasedness condition), Cauchy-Schwarz gives $1 \le \mathrm{Var}_\theta(\hat\theta)\, n I(\theta)$.
Proof Sketch
For an unbiased estimator $\hat\theta$ with $\mathbb{E}_\theta[\hat\theta] = \theta$, differentiate both sides with respect to $\theta$:
$$\frac{d}{d\theta} \int \hat\theta(x)\, p_\theta(x)\, dx = 1$$
Since $\partial_\theta p_\theta(x) = p_\theta(x)\, s(\theta; x)$, this gives $\mathbb{E}_\theta[\hat\theta\, S_n(\theta)] = 1$, i.e. $\mathrm{Cov}_\theta(\hat\theta, S_n(\theta)) = 1$ (the score has mean zero).
By Cauchy-Schwarz: $1 = \mathrm{Cov}_\theta(\hat\theta, S_n(\theta))^2 \le \mathrm{Var}_\theta(\hat\theta)\, \mathrm{Var}_\theta(S_n(\theta))$.
For $n$ i.i.d. observations, the total Fisher information is $\mathrm{Var}_\theta(S_n(\theta)) = n I(\theta)$, giving $\mathrm{Var}_\theta(\hat\theta) \ge \frac{1}{n I(\theta)}$.
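A numerical check of the bound in the Gaussian-mean case (our own parameter choices), where the sample mean attains it exactly:

```python
import random

# For N(mu, sigma^2) with known sigma, I(mu) = 1/sigma^2, so the
# Cramer-Rao bound is sigma^2 / n -- attained exactly by the sample mean.
random.seed(5)
mu, sigma, n, reps = 1.0, 2.0, 50, 4000

means = [sum(random.gauss(mu, sigma) for _ in range(n)) / n for _ in range(reps)]
m = sum(means) / reps
emp_var = sum((x - m) ** 2 for x in means) / reps
bound = sigma**2 / n  # 1 / (n I(mu)) = 0.08

print(f"empirical Var(xbar) = {emp_var:.4f}")  # close to 0.08
print(f"Cramer-Rao bound    = {bound:.4f}")    # 0.08
```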
Why It Matters
The Cramér-Rao bound is the benchmark against which all estimators are measured. An estimator achieving the bound is called efficient: it extracts all the information in the data. The MLE is asymptotically efficient: as $n \to \infty$, its rescaled variance $n\,\mathrm{Var}(\hat\theta_n)$ converges to $I(\theta_0)^{-1}$.
The bound also reveals a deep connection between estimation and information: the minimum variance is the reciprocal of the Fisher information. This connects to the geometry of statistical models (information geometry), where $I(\theta)$ is the Riemannian metric on the parameter space.
Failure Mode
The Cramér-Rao bound applies only to unbiased estimators. Biased estimators can sometimes achieve lower MSE than the Cramér-Rao bound (because $\mathrm{MSE} = \mathrm{bias}^2 + \mathrm{variance}$, and a small increase in bias can yield a large decrease in variance). This is the bias-variance tradeoff, and it is why regularized estimators (ridge regression, LASSO) often outperform the MLE in high dimensions.
The James-Stein paradox makes this concrete: see below.
The James-Stein Paradox: When MLE is Not Best
MLE is not always the best estimator
In dimensions $d \ge 3$, the MLE can be inadmissible: there exist estimators that have strictly lower risk (mean squared error) for every value of the true parameter.
Setup: Let $X \sim N(\theta, I_d)$ with $d \ge 3$. The MLE is $\hat\theta_{\mathrm{MLE}} = X$ (just the observation). The James-Stein estimator is:
$$\hat\theta_{\mathrm{JS}} = \left(1 - \frac{d - 2}{\|X\|^2}\right) X$$
This shrinks $X$ toward zero. The remarkable result: for ALL $\theta$, $\mathbb{E}\|\hat\theta_{\mathrm{JS}} - \theta\|^2 < \mathbb{E}\|\hat\theta_{\mathrm{MLE}} - \theta\|^2$.
This does not contradict Cramér-Rao. The multivariate bound is a matrix inequality in the Loewner order, which applies only to unbiased estimators and bounds covariance, not total MSE $\mathbb{E}\|\hat\theta - \theta\|^2$. The James-Stein estimator is biased, so the bound does not apply to it, and it improves MSE by trading a little bias for a lot of variance reduction, especially when $\|\theta\|$ is moderate.
Lesson for ML: Shrinkage (regularization) can uniformly dominate the unpenalized MLE in high dimensions. This is the theoretical justification for ridge regression, LASSO, and other regularized estimators.
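A simulation sketch makes the improvement visible (our own parameter choices; we use the positive-part variant $\max(0,\, 1 - (d-2)/\|X\|^2)$, which is known to dominate the plain James-Stein estimator):

```python
import random

# In d = 10 dimensions, shrinking X toward zero beats the MLE (X itself)
# in total mean squared error.
random.seed(6)
d, reps = 10, 2000
theta = [0.5] * d  # a moderate true mean vector

mse_mle = mse_js = 0.0
for _ in range(reps):
    x = [t + random.gauss(0.0, 1.0) for t in theta]
    norm2 = sum(v * v for v in x)
    shrink = max(0.0, 1.0 - (d - 2) / norm2)  # positive-part James-Stein
    mse_mle += sum((v - t) ** 2 for v, t in zip(x, theta))
    mse_js += sum((shrink * v - t) ** 2 for v, t in zip(x, theta))

print(f"MSE of MLE:         {mse_mle / reps:.2f}")  # near d = 10
print(f"MSE of James-Stein: {mse_js / reps:.2f}")   # strictly smaller
```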
MLE as Empirical KL Minimization
There is a standard information-theoretic interpretation of MLE:
$$\hat\theta_n = \arg\max_\theta \ell_n(\theta) = \arg\min_\theta \mathrm{KL}(\hat{p}_n \,\|\, p_\theta)$$
where $\hat{p}_n$ is the empirical distribution and the minimization is over the KL divergence from $\hat{p}_n$ to the model (the two objectives differ by a constant not depending on $\theta$).
This connects MLE to:
- Cross-entropy loss in deep learning: minimizing cross-entropy = MLE for a categorical model
- ERM in learning theory: MLE is ERM with the log-loss
- Information projection: the MLE is the closest model to the data in KL divergence
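A tiny worked example of the first connection (toy data and the helper `avg_nll` are our own): for a categorical model, the average negative log-likelihood is the cross-entropy between the empirical distribution and the model, and it is minimized at the empirical frequencies.

```python
import math
from collections import Counter

# n = 10 draws from the alphabet {a, b, c}; empirical freqs a: 0.5, b: 0.3, c: 0.2.
data = list("aababcaabc")
n = len(data)
freq = {k: c / n for k, c in Counter(data).items()}

def avg_nll(model):
    # Average negative log-likelihood = cross-entropy H(p_hat, model).
    return -sum(math.log(model[x]) for x in data) / n

nll_mle = avg_nll(freq)
nll_uniform = avg_nll({'a': 1/3, 'b': 1/3, 'c': 1/3})
print(f"NLL at MLE (empirical freqs): {nll_mle:.4f}")  # the minimum
print(f"NLL at uniform model:         {nll_uniform:.4f}")  # log 3, larger
```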
Canonical Examples
MLE for Gaussian mean
Let $X_1, \dots, X_n \sim N(\mu, \sigma^2)$ with known $\sigma^2$. The log-likelihood is $\ell_n(\mu) = -\frac{1}{2\sigma^2}\sum_{i=1}^n (X_i - \mu)^2 + \text{const}$.
Setting $\ell_n'(\mu) = 0$: $\hat\mu = \bar{X}_n = \frac{1}{n}\sum_{i=1}^n X_i$ (the sample mean).
Fisher information: $I(\mu) = 1/\sigma^2$.
Asymptotic variance: $\mathrm{Var}(\hat\mu) = \sigma^2/n = \frac{1}{n I(\mu)}$.
The sample mean is exactly (not just asymptotically) efficient for the Gaussian mean. It achieves the Cramér-Rao bound with equality for all $n$.
MLE for Bernoulli parameter
Let $X_1, \dots, X_n \sim \mathrm{Bernoulli}(p)$. The log-likelihood is $\ell_n(p) = S \log p + (n - S)\log(1 - p)$ where $S = \sum_{i=1}^n X_i$.
Setting $\ell_n'(p) = 0$: $\hat{p} = S/n$ (the sample proportion).
Fisher information: $I(p) = \frac{1}{p(1-p)}$.
Asymptotic variance: $\mathrm{Var}(\hat{p}) = \frac{p(1-p)}{n} = \frac{1}{n I(p)}$, which matches exactly. The sample proportion is efficient for the Bernoulli parameter.
MLE for Gaussian variance
Let $X_1, \dots, X_n \sim N(\mu, \sigma^2)$ with known $\mu$. The MLE is $\hat\sigma^2 = \frac{1}{n}\sum_{i=1}^n (X_i - \mu)^2$.
Here $\mathbb{E}[\hat\sigma^2] = \sigma^2$, so it is unbiased. If $\mu$ is unknown, the MLE is $\hat\sigma^2 = \frac{1}{n}\sum_{i=1}^n (X_i - \bar{X}_n)^2$, which has $\mathbb{E}[\hat\sigma^2] = \frac{n-1}{n}\sigma^2$. It is biased downward. The unbiased version divides by $n - 1$. This is one of the simplest cases where MLE is biased.
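The downward bias is easy to see by simulation (our own parameter choices: $\sigma^2 = 1$, $n = 10$, so the theoretical expectation of the MLE is $(n-1)/n = 0.9$):

```python
import random

# Average the n-divisor (MLE) and (n-1)-divisor variance estimates over
# many replications; the former is biased low, the latter unbiased.
random.seed(7)
n, reps = 10, 20_000

mle_sum = unbiased_sum = 0.0
for _ in range(reps):
    x = [random.gauss(0.0, 1.0) for _ in range(n)]
    xbar = sum(x) / n
    ss = sum((v - xbar) ** 2 for v in x)
    mle_sum += ss / n
    unbiased_sum += ss / (n - 1)

mle_mean = mle_sum / reps
unb_mean = unbiased_sum / reps
print(f"E[MLE]      ~ {mle_mean:.3f}  (theory: (n-1)/n = 0.9)")
print(f"E[unbiased] ~ {unb_mean:.3f}  (theory: 1.0)")
```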
Common Confusions
The likelihood is not a probability distribution over theta
$L(\theta)$ is a function of $\theta$, not a density in $\theta$. It does not integrate to 1 over $\Theta$ and cannot be interpreted as a posterior probability. The Bayesian approach multiplies by a prior $\pi(\theta)$ and normalizes to get the posterior $\pi(\theta \mid x) \propto L(\theta)\,\pi(\theta)$. The MLE is the mode of the likelihood, not the mode of any posterior (unless the prior is flat).
MLE exists and is unique only under regularity conditions
The MLE may not exist (e.g., in a mixture model where a component can collapse to a single data point, giving infinite likelihood). It may not be unique (e.g., multimodal likelihood). Existence and uniqueness require conditions on the model: typically, concavity of the log-likelihood (exponential families) or compactness of $\Theta$.
Consistency requires the model to be well-specified
If the true distribution $p^*$ does not belong to the model family $\{p_\theta : \theta \in \Theta\}$, the MLE converges to the parameter $\theta^* = \arg\min_\theta \mathrm{KL}(p^* \,\|\, p_\theta)$: the closest model to the truth in KL divergence. This is still useful (it gives the best approximation within the model), but the asymptotic variance formula changes: you need the "sandwich" variance $A^{-1} B A^{-1}$ instead of $I(\theta_0)^{-1}$, where $A = \mathbb{E}_{p^*}\big[-\nabla_\theta^2 \log p_{\theta^*}(X)\big]$ is the expected Hessian and $B = \mathbb{E}_{p^*}\big[s(\theta^*; X)\, s(\theta^*; X)^\top\big]$ is the outer product of scores, both under the true distribution $p^*$. Under correct specification $A = B = I(\theta_0)$ and the sandwich collapses to $I(\theta_0)^{-1}$. The letter $J_n$ is reserved above for the data-dependent observed Fisher information, distinct from these population quantities. See White (1982, Econometrica).
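A minimal simulation of this effect (our own setup: fit the model $N(\mu, 1)$, with the variance wrongly fixed at 1, to data whose true variance is 4; then the expected Hessian is 1, the score variance is 4, and the sandwich variance is 4, while the naive model-based variance is 1):

```python
import math
import random

# Fitting N(mu, 1) to data with true sd = 2, the MLE of mu is still the
# sample mean, and the true variance of sqrt(n) * (mu_hat - mu) is 4 --
# matching the sandwich variance, not the model-based value 1.
random.seed(8)
n, reps, true_sd = 200, 2000, 2.0

scaled = []
for _ in range(reps):
    xbar = sum(random.gauss(0.0, true_sd) for _ in range(n)) / n
    scaled.append(math.sqrt(n) * xbar)  # sqrt(n) * (mu_hat - 0)

m = sum(scaled) / reps
emp_var = sum((z - m) ** 2 for z in scaled) / reps
print(f"empirical var of sqrt(n)(mu_hat - mu): {emp_var:.2f}")  # near 4
print("model-based variance = 1.00, sandwich variance = 4.00")
```

Confidence intervals built from the model-based variance would be far too narrow here; the sandwich gets the width right.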
Summary
- MLE maximizes $\ell_n(\theta) = \sum_{i=1}^n \log p_\theta(X_i)$, or equivalently minimizes the negative log-likelihood
- Score function $s(\theta; x) = \nabla_\theta \log p_\theta(x)$ has mean zero at the truth
- Fisher information $I(\theta) = \mathrm{Var}_\theta\big(s(\theta; X)\big)$ measures data informativeness
- Consistency: $\hat\theta_n \to \theta_0$ in probability
- Asymptotic normality: $\sqrt{n}\,(\hat\theta_n - \theta_0) \xrightarrow{d} N(0, I(\theta_0)^{-1})$
- Cramér-Rao: $\mathrm{Var}(\hat\theta) \ge \frac{1}{n I(\theta)}$ for unbiased $\hat\theta$
- MLE is asymptotically efficient (achieves the Cramér-Rao bound)
- MLE = empirical KL minimization = ERM with log-loss
- James-Stein: the MLE is inadmissible in $d \ge 3$; shrinkage can uniformly improve it
- In practice: MLE is biased in finite samples; regularization often helps
Exercises
Problem
Compute the MLE and Fisher information for the exponential distribution: $p_\lambda(x) = \lambda e^{-\lambda x}$ for $x \ge 0$.
Problem
Show that for the Bernoulli model with $n$ observations and zero successes, the MLE is $\hat{p} = 0$. Explain why this is problematic and how a Bayesian approach with a Beta(1,1) prior would give a different answer.
Problem
Prove the Cramér-Rao bound for the scalar case. Let $\hat\theta$ be an unbiased estimator of $\theta$ under the model $p_\theta$. Show that $\mathrm{Var}_\theta(\hat\theta) \ge \frac{1}{n I(\theta)}$ using the Cauchy-Schwarz inequality.
References
Canonical:
- Lehmann & Casella, Theory of Point Estimation (2nd ed., 1998), Chapter 6 (regularity conditions, asymptotic normality) and Chapters 5, 7
- van der Vaart, Asymptotic Statistics (1998), Chapter 5 (MLE and regularity conditions) and Chapter 7
- Casella & Berger, Statistical Inference (2nd ed., 2002), Chapters 7 and 10
Current:
- Wasserman, All of Statistics (2004), Chapters 9-10
- Murphy, Machine Learning: A Probabilistic Perspective (2012), Chapter 8
Next Topics
Building on MLE theory:
- Hypothesis testing for ML: using the MLE and Fisher information for statistical tests (Wald, likelihood ratio, score tests)
- EM algorithm: MLE for latent variable models when direct maximization is intractable
- Empirical risk minimization: the learning-theoretic generalization of MLE to arbitrary loss functions
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Common Probability Distributions (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
- Differentiation in R^n (Layer 0A)
Builds on This
- AIC and BIC (Layer 2)
- Asymptotic Statistics (Layer 0B)
- Bayesian Estimation (Layer 0B)
- Distributional Semantics (Layer 2)
- The EM Algorithm (Layer 2)
- Fisher Information (Layer 0B)
- Gaussian Mixture Models and EM (Layer 2)
- Linear Regression (Layer 1)
- Logistic Regression (Layer 1)
- Logspline Density Estimation (Layer 2)
- Minimax Lower Bounds (Layer 3)
- Neyman-Pearson and Hypothesis Testing Theory (Layer 2)
- Robust Statistics and M-Estimators (Layer 3)
- Shrinkage Estimation and the James-Stein Estimator (Layer 0B)
- Stein's Paradox (Layer 0B)
- Sufficient Statistics and Exponential Families (Layer 0B)
- Survival Analysis (Layer 3)
- Variational Autoencoders (Layer 3)