Beta. Content is under active construction and has not been peer-reviewed. Report errors on GitHub.

Statistical Estimation

Maximum Likelihood Estimation

MLE: find the parameter that maximizes the likelihood of observed data. Consistency, asymptotic normality, Fisher information, Cramér-Rao efficiency, and when MLE fails.

Core · Tier 1 · Stable · ~65 min

Why This Matters

Maximum likelihood estimation is the most widely used method for fitting parametric models to data. When you train a logistic regression by minimizing cross-entropy loss, you are doing MLE. When you fit a Gaussian mixture model, you are doing MLE (via EM). When you train a language model by minimizing perplexity, you are doing MLE.

Understanding MLE theory answers three fundamental questions:

  1. Does it work? Consistency: as $n \to \infty$, the MLE converges to the true parameter.
  2. How accurate is it? Asymptotic normality: the MLE is approximately Gaussian with variance $1/(nI(\theta))$, where $I(\theta)$ is the Fisher information.
  3. Can you do better? The Cramér-Rao bound says no regular unbiased estimator can have smaller variance than $1/(nI(\theta))$. MLE achieves this bound asymptotically: it is efficient.
[Interactive figure: log-likelihood $\text{LL}(\mu)$ of a Gaussian sample ($n = 10$, known $\sigma = 1$) as a function of the parameter $\mu$, with the MLE at the peak and the log-likelihood drop $\Delta\text{LL}$ at a nearby $\hat{\mu}$ marked.]

Mental Model

You observe data $x_1, \ldots, x_n$ from a distribution $p(x | \theta^*)$ with unknown parameter $\theta^*$. The MLE asks: which parameter $\theta$ makes the observed data most likely?

Formally, you maximize the likelihood function $L(\theta) = \prod_{i=1}^n p(x_i | \theta)$, or equivalently minimize the negative log-likelihood $-\ell(\theta) = -\sum_i \log p(x_i | \theta)$.

The intuition: among all possible parameter values, pick the one under which your data looks the least surprising. The log transform turns products into sums, making optimization tractable and connecting MLE to information theory (the average negative log-likelihood equals the KL divergence from the empirical distribution to the model, up to a constant).
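As a concrete, minimal sketch of this definition (the parameter values and fixed seed are illustrative): for a Gaussian model with known variance, a brute-force grid search over the negative log-likelihood recovers the closed-form MLE, the sample mean.

```python
import math
import random

random.seed(0)

# Simulated data from N(mu_true, 1); sigma is known, mu is the unknown parameter.
mu_true, n = 2.0, 500
xs = [random.gauss(mu_true, 1.0) for _ in range(n)]

def neg_log_likelihood(mu, data, sigma=1.0):
    """Negative log-likelihood of the N(mu, sigma^2) model."""
    return sum(0.5 * math.log(2 * math.pi * sigma**2)
               + (x - mu)**2 / (2 * sigma**2) for x in data)

# Brute-force minimization of the NLL over a grid of candidate mu values.
grid = [i / 100 for i in range(0, 400)]          # 0.00, 0.01, ..., 3.99
mle = min(grid, key=lambda mu: neg_log_likelihood(mu, xs))

# For this model the MLE has a closed form: the sample mean.
sample_mean = sum(xs) / n
print(mle, sample_mean)   # agree up to the grid resolution of 0.01
```

In practice the maximization is done analytically (as in the examples below) or with a gradient-based optimizer; the grid search only makes the definition tangible.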

Formal Setup and Notation

Let $x_1, \ldots, x_n$ be i.i.d. from $p(x | \theta^*)$ where $\theta^* \in \Theta \subseteq \mathbb{R}^d$. The parametric family $\{p(x | \theta) : \theta \in \Theta\}$ is the statistical model.

Definition

Likelihood Function

The likelihood function is:

$$L(\theta) = \prod_{i=1}^n p(x_i | \theta)$$

It is a function of $\theta$; the data are fixed. Despite looking like a joint density, $L(\theta)$ is not a probability distribution over $\theta$: it is not normalized and does not integrate to 1.

Definition

Log-Likelihood

The log-likelihood is:

$$\ell(\theta) = \sum_{i=1}^n \log p(x_i | \theta)$$

The log transform converts products to sums, which is essential for both computation (numerical stability) and theory (sums of i.i.d. terms are amenable to the law of large numbers and CLT).

Definition

Maximum Likelihood Estimator

The maximum likelihood estimator is:

$$\hat{\theta}_{\text{MLE}} = \arg\max_{\theta \in \Theta} \ell(\theta) = \arg\max_{\theta \in \Theta} \sum_{i=1}^n \log p(x_i | \theta)$$

Equivalently, it minimizes the negative log-likelihood. When the model is well-specified ($\theta^* \in \Theta$), MLE minimizes the empirical KL divergence from the data distribution to the model.

Core Definitions

Definition

Score Function

The score function is the gradient of the log-likelihood with respect to θ\theta:

$$s(\theta; x) = \frac{\partial}{\partial \theta} \log p(x | \theta) = \frac{\nabla_\theta p(x | \theta)}{p(x | \theta)}$$

Key property: $\mathbb{E}_{\theta^*}[s(\theta^*; X)] = 0$. The score has mean zero at the true parameter. This follows from differentiating $\int p(x | \theta)\, dx = 1$ with respect to $\theta$.

The MLE satisfies $\sum_{i=1}^n s(\hat{\theta}; x_i) = 0$ (the score equation), which is the first-order optimality condition for the log-likelihood.
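A quick numeric check of the mean-zero property, using the $\mathcal{N}(\mu, 1)$ model, whose score is $s(\mu; x) = x - \mu$ (the sample size, seed, and parameter values below are arbitrary choices for illustration):

```python
import random

random.seed(8)

# Score of the N(mu, 1) model: s(mu; x) = x - mu.
mu_star, n = 1.5, 100_000
xs = [random.gauss(mu_star, 1.0) for _ in range(n)]

# At the true parameter, the average score is near zero ...
mean_score_at_truth = sum(x - mu_star for x in xs) / n
# ... while at a wrong parameter it is not (here approximately 1.5 - 2.5 = -1).
mean_score_elsewhere = sum(x - 2.5 for x in xs) / n
print(mean_score_at_truth, mean_score_elsewhere)
```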

Definition

Fisher Information

The Fisher information at $\theta$ is:

$$I(\theta) = \mathbb{E}_\theta\!\left[s(\theta; X)\, s(\theta; X)^\top\right] = \text{Cov}_\theta(s(\theta; X))$$

Under regularity conditions, this equals:

$$I(\theta) = -\mathbb{E}_\theta\!\left[\frac{\partial^2}{\partial \theta^2} \log p(X | \theta)\right]$$

In the scalar case ($d = 1$): $I(\theta) = \mathbb{E}[(s(\theta; X))^2] = -\mathbb{E}[\ell''(\theta)]$.

The Fisher information measures how much information each observation carries about $\theta$. High Fisher information means the likelihood is sharply peaked, so $\theta$ is easy to estimate. Low Fisher information means the likelihood is flat, so $\theta$ is hard to pin down.

Definition

Observed Fisher Information

The observed Fisher information is:

$$J(\hat{\theta}) = -\frac{1}{n}\sum_{i=1}^n \frac{\partial^2}{\partial \theta^2} \log p(x_i | \hat{\theta})$$

This is the data-dependent (random) version of the Fisher information. By the law of large numbers, $J(\hat{\theta}) \to I(\theta^*)$ as $n \to \infty$. In practice, you use $J(\hat{\theta})$ to estimate $I(\theta^*)$ for confidence intervals.
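The two characterizations of Fisher information (variance of the score vs. negative expected second derivative) can be checked by simulation. The sketch below uses a Poisson($\lambda$) model, where $s(\lambda; x) = x/\lambda - 1$, $-\ell''(\lambda; x) = x/\lambda^2$, and $I(\lambda) = 1/\lambda$; the sampler and constants are illustrative.

```python
import math
import random

random.seed(1)

def sample_poisson(lam, rng):
    """Knuth's Poisson sampler (fine for small lam)."""
    L = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= L:
            return k
        k += 1

lam, n = 4.0, 100_000
xs = [sample_poisson(lam, random) for _ in range(n)]

# Two Monte Carlo estimates of I(lambda); both approach 1/lambda = 0.25.
var_score = sum((x / lam - 1) ** 2 for x in xs) / n    # E[s^2]
mean_neg_hess = sum(x / lam**2 for x in xs) / n        # -E[l'']
print(var_score, mean_neg_hess)
```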

Main Theorems

Regularity Conditions (Cramér's Conditions)

The asymptotic results below (consistency, asymptotic normality, efficiency) require a standard set of regularity conditions on the parametric family $\{p(x | \theta) : \theta \in \Theta\}$. The canonical list (van der Vaart 1998, Ch. 5; Lehmann and Casella 1998, Ch. 6):

  1. Identifiability: $\theta_1 \neq \theta_2 \Rightarrow p_{\theta_1} \neq p_{\theta_2}$.
  2. Common support: the support of $p_\theta$ does not depend on $\theta$.
  3. Interior: the true parameter $\theta_0$ lies in the interior of the parameter space $\Theta$.
  4. Smoothness: $\log p_\theta(x)$ is three-times continuously differentiable in $\theta$, with third derivatives dominated by an integrable function.
  5. Interchange: differentiation under the integral sign is valid (so the score and Fisher information identities hold).
  6. Fisher information: $I(\theta)$ is finite and positive definite at $\theta_0$.

When any of these fails, the asymptotic distribution of the MLE can change or cease to exist. Boundary cases, non-identifiable models, and families with parameter-dependent support are the most common failure scenarios.

Theorem

Consistency of MLE

Statement

Under regularity conditions (identifiability, compactness of $\Theta$, and continuity of $\log p(x | \theta)$ in $\theta$), the MLE is consistent:

$$\hat{\theta}_{\text{MLE}} \xrightarrow{P} \theta^*$$

as $n \to \infty$.

Intuition

The normalized log-likelihood $\frac{1}{n}\ell(\theta) = \frac{1}{n}\sum_i \log p(x_i | \theta)$ converges to $\mathbb{E}_{\theta^*}[\log p(X | \theta)]$ by the law of large numbers. This expected value is maximized uniquely at $\theta = \theta^*$ by the information inequality (Gibbs' inequality): for any $\theta \neq \theta^*$,

$$\mathbb{E}_{\theta^*}[\log p(X | \theta^*)] > \mathbb{E}_{\theta^*}[\log p(X | \theta)]$$

because $D_{\text{KL}}(p_{\theta^*} \| p_\theta) > 0$. So the limiting objective has a unique maximum at $\theta^*$, and the maximizer of the empirical version converges to it.

Proof Sketch

Step 1: By the law of large numbers, $\frac{1}{n}\ell(\theta) \to \mathbb{E}[\log p(X | \theta)]$ for each $\theta$.

Step 2: Under compactness and continuity, this convergence is uniform in $\theta$: $\sup_\theta \left|\frac{1}{n}\ell(\theta) - \mathbb{E}[\log p(X | \theta)]\right| \to 0$.

Step 3: The limit $\mathbb{E}[\log p(X | \theta)]$ has a unique maximizer at $\theta^*$ (by Gibbs' inequality / KL positivity).

Step 4: By a standard argument (maximizers of uniformly converging functions converge to the maximizer of the limit), $\hat{\theta} \to \theta^*$.

Why It Matters

Consistency is the minimal requirement for any estimator: with enough data, you recover the truth. MLE achieves this under mild conditions. Consistency requires the model to be identifiable: different parameters must give different distributions. If the model is overparameterized (as in neural networks), the MLE is not unique, and consistency of the parameter vector fails (though the fitted distribution may still converge).

Failure Mode

Consistency can fail if: (1) the model is not identifiable (e.g., mixture models with label switching), (2) the parameter space is not compact and the MLE drifts to the boundary (e.g., the variance estimate in a mixture with a component collapsing to a point), or (3) the number of parameters grows with $n$ (the "Neyman-Scott problem": with $n$ means and one variance, the MLE of the variance is inconsistent).
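When the regularity conditions do hold, consistency is easy to see numerically. A minimal sketch for the Bernoulli model (seed, true parameter, and sample sizes are arbitrary):

```python
import random

random.seed(2)

p_true = 0.7

def bernoulli_mle(n, rng):
    """MLE for a Bernoulli parameter: the sample proportion of n draws."""
    return sum(1 if rng.random() < p_true else 0 for _ in range(n)) / n

# The estimation error |p_hat - p*| shrinks as n grows: consistency in action.
errors = {n: abs(bernoulli_mle(n, random) - p_true) for n in (100, 10_000, 1_000_000)}
print(errors)
```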

Theorem

Asymptotic Normality of MLE

Statement

Under regularity conditions, the MLE is asymptotically normal:

$$\sqrt{n}(\hat{\theta}_{\text{MLE}} - \theta^*) \xrightarrow{d} \mathcal{N}(0, I(\theta^*)^{-1})$$

Equivalently, for large $n$:

$$\hat{\theta}_{\text{MLE}} \approx \mathcal{N}\!\left(\theta^*, \frac{I(\theta^*)^{-1}}{n}\right)$$

The variance $I(\theta^*)^{-1}/n$ is the Cramér-Rao lower bound, achieved by the MLE in the limit.

Intuition

The log-likelihood $\ell(\theta)$ is a sum of i.i.d. terms. Near $\theta^*$, a Taylor expansion gives:

$$\ell(\theta) \approx \ell(\theta^*) + \ell'(\theta^*)(\theta - \theta^*) + \frac{1}{2}\ell''(\theta^*)(\theta - \theta^*)^2$$

Setting $\ell'(\hat{\theta}) = 0$ and solving: $\hat{\theta} \approx \theta^* - \ell'(\theta^*)/\ell''(\theta^*)$.

The numerator $\ell'(\theta^*) = \sum_i s(\theta^*; x_i)$ is a sum of i.i.d. mean-zero terms with variance $nI(\theta^*)$, so by the CLT: $\ell'(\theta^*)/\sqrt{n} \to \mathcal{N}(0, I(\theta^*))$.

The denominator: $\ell''(\theta^*)/n \to -I(\theta^*)$ by the LLN.

Combining: $\sqrt{n}(\hat{\theta} - \theta^*) \approx \dfrac{-\ell'(\theta^*)/\sqrt{n}}{\ell''(\theta^*)/n} \to \dfrac{\mathcal{N}(0, I)}{-I} = \mathcal{N}(0, I^{-1})$.

Proof Sketch

Step 1 (Taylor expand). Expand the score at $\hat{\theta}$ around $\theta^*$: $0 = \frac{1}{n}\ell'(\hat{\theta}) = \frac{1}{n}\ell'(\theta^*) + \frac{1}{n}\ell''(\tilde{\theta})(\hat{\theta} - \theta^*)$ for some $\tilde{\theta}$ between $\hat{\theta}$ and $\theta^*$.

Step 2 (Solve). Rearrange: $\sqrt{n}(\hat{\theta} - \theta^*) = -\left(\frac{\ell''(\tilde{\theta})}{n}\right)^{-1} \cdot \frac{\ell'(\theta^*)}{\sqrt{n}}$.

Step 3 (CLT + LLN). By the CLT: $\ell'(\theta^*)/\sqrt{n} \to \mathcal{N}(0, I(\theta^*))$. By the LLN and consistency: $\ell''(\tilde{\theta})/n \to -I(\theta^*)$.

Step 4 (Slutsky). By Slutsky's theorem: $\sqrt{n}(\hat{\theta} - \theta^*) \to \mathcal{N}(0, I(\theta^*)^{-1})$.

Why It Matters

Asymptotic normality is the basis for:

  • Confidence intervals: $\hat{\theta} \pm z_{\alpha/2}/\sqrt{nI(\hat{\theta})}$
  • Hypothesis tests: Wald test, likelihood ratio test, score test
  • Model comparison: AIC, BIC (which penalize the number of parameters using the asymptotic distribution of the log-likelihood)

It also shows that MLE is asymptotically efficient: its variance matches the Cramér-Rao lower bound.
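A sketch of the confidence-interval recipe for the Bernoulli model, where $I(p) = 1/(p(1-p))$, so the Wald interval is $\hat{p} \pm z_{\alpha/2}\sqrt{\hat{p}(1-\hat{p})/n}$. The true parameter, sample size, and replication count are illustrative; empirical coverage should land near the nominal 95%.

```python
import math
import random

random.seed(3)

p_true, n = 0.4, 2_000
z = 1.96   # z_{alpha/2} for a nominal 95% interval

def wald_ci(xs):
    """95% Wald interval: p_hat +/- z/sqrt(m I(p_hat)), with I(p) = 1/(p(1-p))."""
    m = len(xs)
    p_hat = sum(xs) / m
    se = math.sqrt(p_hat * (1 - p_hat) / m)   # = 1/sqrt(m * I(p_hat))
    return p_hat - z * se, p_hat + z * se

# Empirical coverage over repeated samples should be close to the nominal 95%.
trials, hits = 2_000, 0
for _ in range(trials):
    xs = [1 if random.random() < p_true else 0 for _ in range(n)]
    lo, hi = wald_ci(xs)
    hits += lo <= p_true <= hi
print(hits / trials)
```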

Failure Mode

Asymptotic normality fails when: (1) $\theta^*$ is on the boundary of $\Theta$ (e.g., estimating a variance that could be zero), (2) the Fisher information is zero or degenerate (flat likelihood), (3) the model is misspecified (the asymptotic distribution changes, and the "sandwich" variance estimator is needed), or (4) the sample size is too small for the approximation to be accurate (especially in high dimensions).

Theorem

Cramér-Rao Lower Bound

Statement

Let $T(X_1, \ldots, X_n)$ be any unbiased estimator of $\theta$ (i.e., $\mathbb{E}_\theta[T] = \theta$ for all $\theta$). Then:

$$\text{Var}_\theta(T) \geq \frac{1}{n\, I(\theta)}$$

In the multivariate case ($\theta \in \mathbb{R}^d$), the covariance matrix satisfies $\text{Cov}(T) \succeq (nI(\theta))^{-1}$ in the Loewner order.

An estimator that achieves equality is called efficient.

Intuition

The Cramér-Rao bound quantifies the fundamental limit of estimation. No matter how clever your estimator, you cannot beat $1/(nI(\theta))$ in variance while remaining unbiased. The Fisher information $I(\theta)$ is the "price tag": more informative data (higher $I$) allows more precise estimation.

The bound comes from the Cauchy-Schwarz inequality applied to the inner product between the score $s(\theta; X)$ and the estimator $T(X)$ in the space of square-integrable functions. Since the score has mean zero and variance $I(\theta)$, and $\text{Cov}(T, s) = 1$ (by differentiating the unbiasedness condition), Cauchy-Schwarz gives $1 \leq \text{Var}(T) \cdot I(\theta)$.

Proof Sketch

For an unbiased estimator $T$ with $\mathbb{E}_\theta[T] = \theta$, differentiate both sides with respect to $\theta$:

$$1 = \frac{d}{d\theta}\mathbb{E}_\theta[T] = \int T(x) \frac{\partial}{\partial \theta} p(x | \theta)\, dx = \int T(x)\, s(\theta; x)\, p(x | \theta)\, dx = \mathbb{E}[T \cdot s(\theta; X)]$$

Since $\mathbb{E}[s(\theta; X)] = 0$, this gives $\text{Cov}(T, s) = 1$.

By Cauchy-Schwarz: $1 = |\text{Cov}(T, s)|^2 \leq \text{Var}(T) \cdot \text{Var}(s) = \text{Var}(T) \cdot I(\theta)$.

For $n$ i.i.d. observations, the total Fisher information is $nI(\theta)$, giving $\text{Var}(T) \geq 1/(nI(\theta))$.
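For the Bernoulli sample proportion, the bound holds with equality at every $n$, which can be checked by simulation (the constants and seed below are illustrative):

```python
import random

random.seed(4)

p, n, trials = 0.25, 200, 10_000

# Monte Carlo estimate of Var(p_hat) for the sample proportion.
estimates = []
for _ in range(trials):
    k = sum(1 if random.random() < p else 0 for _ in range(n))
    estimates.append(k / n)
mean = sum(estimates) / trials
var_hat = sum((e - mean) ** 2 for e in estimates) / (trials - 1)

cramer_rao = p * (1 - p) / n   # = 1/(n I(p)) since I(p) = 1/(p(1-p))
print(var_hat / cramer_rao)    # ratio near 1: the bound is attained here
```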

Why It Matters

The Cramér-Rao bound is the benchmark against which all estimators are measured. An estimator achieving the bound is called efficient: it extracts all the information in the data. The MLE is asymptotically efficient: as $n \to \infty$, its variance converges to $1/(nI(\theta))$.

The bound also reveals a deep connection between estimation and information: the minimum variance is the reciprocal of the Fisher information. This connects to the geometry of statistical models (information geometry), where $I(\theta)$ is the Riemannian metric on the parameter space.

Failure Mode

The Cramér-Rao bound applies only to unbiased estimators. Biased estimators can sometimes achieve lower MSE than the Cramér-Rao bound (because $\text{MSE} = \text{Var} + \text{Bias}^2$, and a small increase in bias can yield a large decrease in variance). This is the bias-variance tradeoff, and it is why regularized estimators (ridge regression, LASSO) often outperform the MLE in high dimensions.

The James-Stein paradox makes this concrete: see below.

The James-Stein Paradox: When MLE is Not Best

Watch Out

MLE is not always the best estimator

In dimensions $d \geq 3$, the MLE is inadmissible: there exist estimators that have strictly lower risk (mean squared error) for every value of the true parameter.

Setup: Let $X \sim \mathcal{N}(\theta, I_d)$ with $d \geq 3$. The MLE is $\hat{\theta}_{\text{MLE}} = X$ (just the observation). The James-Stein estimator is:

$$\hat{\theta}_{\text{JS}} = \left(1 - \frac{d - 2}{\|X\|^2}\right) X$$

This shrinks $X$ toward zero. The remarkable result: for ALL $\theta$, $\mathbb{E}[\|\hat{\theta}_{\text{JS}} - \theta\|^2] < \mathbb{E}[\|\hat{\theta}_{\text{MLE}} - \theta\|^2] = d$.

This does not contradict Cramér-Rao. The multivariate bound is a matrix inequality $\text{Cov}(T) \succeq (nI(\theta))^{-1}$ in the Loewner order, which applies only to unbiased estimators and bounds covariance, not total MSE $\mathbb{E}[\|T - \theta\|^2] = \text{tr}(\text{Cov}(T)) + \|\text{Bias}\|^2$. The James-Stein estimator is biased, so the bound does not apply to it, and it improves MSE by trading a little bias for a lot of variance reduction, especially when $\|\theta\|$ is moderate.

Lesson for ML: Shrinkage (regularization) can uniformly dominate the unpenalized MLE in high dimensions. This is the theoretical justification for ridge regression, LASSO, and other regularized estimators.
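A quick Monte Carlo check of the dominance claim (the true mean vector, dimension, and trial count are arbitrary choices): the MLE's risk is exactly $d$, and James-Stein shrinkage comes in below it.

```python
import random

random.seed(5)

d, trials = 10, 5_000
theta = [1.0] * d   # an arbitrary true mean vector

mse_mle = mse_js = 0.0
for _ in range(trials):
    x = [random.gauss(t, 1.0) for t in theta]        # one draw X ~ N(theta, I_d)
    shrink = 1 - (d - 2) / sum(v * v for v in x)     # James-Stein shrinkage factor
    js = [shrink * v for v in x]
    mse_mle += sum((v - t) ** 2 for v, t in zip(x, theta))
    mse_js += sum((v - t) ** 2 for v, t in zip(js, theta))

print(mse_mle / trials, mse_js / trials)   # MLE risk is d = 10; JS risk is lower
```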

MLE as Empirical KL Minimization

There is a standard information-theoretic interpretation of MLE:

$$\hat{\theta}_{\text{MLE}} = \arg\min_\theta \frac{1}{n}\sum_{i=1}^n \left[-\log p(x_i | \theta)\right] = \arg\min_\theta \hat{D}_{\text{KL}}(\hat{P}_n \| p_\theta)$$

where $\hat{P}_n$ is the empirical distribution and the minimization is over the KL divergence from $\hat{P}_n$ to the model $p_\theta$ (up to a constant not depending on $\theta$).

This connects MLE to:

  • Cross-entropy loss in deep learning: minimizing cross-entropy = MLE for a categorical model
  • ERM in learning theory: MLE is ERM with the log-loss
  • Information projection: the MLE is the closest model to the data in KL divergence
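The cross-entropy connection is an exact algebraic identity: the average negative log-likelihood of categorical data under a model $q$ equals $H(\hat{P}_n) + D_{\text{KL}}(\hat{P}_n \| q)$, so minimizing it over $q$ minimizes the KL term. A sketch with hypothetical counts and a hypothetical model $q$:

```python
import math

# Hypothetical class counts (the empirical distribution P_hat)...
counts = {"a": 50, "b": 30, "c": 20}
n = sum(counts.values())
p_hat = {k: c / n for k, c in counts.items()}

# ...and a hypothetical categorical model q.
q = {"a": 0.6, "b": 0.3, "c": 0.1}

# Average negative log-likelihood of the data under q ...
avg_nll = -sum(counts[k] * math.log(q[k]) for k in counts) / n

# ... equals H(p_hat) + KL(p_hat || q), the cross-entropy.
entropy = -sum(p * math.log(p) for p in p_hat.values())
kl = sum(p * math.log(p / q[k]) for k, p in p_hat.items())
print(avg_nll, entropy + kl)   # identical; minimizing NLL over q minimizes the KL term
```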

Canonical Examples

Example

MLE for Gaussian mean

Let $x_1, \ldots, x_n \sim \mathcal{N}(\mu, \sigma^2)$ with known $\sigma^2$. The log-likelihood is $\ell(\mu) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_i (x_i - \mu)^2$.

Setting $\ell'(\mu) = 0$: $\hat{\mu} = \bar{x}$ (the sample mean).

Fisher information: $I(\mu) = 1/\sigma^2$.

Asymptotic variance: $1/(nI(\mu)) = \sigma^2/n$.

The sample mean is exactly (not just asymptotically) efficient for the Gaussian mean: it achieves the Cramér-Rao bound with equality for all $n$.

Example

MLE for Bernoulli parameter

Let $x_1, \ldots, x_n \sim \text{Bernoulli}(p)$. The log-likelihood is $\ell(p) = k\log p + (n-k)\log(1-p)$, where $k = \sum_i x_i$.

Setting $\ell'(p) = 0$: $\hat{p} = k/n$ (the sample proportion).

Fisher information: $I(p) = 1/(p(1-p))$.

Asymptotic variance: $p(1-p)/n$, which matches $\text{Var}(\hat{p}) = p(1-p)/n$ exactly. The sample proportion is efficient for the Bernoulli parameter.

Example

MLE for Gaussian variance

Let $x_1, \ldots, x_n \sim \mathcal{N}(\mu, \sigma^2)$ with known $\mu = 0$. The MLE is $\hat{\sigma}^2 = \frac{1}{n}\sum_i x_i^2$.

With $\mu$ known, $\mathbb{E}[\hat{\sigma}^2] = \sigma^2$, so this estimator is unbiased. If $\mu$ is unknown, the MLE is $\hat{\sigma}^2 = \frac{1}{n}\sum_i (x_i - \bar{x})^2$, which has $\mathbb{E}[\hat{\sigma}^2] = \frac{n-1}{n}\sigma^2$: it is biased downward. The unbiased version divides by $n-1$. This is one of the simplest cases where the MLE is biased.
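The downward bias is easy to reproduce: with $\sigma^2 = 4$ and $n = 5$, the MLE of the variance averages to $\frac{n-1}{n}\sigma^2 = 3.2$ rather than 4 (the seed and constants are illustrative):

```python
import random

random.seed(6)

sigma, n, trials = 2.0, 5, 200_000   # true variance sigma^2 = 4

total = 0.0
for _ in range(trials):
    xs = [random.gauss(0.0, sigma) for _ in range(n)]
    xbar = sum(xs) / n
    total += sum((x - xbar) ** 2 for x in xs) / n   # the MLE of the variance

print(total / trials)   # near (n-1)/n * sigma^2 = 3.2, not 4.0
```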

Common Confusions

Watch Out

The likelihood is not a probability distribution over theta

$L(\theta)$ is a function of $\theta$, not a density. It does not integrate to 1 over $\theta$ and cannot be interpreted as a posterior probability. The Bayesian approach multiplies $L(\theta)$ by a prior $\pi(\theta)$ and normalizes to get the posterior $p(\theta | x) \propto L(\theta)\pi(\theta)$. The MLE is the mode of the likelihood, not the mode of any posterior (unless the prior is flat).

Watch Out

MLE exists and is unique only under regularity conditions

The MLE may not exist (e.g., in a mixture model where a component can collapse to a single data point, giving infinite likelihood). It may not be unique (e.g., multimodal likelihood). Existence and uniqueness require conditions on the model: typically, concavity of the log-likelihood (exponential families) or compactness of $\Theta$.

Watch Out

Consistency requires the model to be well-specified

If the true distribution $P^*$ does not belong to the model family $\{p_\theta\}$, the MLE converges to the parameter $\theta^*$ that minimizes $D_{\text{KL}}(P^* \| p_\theta)$: the closest model to the truth in KL divergence. This is still useful (it gives the best approximation within the model), but the asymptotic variance formula changes: you need the "sandwich" variance $A(\theta^*)^{-1} B(\theta^*) A(\theta^*)^{-1}$ instead of $I(\theta^*)^{-1}$, where $A(\theta) = -\mathbb{E}_{P^*}[\nabla^2 \log p_\theta(X)]$ is the expected Hessian and $B(\theta) = \mathbb{E}_{P^*}[s(\theta; X) s(\theta; X)^\top]$ is the outer product of scores, both under the true distribution $P^*$. Under correct specification $A = B = I(\theta^*)$ and the sandwich collapses to $I(\theta^*)^{-1}$. The letter $J$ is reserved above for the data-dependent observed Fisher information, distinct from these population quantities. See White (1982, Econometrica).
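A scalar sketch of the sandwich idea, under an assumed setup: fit the (misspecified) $\mathcal{N}(\mu, 1)$ model, whose score is $x - \mu$ and expected Hessian gives $A = 1$, to uniform data with true variance 3. The sandwich $A^{-1} B A^{-1}$ recovers the correct variance of $\sqrt{n}(\hat{\mu} - \mu^*)$, while the naive $I^{-1} = 1$ would understate it.

```python
import random

random.seed(7)

# Uniform(-3, 3) data, so the true variance is 36/12 = 3; the fitted model
# N(mu, 1) is misspecified. Its MLE for mu is still the sample mean.
n = 5_000
xs = [random.uniform(-3.0, 3.0) for _ in range(n)]
mu_hat = sum(xs) / n

A = 1.0                                        # -E[l''] for the N(mu, 1) model
B = sum((x - mu_hat) ** 2 for x in xs) / n     # E[s^2], score s = x - mu
sandwich = B / (A * A)                         # A^{-1} B A^{-1}

# Sandwich estimate of the variance of sqrt(n)(mu_hat - mu*): near Var(X) = 3.
# The naive I^{-1} = 1 would badly understate the uncertainty.
print(sandwich)
```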

Summary

  • MLE maximizes $L(\theta) = \prod_i p(x_i | \theta)$, or equivalently minimizes $-\sum_i \log p(x_i | \theta)$
  • Score function $s(\theta; x) = \nabla_\theta \log p(x | \theta)$ has mean zero at the truth
  • Fisher information $I(\theta) = \text{Var}(s) = -\mathbb{E}[\nabla^2 \log p]$ measures data informativeness
  • Consistency: $\hat{\theta}_{\text{MLE}} \to \theta^*$ in probability
  • Asymptotic normality: $\sqrt{n}(\hat{\theta} - \theta^*) \to \mathcal{N}(0, I^{-1})$
  • Cramér-Rao: $\text{Var}(T) \geq 1/(nI(\theta))$ for unbiased $T$
  • MLE is asymptotically efficient (achieves Cramér-Rao)
  • MLE = empirical KL minimization = ERM with log-loss
  • James-Stein: MLE is inadmissible in $d \geq 3$; shrinkage can uniformly improve it
  • In practice: MLE is biased in finite samples; regularization often helps

Exercises

ExerciseCore

Problem

Compute the MLE and Fisher information for the exponential distribution: $p(x | \lambda) = \lambda e^{-\lambda x}$ for $x > 0$.

ExerciseCore

Problem

Show that for the Bernoulli model with $n = 10$ observations and $k = 0$ successes, the MLE is $\hat{p} = 0$. Explain why this is problematic and how a Bayesian approach with a Beta(1,1) prior would give a different answer.

ExerciseAdvanced

Problem

Prove the Cramér-Rao bound for the scalar case. Let $T(X_1, \ldots, X_n)$ be an unbiased estimator of $\theta$ under the model $p(x | \theta)$. Show that $\text{Var}(T) \geq 1/(nI(\theta))$ using the Cauchy-Schwarz inequality.


References

Canonical:

  • Lehmann & Casella, Theory of Point Estimation (2nd ed., 1998), Chapter 6 (regularity conditions, asymptotic normality) and Chapters 5, 7
  • van der Vaart, Asymptotic Statistics (1998), Chapter 5 (MLE and regularity conditions) and Chapter 7
  • Casella & Berger, Statistical Inference (2nd ed., 2002), Chapters 7 and 10

Current:

  • Wasserman, All of Statistics (2004), Chapters 9-10
  • Murphy, Machine Learning: A Probabilistic Perspective (2012), Chapter 8

Next Topics

Building on MLE theory:

  • Hypothesis testing for ML: using the MLE and Fisher information for statistical tests (Wald, likelihood ratio, score tests)
  • EM algorithm: MLE for latent variable models when direct maximization is intractable
  • Empirical risk minimization: the learning-theoretic generalization of MLE to arbitrary loss functions

Last reviewed: April 2026
