
Statistical Estimation

Likelihood-Ratio, Wald, and Score Tests

The three asymptotic tests built from the likelihood: the LRT compares maximized likelihoods, the Wald test compares the MLE to the null value using the inverse Fisher information, and the score test uses the gradient of the log-likelihood at the null. All three are asymptotically Chi-squared and equivalent under regularity; they disagree in finite samples, and the choice among them depends on the problem.


Why This Matters

Three asymptotic tests dominate parametric hypothesis testing: the likelihood-ratio test (LRT, Wilks), the Wald test, and the score test (Rao). All three are derived from the likelihood, all three are asymptotically Chi-squared under regularity, and all three are asymptotically equivalent under the null. They differ in finite samples and in which numerical quantity they require: maximizing the likelihood under the null and the alternative (LRT), under the alternative only (Wald), or under the null only (score).

The three statistics test the same hypothesis using different parts of the log-likelihood surface. The LRT compares the heights of the surface at the constrained and unconstrained maxima. The Wald test measures the horizontal distance from the unconstrained maximum to the null point, weighted by the local curvature. The score test measures the slope of the surface at the null point, weighted by the same curvature. Far from the null, the three statistics often disagree noticeably; close to the null they agree to leading order.

The choice among them is computational and structural:

  1. Use the LRT when you can fit both the null and the alternative model and want the most "correct" test in the small-sample regime. The LRT is invariant under reparameterization, which is its main theoretical advantage.
  2. Use the Wald test when you have the unrestricted MLE and its standard error. This is what most regression-output tables produce automatically.
  3. Use the score test when fitting the alternative is expensive but fitting the null is cheap. The score test is computed entirely from the null fit and is invariant under reparameterizations of the parameters being tested (but not under reparameterizations of nuisance parameters).

This page derives each, states the asymptotic Chi-squared distribution, and explains the equivalence.

Setup

Let $\theta\in\Theta\subset\mathbb{R}^p$ index a parametric family $p(\cdot;\theta)$. Partition $\theta = (\psi, \eta)$, where $\psi\in\mathbb{R}^q$ is the parameter of interest and $\eta\in\mathbb{R}^{p-q}$ is a nuisance parameter. Test
$$H_0: \psi = \psi_0 \qquad\text{versus}\qquad H_1: \psi\ne\psi_0.$$
Let $\ell_n(\theta) = \sum_{i=1}^n\log p(X_i;\theta)$ be the log-likelihood for an i.i.d. sample.

Three objects recur. The unrestricted MLE $\hat\theta_n = \arg\max_\theta\ell_n(\theta) = (\hat\psi_n, \hat\eta_n)$. The constrained MLE under $H_0$, $\tilde\theta_n = (\psi_0, \tilde\eta_n)$, where $\tilde\eta_n = \arg\max_\eta\ell_n(\psi_0, \eta)$. The score vector $U_n(\theta) = \partial\ell_n/\partial\theta$ and the observed Fisher information matrix $I_n(\theta) = -\partial^2\ell_n/\partial\theta\,\partial\theta^\top$ (or its expectation, the expected information).
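To make the setup concrete, here is a minimal Python sketch of these objects for a two-parameter Normal model with $\theta = (\mu, \sigma^2)$, testing $H_0: \mu = \mu_0$ with $\sigma^2$ as nuisance. The function names and data are illustrative, not from any particular library.

```python
import numpy as np

# Sketch: the setup objects for X_i ~ Normal(mu, s2), theta = (mu, s2),
# testing H0: mu = mu0 with s2 as a nuisance parameter.

def loglik(theta, x):
    mu, s2 = theta
    return -0.5 * len(x) * np.log(2 * np.pi * s2) - np.sum((x - mu) ** 2) / (2 * s2)

def score(theta, x):
    # U_n(theta): gradient of the log-likelihood in (mu, s2)
    mu, s2 = theta
    return np.array([
        np.sum(x - mu) / s2,
        -len(x) / (2 * s2) + np.sum((x - mu) ** 2) / (2 * s2 ** 2),
    ])

def observed_info(theta, x):
    # I_n(theta): minus the Hessian of the log-likelihood
    mu, s2 = theta
    n, S = len(x), np.sum((x - mu) ** 2)
    return np.array([
        [n / s2, np.sum(x - mu) / s2 ** 2],
        [np.sum(x - mu) / s2 ** 2, S / s2 ** 3 - n / (2 * s2 ** 2)],
    ])

def mles(x, mu0):
    # Unrestricted MLE (mu_hat, s2_hat) and constrained MLE (mu0, s2_tilde)
    mu_hat = x.mean()
    return (np.array([mu_hat, np.mean((x - mu_hat) ** 2)]),
            np.array([mu0, np.mean((x - mu0) ** 2)]))

x = np.random.default_rng(0).normal(1.0, 2.0, size=100)
theta_hat, theta_tilde = mles(x, mu0=0.0)
print(loglik(theta_hat, x) - loglik(theta_tilde, x))  # height drop the LRT doubles
print(score(theta_tilde, x))  # zero in the s2-direction, nonzero in mu
```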

The Likelihood-Ratio Test

Theorem

Wilks's Likelihood-Ratio Asymptotics

Statement

The likelihood-ratio statistic
$$\Lambda_n = -2\log\frac{\sup_{\theta\in\Theta_0}L_n(\theta)}{\sup_{\theta\in\Theta}L_n(\theta)} = 2\,[\ell_n(\hat\theta_n) - \ell_n(\tilde\theta_n)]$$
satisfies $\Lambda_n\to_d\chi^2_q$ under $H_0$ as $n\to\infty$, where $q$ is the number of restrictions imposed by $H_0$.

Intuition

$\Lambda_n$ is twice the height difference between the unconstrained and constrained maxima of the log-likelihood. Under the null, the constraint is true and the height drop is small; expand the log-likelihood to second order around the unrestricted MLE and the drop is a quadratic form in the constraint, asymptotically Chi-squared.

Proof Sketch

Taylor expand $\ell_n(\theta)$ around $\hat\theta_n$ to second order:
$$\ell_n(\theta)\approx\ell_n(\hat\theta_n) - \tfrac12\, n\,(\theta - \hat\theta_n)^\top I(\theta_0)\,(\theta - \hat\theta_n),$$
where $I(\theta_0)$ is the per-observation Fisher information. Evaluating at $\theta = \tilde\theta_n$ and noting that the constrained MLE differs from the unconstrained MLE by an $O_p(n^{-1/2})$ correction, the height drop takes the form
$$n\,(\hat\psi_n - \psi_0)^\top\bigl[I(\theta_0)^{-1}\bigr]_{\psi\psi}^{-1}(\hat\psi_n - \psi_0)$$
up to $O_p(n^{-1})$ errors. By MLE asymptotic normality, $\sqrt n\,(\hat\psi_n - \psi_0)\to_d\mathcal{N}(0, [I^{-1}]_{\psi\psi})$, so the quadratic form converges to a $\chi^2_q$.
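A quick Monte Carlo check of the theorem, under the same Normal model as the setup sketch (illustrative, not part of the formal content): plugging both MLEs into the Normal log-likelihood gives the closed form $\Lambda_n = n\log(\tilde\sigma^2/\hat\sigma^2)$.

```python
import numpy as np
from scipy import stats

# Monte Carlo check of Wilks's theorem: Normal(mu, s2) data,
# H0: mu = 0 with s2 a nuisance parameter, so q = 1.
rng = np.random.default_rng(0)
n, reps = 50, 20_000
x = rng.normal(loc=0.0, scale=2.0, size=(reps, n))  # H0 is true

s2_hat = x.var(axis=1)               # unrestricted MLE of s2 (divides by n)
s2_tilde = (x ** 2).mean(axis=1)     # constrained MLE of s2 at mu0 = 0
lam = n * np.log(s2_tilde / s2_hat)  # = 2 [l(theta_hat) - l(theta_tilde)]

crit = stats.chi2.ppf(0.95, df=1)
print("rejection rate at nominal 5%:", np.mean(lam > crit))  # close to 0.05
```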

Why It Matters

The LRT is the natural test from the Neyman-Pearson perspective: it generalizes the simple-versus-simple most-powerful test to composite hypotheses. Under regularity, Wilks's theorem guarantees the asymptotic Chi-squared reference, and the test is asymptotically locally most powerful in many parametric models. The factor $-2$ exists so that the leading-order quadratic form has the natural Chi-squared scaling.

Failure Mode

Wilks's theorem fails at parameter-space boundaries, where the asymptotic distribution becomes a mixture of Chi-squareds with different degrees of freedom (the Chernoff (1954) phenomenon); see Self and Liang (1987) for the general boundary theory. It also fails for non-regular models (identifiability failures, non-differentiable likelihoods), where the LRT typically has a non-standard limiting distribution that must be derived case by case.

The Wald Test

Theorem

Wald Test Asymptotics

Statement

The Wald statistic is
$$W_n = n\,(\hat\psi_n - \psi_0)^\top \hat\Sigma_n^{-1}(\hat\psi_n - \psi_0),$$
where $\hat\Sigma_n$ is a consistent estimator of the asymptotic covariance of $\sqrt n\,\hat\psi_n$ (typically the $\psi\psi$ block of the inverse observed or expected Fisher information at the unrestricted MLE). Under $H_0$, $W_n\to_d\chi^2_q$.

Intuition

$W_n$ standardizes the unrestricted MLE $\hat\psi_n$ by its estimated standard error and squares the result. It is the multivariate $z^2$ statistic: how many standard deviations is the MLE from the null value, squared and summed across components. Under the null and in large samples, the standardized MLE is approximately standard Normal, so its squared norm is approximately Chi-squared.

Proof Sketch

MLE asymptotic normality gives $\sqrt n\,(\hat\psi_n - \psi)\to_d\mathcal{N}(0, \Sigma)$ with $\Sigma = [I(\theta_0)^{-1}]_{\psi\psi}$. Under the null, $\psi = \psi_0$, so $\sqrt n\,(\hat\psi_n - \psi_0)\to_d\mathcal{N}(0, \Sigma)$ and the standardized statistic $n\,(\hat\psi_n - \psi_0)^\top\Sigma^{-1}(\hat\psi_n - \psi_0)\to_d\chi^2_q$. Replacing $\Sigma$ by a consistent estimator $\hat\Sigma_n$ preserves the limit by Slutsky's theorem.
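The recipe in code, as a sketch with illustrative names. Here `cov_hat` is the estimated covariance of $\hat\psi$ itself (i.e., $\hat\Sigma_n/n$), so the factor $n$ in the displayed formula is already absorbed.

```python
import numpy as np
from scipy import stats

def wald_test(psi_hat, psi0, cov_hat):
    """Wald statistic and p-value for H0: psi = psi0.

    cov_hat is the estimated covariance of psi_hat itself (Sigma_hat / n).
    """
    diff = np.atleast_1d(np.asarray(psi_hat, dtype=float) - psi0)
    W = float(diff @ np.linalg.solve(np.atleast_2d(cov_hat), diff))
    return W, stats.chi2.sf(W, df=diff.size)

# Scalar example: Normal mean, variance estimated at the MLE
x = np.random.default_rng(1).normal(0.3, 1.0, size=200)
print(wald_test(x.mean(), 0.0, x.var() / len(x)))
```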

Why It Matters

Most regression software (R's summary.lm, Python's statsmodels, Stata) reports Wald-style test statistics by default, because all that is needed is the MLE and its estimated covariance. The $p$-values associated with individual regression coefficients are Wald tests of each coefficient against zero. The Wald test is also the basis of the standard confidence interval $\hat\psi_n \pm 1.96\cdot\widehat{\mathrm{SE}}$ in the scalar case.

Failure Mode

The Wald test is not invariant under reparameterization. If you transform $\psi$ by a smooth nonlinear map $g$, the Wald statistic for $H_0: g(\psi) = g(\psi_0)$ can give a different $p$-value than the Wald statistic for $H_0: \psi = \psi_0$, even though the hypotheses are equivalent. A separate pathology is the Hauck-Donner phenomenon: when the unrestricted MLE is far from the null, the estimated standard error can grow faster than the estimate itself, so the Wald statistic shrinks toward zero, paradoxically reducing rejection. The LRT has neither problem.

The Score Test

Theorem

Score (Rao) Test Asymptotics

Statement

The score statistic is
$$S_n = U_n(\tilde\theta_n)^\top I_n(\tilde\theta_n)^{-1}\,U_n(\tilde\theta_n),$$
where $U_n(\tilde\theta_n)$ is the score vector evaluated at the constrained MLE and $I_n(\tilde\theta_n)$ is the total observed (or expected) Fisher information at the constrained MLE. Under $H_0$, $S_n\to_d\chi^2_q$.

Equivalently, $S_n = n\,\bar U^\top I(\tilde\theta_n)^{-1}\bar U$, where $\bar U = (1/n)\,U_n(\tilde\theta_n)$ is the average score per observation under the null and $I$ is the per-observation information.

Intuition

The score is the gradient of the log-likelihood. At the constrained MLE, the components of the score in the nuisance ($\eta$) directions are zero by construction; only the components in the constrained ($\psi$) directions can be nonzero. The score test asks: is the gradient in the directions we constrained meaningfully nonzero? If yes, the null point is not a good fit; reject. The statistic is the squared norm of the relevant score components, weighted by the inverse Fisher information.

Proof Sketch

By the multivariate central limit theorem, $(1/\sqrt n)\,U_n(\theta_0)\to_d\mathcal{N}(0, I(\theta_0))$ under $H_0$. The score at the constrained MLE is asymptotically equivalent to the score at $\theta_0$ up to terms of order $n^{-1/2}$ (the constrained MLE moves with the data). Standardizing by $I(\theta_0)^{-1}$ gives a quadratic form whose limit is $\chi^2_q$, where the degrees of freedom are the number of restrictions imposed (the dimension of $\psi$).
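A concrete instance, as a sketch: the score test for a Poisson mean with no nuisance parameter, where the constrained "fit" is just the hypothesized value and everything is closed-form. The numbers are illustrative.

```python
import numpy as np
from scipy import stats

def poisson_score_test(x, lam0):
    # Score test for H0: lam = lam0 in a Poisson model (no nuisance parameter,
    # so the constrained fit is lam0 itself).
    n = len(x)
    U = (x.sum() - n * lam0) / lam0   # score at the null
    info = n / lam0                   # expected Fisher information I_n(lam0)
    S = U ** 2 / info                 # = n (xbar - lam0)^2 / lam0
    return S, stats.chi2.sf(S, df=1)

x = np.random.default_rng(2).poisson(lam=3.4, size=80)
print(poisson_score_test(x, lam0=3.0))
```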

Why It Matters

The score test is computed entirely from the null fit; no alternative-model fit is required. This makes it the test of choice when the alternative is expensive (e.g., when testing whether a regression coefficient should be added: fitting the simpler model is cheap, and the score test for adding a variable uses only that fit). The score test is also the basis of the Pearson Chi-squared test: the Pearson statistic is the score test for goodness of fit against the unrestricted multinomial.

Failure Mode

The score test is invariant under reparameterization of the parameter being tested, but not under reparameterization of the nuisance parameter. It can have low power against alternatives far from the null, since it uses only the local slope of the likelihood at the null point. The asymptotic Chi-squared approximation is also less accurate in small samples than the LRT for many models; the score test can have inflated or deflated Type I error for moderate $n$. Adjustments (Edgeworth expansions, Bartlett-type corrections) exist but are model-specific.

Asymptotic Equivalence

Theorem

Asymptotic Equivalence of the Trinity

Statement

Under a regular parametric model and contiguous local alternatives $H_n: \psi = \psi_0 + h/\sqrt n$, the three statistics $\Lambda_n$, $W_n$, $S_n$ all converge in distribution to the same noncentral Chi-squared $\chi^2_q\bigl(h^\top[I^{-1}]_{\psi\psi}^{-1}h\bigr)$. Under the null ($h = 0$), the common limit is the central $\chi^2_q$. The three statistics agree to first order under the null: $|\Lambda_n - W_n| = O_p(n^{-1/2})$ and $|\Lambda_n - S_n| = O_p(n^{-1/2})$.

Intuition

All three are second-order approximations to the same local log-likelihood quadratic form. They differ in which side of the maximum they expand from (LRT compares both sides; Wald expands from the unrestricted MLE; score expands from the constrained MLE), but the second-order term is the same to leading order, so the asymptotic distributions agree.

Proof Sketch

Taylor expand the log-likelihood around $\theta_0$ to second order. The likelihood-ratio statistic is the height drop between the unconstrained and constrained maxima, a quadratic form in $\hat\psi_n - \psi_0$ weighted by the appropriate Fisher information block. The Wald statistic is the same quadratic form with the covariance replaced by a consistent estimate (no asymptotic change). The score statistic is the quadratic form in the gradient at the constrained MLE; by a Taylor expansion, that gradient equals the (zero) gradient at the unrestricted MLE plus an information-weighted multiple of $\hat\theta_n - \tilde\theta_n$, so the score statistic reduces to the same quadratic form up to lower-order corrections.
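A small simulation illustrating the first-order agreement, as a sketch under an assumed Poisson model where all three statistics are closed-form: the median gaps $|\Lambda_n - W_n|$ and $|\Lambda_n - S_n|$ shrink roughly like $n^{-1/2}$.

```python
import numpy as np

# Finite-sample disagreement of the trinity shrinks with n.
# Poisson model with H0: lam = lam0 true.
rng = np.random.default_rng(3)
lam0 = 2.0
for n in (25, 100, 400, 1600):
    xbar = rng.poisson(lam0, size=(5000, n)).mean(axis=1)
    lrt = 2 * n * (xbar * np.log(xbar / lam0) - xbar + lam0)
    wald = n * (xbar - lam0) ** 2 / xbar    # variance estimated at the MLE
    score = n * (xbar - lam0) ** 2 / lam0   # variance evaluated at the null
    print(n, np.median(np.abs(lrt - wald)), np.median(np.abs(lrt - score)))
```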

Why It Matters

The equivalence is what justifies any of the three as "the" asymptotic test for the same null. In practice, choose by computational convenience. For small samples, the agreement breaks down: the LRT typically has the best small-sample size properties, and the Wald test has the worst (the Hauck-Donner effect) when the true parameter is far from the null. Empirical studies (Mantel-Haenszel settings for binary data, mixed-effects models, generalized linear models) consistently favor the LRT or score test over Wald for moderate $n$.

Failure Mode

The equivalence requires regularity. At boundary nulls (e.g., testing whether the variance of a random effect is zero), the trinity diverges: the LRT has a mixture-of-Chi-squareds limit, the Wald statistic's normal approximation breaks down because the MLE piles up on the boundary, and the score test has a one-sided limit. None of the standard $p$-values is correct in that case.

Comparison Table

| Aspect | LRT (Wilks) | Wald | Score (Rao) |
| --- | --- | --- | --- |
| Quantity used | Both null and alternative fits | Unrestricted MLE + standard error | Null fit only |
| Computed from | $\ell(\hat\theta) - \ell(\tilde\theta)$ | $\hat\psi - \psi_0$ and $\hat\Sigma$ | $U(\tilde\theta)$ and $I(\tilde\theta)$ |
| Reparameterization-invariant | Yes | No | For parameter of interest only |
| Small-sample size accuracy | Best of the three (typically) | Worst when the MLE is far from the null (Hauck-Donner) | Middle |
| Asymptotic null distribution | $\chi^2_q$ | $\chi^2_q$ | $\chi^2_q$ |
| Local power | All three equivalent | All three equivalent | All three equivalent |
| Preferred when | Both fits are cheap | You already have the MLE | Only the null fit is available |

For a scalar null hypothesis ($q = 1$), the signed LRT statistic $\operatorname{sign}(\hat\psi - \psi_0)\sqrt{\Lambda_n}$ is approximately standard Normal. The Wald and score tests in scalar form are $z$-tests; their squares are the Chi-squared statistics in the table.

Worked Example: Bernoulli

Let $X_1,\dots,X_n\sim\operatorname{Bern}(p)$ and test $H_0: p = p_0$ versus $H_1: p\ne p_0$. Let $\hat p = \bar X_n$.

  • LRT. The unrestricted log-likelihood at $\hat p$ is $n[\hat p\log\hat p + (1-\hat p)\log(1-\hat p)]$. The constrained log-likelihood at $p_0$ is $n[\hat p\log p_0 + (1-\hat p)\log(1-p_0)]$. The LRT statistic is
$$\Lambda_n = 2n\left[\hat p\log\frac{\hat p}{p_0} + (1-\hat p)\log\frac{1-\hat p}{1-p_0}\right].$$
  • Wald. $\hat\sigma_W^2 = \hat p(1-\hat p)/n$, so $W_n = n(\hat p - p_0)^2/[\hat p(1-\hat p)]$.
  • Score. The score per observation at $p = p_0$ is $(X_i - p_0)/[p_0(1 - p_0)]$, so $U_n(p_0)/n = (\hat p - p_0)/[p_0(1-p_0)]$. The Fisher information per observation at $p_0$ is $1/[p_0(1-p_0)]$. The score statistic is $S_n = n(\hat p - p_0)^2/[p_0(1-p_0)]$.

The Wald and score statistics differ only in the denominator: Wald uses $\hat p(1-\hat p)$ (the estimated variance under the alternative) and score uses $p_0(1-p_0)$ (the variance under the null). The two are asymptotically equivalent. The LRT uses the full Kullback-Leibler divergence between $\hat p$ and $p_0$, which agrees with the others to second order; the sketch below compares the three numerically.
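A minimal numerical comparison (the inputs $n = 200$, $\hat p = 0.6$, $p_0 = 0.5$ are illustrative, not from the text):

```python
import numpy as np
from scipy import stats

def bernoulli_trinity(n, p_hat, p0):
    lrt = 2 * n * (p_hat * np.log(p_hat / p0)
                   + (1 - p_hat) * np.log((1 - p_hat) / (1 - p0)))
    wald = n * (p_hat - p0) ** 2 / (p_hat * (1 - p_hat))
    score = n * (p_hat - p0) ** 2 / (p0 * (1 - p0))
    return lrt, wald, score

for name, stat in zip(("LRT", "Wald", "score"), bernoulli_trinity(200, 0.60, 0.5)):
    print(f"{name}: {stat:.3f}, p = {stats.chi2.sf(stat, df=1):.4f}")
```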

For a 95% confidence interval for $p$, the Wald interval is $\hat p\pm 1.96\sqrt{\hat p(1-\hat p)/n}$, which can extend below 0 or above 1 for $\hat p$ near the boundary. The score (Wilson) interval, obtained by inverting the score test, is
$$\frac{\hat p + z^2/(2n)\pm z\sqrt{\hat p(1-\hat p)/n + z^2/(4n^2)}}{1 + z^2/n},$$
which always lies in $[0, 1]$ and has better small-sample coverage. The Wilson interval is the recommended default.
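The two intervals side by side: with $\hat p$ near the boundary, the Wald interval escapes $[0, 1]$ while the Wilson interval does not (the values $\hat p = 0.04$, $n = 25$ are illustrative).

```python
import numpy as np

def wald_interval(p_hat, n, z=1.96):
    half = z * np.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half, p_hat + half

def wilson_interval(p_hat, n, z=1.96):
    center = (p_hat + z ** 2 / (2 * n)) / (1 + z ** 2 / n)
    half = (z * np.sqrt(p_hat * (1 - p_hat) / n + z ** 2 / (4 * n ** 2))
            / (1 + z ** 2 / n))
    return center - half, center + half

print(wald_interval(0.04, 25))    # lower endpoint falls below 0
print(wilson_interval(0.04, 25))  # stays inside [0, 1]
```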

Common Confusions

Watch Out

Wald p-values can be misleading when the MLE is far from the null

For models with bounded parameter spaces (logistic regression with fitted probabilities near 0 or 1, variance components near 0), the Wald statistic can paradoxically decrease as the estimated effect becomes more extreme. The Hauck-Donner effect appears in logistic regression with large effect sizes and is a known reason to prefer LRT-based or profile-likelihood-based tests in those settings.
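A minimal sketch of the effect in a two-arm binomial comparison, where the closed-form log-odds-ratio MLE and its standard error stand in for a full logistic fit (all numbers are illustrative): as the second arm's success count grows toward its maximum, the Wald $z$ first rises and then falls, while the LRT keeps increasing.

```python
import numpy as np

# Two-arm binomial (50 per arm, arm A fixed at 25/50). beta is the log odds
# ratio; its Wald SE is sqrt(1/a + 1/b + 1/c + 1/d) from the 2x2 table.
def binom_ll(y, n, p):
    return y * np.log(p) + (n - y) * np.log(1 - p)

nA = nB = 50
yA = 25
for s in (35, 40, 45, 47, 48, 49):
    beta = np.log((s / (nB - s)) / (yA / (nA - yA)))
    se = np.sqrt(1 / s + 1 / (nB - s) + 1 / yA + 1 / (nA - yA))
    pooled = (yA + s) / (nA + nB)
    lrt = 2 * (binom_ll(yA, nA, yA / nA) + binom_ll(s, nB, s / nB)
               - binom_ll(yA, nA, pooled) - binom_ll(s, nB, pooled))
    print(f"s={s}: Wald z = {beta / se:.2f}, LRT = {lrt:.1f}")
```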

Watch Out

The LRT factor 2 is the Wilks scaling, not a heuristic

$-2\log\Lambda$ has a $\chi^2_q$ limit; $-\log\Lambda$ does not. The factor of 2 falls out of the second-order Taylor expansion: the height drop in log-likelihood is half the squared distance in the standardized parameterization, so doubling recovers the squared-distance scaling that matches the Chi-squared.

Watch Out

Score test uses the null fit's Fisher information

The score test evaluates both the gradient and the Fisher information at the constrained MLE under the null, not at the unrestricted MLE. The Lagrange-multiplier test of econometrics is the same procedure under a different name; variants that evaluate the information at a different consistent estimate are asymptotically equivalent but numerically distinct.

Watch Out

Pearson Chi-squared is a score test

The Pearson $\chi^2 = \sum(O - E)^2/E$ for multinomial data is exactly the score statistic for testing the multinomial probabilities against the null specification. The likelihood-ratio version is the $G$-statistic $2\sum O\log(O/E)$. Both have $\chi^2_{k-1}$ as their asymptotic null distribution (for $k$ cells and a fully specified null).
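Both statistics are available in SciPy via scipy.stats.power_divergence; a minimal sketch with illustrative counts:

```python
import numpy as np
from scipy.stats import power_divergence

observed = np.array([18, 30, 24, 28])
expected = np.full(4, observed.sum() / 4)  # null: equal cell probabilities

# lambda_="pearson" gives sum((O - E)^2 / E); lambda_="log-likelihood"
# gives the G-statistic 2 * sum(O * log(O / E)). Both use df = k - 1 = 3.
print(power_divergence(observed, expected, lambda_="pearson"))
print(power_divergence(observed, expected, lambda_="log-likelihood"))
```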

Exercises

ExerciseCore

Problem

For a Bernoulli sample with $n = 100$ and $\hat p = 0.42$, compute the LRT, Wald, and score statistics for $H_0: p = 0.5$ at level 0.05.

ExerciseCore

Problem

For the Bernoulli example with $n = 100$ and $\hat p = 0.42$, construct three 95% confidence intervals: Wald, score (Wilson), and LRT-based (profile likelihood). Compare their widths and lower endpoints.

ExerciseAdvanced

Problem

Show that the score test for $H_0: \beta_j = 0$ in a Normal linear regression model is asymptotically equivalent to the (squared) $t$-test on $\hat\beta_j$.

ExerciseAdvanced

Problem

Construct a one-sided LRT for the variance of a Normal sample: $H_0: \sigma^2 = \sigma_0^2$ versus $H_1: \sigma^2 > \sigma_0^2$ with $\mu$ unknown. Derive the LRT statistic, identify its small-sample distribution under the null, and explain how the one-sided alternative changes the asymptotic reference distribution relative to the two-sided $\chi^2_1$.

References

Canonical:

  • Casella and Berger, Statistical Inference (2002), Chapter 8 (Section 8.2 on LRT, Section 8.3 on score tests; Chapter 10 on asymptotic results).
  • Lehmann and Romano, Testing Statistical Hypotheses (2005), Chapter 12 (asymptotic optimality of LRT and score tests).
  • Bickel and Doksum, Mathematical Statistics, Volume I (2015), Chapter 6.
  • van der Vaart, Asymptotic Statistics (1998), Chapter 16 (Wilks's theorem and the asymptotic equivalence of the trinity).

Foundational papers:

  • Wilks, "The large-sample distribution of the likelihood ratio for testing composite hypotheses" (Annals of Mathematical Statistics, 1938).
  • Wald, "Tests of statistical hypotheses concerning several parameters when the number of observations is large" (Transactions of the American Mathematical Society, 1943).
  • Rao, "Large sample tests of statistical hypotheses concerning several parameters with applications to problems of estimation" (Proceedings of the Cambridge Philosophical Society, 1948).

Finite-sample comparisons and pitfalls:

  • Hauck and Donner, "Wald's test as applied to hypotheses in logit analysis" (JASA, 1977), the Wald-test paradox.
  • Self and Liang, "Asymptotic properties of maximum likelihood estimators and likelihood ratio tests under nonstandard conditions" (JASA, 1987), boundary nulls.
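  • Chernoff, "On the distribution of the likelihood ratio" (Annals of Mathematical Statistics, 1954), boundary and one-sided LRT limits.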

Last reviewed: May 11, 2026
