Likelihood-Ratio, Wald, and Score Tests
The three asymptotic tests built from the likelihood: the LRT compares maximized likelihoods, the Wald test compares the MLE to the null value using the inverse Fisher information, and the score test uses the gradient of the log-likelihood at the null. All three are asymptotically Chi-squared and equivalent under regularity; they disagree in finite samples, and the choice depends on the problem.
Prerequisites
- Asymptotic Statistics: M-Estimators, Delta Method, LAN
- Fisher Information: Curvature, KL Geometry, and the Natural Gradient
- Maximum Likelihood Estimation: Theory, Information Identity, and Asymptotic Efficiency
- Chi-Squared Distribution and Tests
- Neyman-Pearson and Hypothesis Testing Theory
Why This Matters
Three asymptotic tests dominate parametric hypothesis testing: the likelihood-ratio test (LRT, Wilks), the Wald test, and the score test (Rao). All three are derived from the likelihood, all three are asymptotically Chi-squared under regularity, and all three are asymptotically equivalent under the null. They differ in finite samples and in which numerical quantity they require: maximizing the likelihood under the null and the alternative (LRT), under the alternative only (Wald), or under the null only (score).
The three statistics test the same hypothesis using different parts of the log-likelihood surface. The LRT compares the heights of the surface at the constrained and unconstrained maxima. The Wald test measures the horizontal distance from the unconstrained maximum to the null point, weighted by the local curvature. The score test measures the slope of the surface at the null point, weighted by the same curvature. Far from the null, the three statistics often disagree noticeably; close to the null they agree to leading order.
The choice among them is computational and structural:
- Use the LRT when you can fit both the null and the alternative model and want the most "correct" test in the small-sample regime. The LRT is invariant under reparameterization, which is its main theoretical advantage.
- Use the Wald test when you have the unrestricted MLE and its standard error. This is what most regression-output tables produce automatically.
- Use the score test when fitting the alternative is expensive but fitting the null is cheap. The score test is computed entirely from the null fit and is invariant under reparameterizations of the parameters being tested (but not under reparameterizations of nuisance parameters).
This page derives each, states the asymptotic Chi-squared distribution, and explains the equivalence.
Setup
Let $\theta \in \Theta \subseteq \mathbb{R}^p$ index a parametric family $\{f_\theta\}$. Partition $\theta = (\psi, \lambda)$, where $\psi \in \mathbb{R}^r$ is the parameter of interest and $\lambda$ is nuisance. Test $H_0: \psi = \psi_0$ against $H_1: \psi \neq \psi_0$. Let $\ell_n(\theta) = \sum_{i=1}^n \log f_\theta(X_i)$ be the log-likelihood for an i.i.d. sample $X_1, \dots, X_n$.
Three ingredients matter. The unrestricted MLE $\hat\theta = \arg\max_\theta \ell_n(\theta)$. The constrained MLE $\hat\theta_0 = (\psi_0, \hat\lambda_0)$ under $H_0$, where $\hat\lambda_0 = \arg\max_\lambda \ell_n(\psi_0, \lambda)$. The score vector $U_n(\theta) = \nabla_\theta \ell_n(\theta)$ and the observed Fisher information matrix $J_n(\theta) = -\nabla^2_\theta \ell_n(\theta)$ (or its expectation, the expected Fisher information $nI(\theta)$).
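A minimal sketch in Python of how the three statistics are assembled from these ingredients for a scalar parameter; the names (`trinity`, `loglik`, `score`, `fisher_info`) are illustrative, not a library API. The Normal-mean example is the degenerate case where the log-likelihood is exactly quadratic, so all three statistics coincide.

```python
# A minimal sketch of the trinity for a scalar parameter; the names
# (trinity, loglik, score, fisher_info) are illustrative, not a library API.
def trinity(theta_hat, theta0, loglik, score, fisher_info, n):
    """LRT, Wald, and score statistics for H0: theta = theta0 (scalar)."""
    lrt = 2.0 * (loglik(theta_hat) - loglik(theta0))              # Wilks
    wald = n * (theta_hat - theta0) ** 2 * fisher_info(theta_hat)  # Wald
    sc = score(theta0) ** 2 / (n * fisher_info(theta0))            # Rao
    return lrt, wald, sc

# Normal(mu, 1) mean: the log-likelihood is exactly quadratic in mu, so the
# three statistics coincide exactly -- the case the asymptotics approximate.
n, xbar, mu0 = 50, 0.31, 0.0
ll = lambda mu: -0.5 * n * (xbar - mu) ** 2    # up to a mu-free constant
sc = lambda mu: n * (xbar - mu)                # d ell / d mu
fi = lambda mu: 1.0                            # per-observation information
print(trinity(xbar, mu0, ll, sc, fi, n))       # three equal values: 4.805
```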
The Likelihood-Ratio Test
Wilks's Likelihood-Ratio Asymptotics
Statement
The likelihood-ratio statistic $\Lambda_n = 2\left[\ell_n(\hat\theta) - \ell_n(\hat\theta_0)\right]$ satisfies $\Lambda_n \xrightarrow{d} \chi^2_r$ under $H_0$ as $n \to \infty$, where $r = \dim(\psi)$ is the number of restrictions imposed by $H_0$.
Intuition
$\Lambda_n$ is twice the height difference between the unconstrained and constrained maxima of the log-likelihood. Under the null, the constraint is true and the height drop is small; expanding the log-likelihood to second order around the unrestricted MLE shows the drop is a quadratic form in the constrained directions, asymptotically Chi-squared.
Proof Sketch
Taylor expand $\ell_n$ around $\hat\theta$ to second order: since $\nabla \ell_n(\hat\theta) = 0$, $\ell_n(\theta) \approx \ell_n(\hat\theta) - \tfrac{n}{2}(\theta - \hat\theta)^\top I(\theta_0)(\theta - \hat\theta)$, where $I(\theta_0)$ is the per-observation Fisher information. Evaluating at $\theta = \hat\theta_0$ and noting that the constrained MLE differs from the unconstrained MLE by an $O_p(n^{-1/2})$ correction, the statistic takes the form $\Lambda_n = n(\hat\psi - \psi_0)^\top \tilde I_{\psi\psi}(\hat\psi - \psi_0) + o_p(1)$, where $\tilde I_{\psi\psi}$ is the inverse of the $\psi$-block of $I(\theta_0)^{-1}$. By MLE asymptotic normality, $\sqrt{n}(\hat\psi - \psi_0) \xrightarrow{d} N(0, \tilde I_{\psi\psi}^{-1})$, so the quadratic form converges to a $\chi^2_r$.
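A quick Monte Carlo check of Wilks's theorem, assuming Bernoulli(0.5) data under the null; the simulated 95th percentile of $\Lambda_n$ should land near the $\chi^2_1$ critical value 3.84.

```python
# Monte Carlo check of Wilks's theorem for Bernoulli data under H0: p = 0.5.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, p0, reps = 200, 0.5, 20_000
p_hat = rng.binomial(n, p0, size=reps) / n
p_hat = np.clip(p_hat, 1e-12, 1 - 1e-12)   # guard log(0) at the boundary
lrt = 2 * n * (p_hat * np.log(p_hat / p0)
               + (1 - p_hat) * np.log((1 - p_hat) / (1 - p0)))
print(np.quantile(lrt, 0.95), stats.chi2.ppf(0.95, df=1))  # both ~ 3.84
```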
Why It Matters
The LRT is the natural test from the Neyman-Pearson perspective: it generalizes the simple-versus-simple most-powerful test to composite hypotheses. Under regularity, Wilks's theorem guarantees the asymptotic Chi-squared reference, and the test is asymptotically locally most powerful in many parametric models. The factor of 2 exists so the leading-order quadratic form has the natural Chi-squared scaling.
Failure Mode
Wilks's theorem fails at parameter-space boundaries, where the asymptotic distribution becomes a mixture of Chi-squareds with different degrees of freedom (the Chernoff (1954) phenomenon); see Self and Liang (1987) for boundary nulls. It also fails when the model is otherwise non-regular (identifiability failures, non-differentiable likelihoods); in such problems the LRT typically has a non-standard limiting distribution that must be derived case by case.
The Wald Test
Wald Test Asymptotics
Statement
The Wald statistic is $W_n = (\hat\psi - \psi_0)^\top \widehat{\operatorname{Var}}(\hat\psi)^{-1} (\hat\psi - \psi_0)$, where $\widehat{\operatorname{Var}}(\hat\psi)$ is a consistent estimator of the covariance of $\hat\psi$ (typically the $\psi\psi$-block of the inverse observed or expected Fisher information at the unrestricted MLE). Under $H_0$, $W_n \xrightarrow{d} \chi^2_r$.
Intuition
$W_n$ standardizes the unrestricted MLE by its estimated standard error and squares the result. It is the multivariate $z$-statistic: how many standard deviations is the MLE from the null value, squared and summed across components. Under the null and large samples, the standardized MLE is approximately standard Normal, so its squared norm is approximately Chi-squared.
Proof Sketch
MLE asymptotic normality gives $\sqrt{n}(\hat\theta - \theta_0) \xrightarrow{d} N(0, I(\theta_0)^{-1})$, where $I(\theta_0)$ is the per-observation Fisher information. Under the null, $\psi = \psi_0$, so $\sqrt{n}(\hat\psi - \psi_0) \xrightarrow{d} N(0, V)$ with $V$ the $\psi\psi$-block of $I(\theta_0)^{-1}$. Then $n(\hat\psi - \psi_0)^\top V^{-1}(\hat\psi - \psi_0) \xrightarrow{d} \chi^2_r$. Replacing $V$ by a consistent estimator $\hat V$ preserves the limit by Slutsky's theorem.
Why It Matters
Most regression-output software (R's summary.lm, Python's statsmodels, Stata) reports Wald-style test statistics by default, because all that is needed is the MLE and its estimated covariance. The $p$-values associated with individual regression coefficients are Wald tests of each coefficient against zero. The Wald test is also the basis of the standard $\hat\theta \pm z_{\alpha/2}\,\widehat{\operatorname{se}}(\hat\theta)$ confidence interval in the scalar case.
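A short illustration, assuming statsmodels' `Logit` API and simulated data, that the z-statistics in standard regression output are Wald statistics, while the reported likelihood-ratio statistic is the Wilks LRT.

```python
# Sketch: standard logistic-regression output is Wald-based (statsmodels).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
x = rng.normal(size=500)
y = rng.binomial(1, 1 / (1 + np.exp(-(0.3 + 0.8 * x))))  # simulated data
X = sm.add_constant(x)
fit = sm.Logit(y, X).fit(disp=0)

wald_z = fit.params[1] / fit.bse[1]          # coefficient / standard error
print(wald_z, fit.tvalues[1])                # the reported "z" IS the Wald z
print(fit.llr, 2 * (fit.llf - fit.llnull))   # reported LR stat = Wilks's LRT
```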
Failure Mode
The Wald test is not invariant under reparameterization. If you transform by a smooth nonlinear map $g$, the Wald statistic for $H_0: g(\psi) = g(\psi_0)$ can give a different $p$-value than the Wald statistic for $H_0: \psi = \psi_0$, even though the hypotheses are equivalent. A separate pathology is the Hauck-Donner phenomenon: when the unrestricted MLE is far from the null, the estimated standard error can grow faster than the estimate itself, so the Wald statistic shrinks toward zero, paradoxically reducing rejection. The LRT has neither problem.
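A minimal numeric illustration of the Hauck-Donner effect, using a hypothetical two-group binomial design (50 per group) rather than a full logistic fit: the Wald z for the log odds ratio is non-monotone in the observed effect size.

```python
# Hauck-Donner sketch: as the observed effect grows, the Wald z can FALL
# even though the evidence against the null grows (hypothetical counts).
import numpy as np

n = 50                     # per group; group 0 fixed at 25/50 successes
for s1 in (35, 40, 45, 48, 49):
    beta = np.log((s1 / (n - s1)) / (25 / 25))        # log odds ratio
    se = np.sqrt(1 / s1 + 1 / (n - s1) + 1 / 25 + 1 / 25)
    print(s1, round(beta, 2), round(beta / se, 2))    # z is non-monotone
```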
The Score Test
Score (Rao) Test Asymptotics
Statement
The score statistic is $S_n = U_n(\hat\theta_0)^\top J_n(\hat\theta_0)^{-1} U_n(\hat\theta_0)$, where $U_n(\hat\theta_0) = \nabla_\theta \ell_n(\hat\theta_0)$ is the score vector evaluated at the constrained MLE and $J_n(\hat\theta_0)$ is the observed (or expected) Fisher information at the constrained MLE. Under $H_0$, $S_n \xrightarrow{d} \chi^2_r$.
Equivalently, $S_n = n\,\bar U_n(\hat\theta_0)^\top I(\hat\theta_0)^{-1} \bar U_n(\hat\theta_0)$, where $\bar U_n(\hat\theta_0) = n^{-1}\nabla_\theta \ell_n(\hat\theta_0)$ is the average score per observation evaluated under the null.
Intuition
The score is the gradient of the log-likelihood. At the constrained MLE, the gradient components in the nuisance directions are exactly zero (the constrained MLE maximizes over $\lambda$ by definition), so only the components in the constrained directions can be nonzero. The score test asks: is the gradient in the directions we constrained meaningfully nonzero? If yes, the null point is not a good fit; reject. The statistic is the squared norm of the relevant score components, weighted by the inverse Fisher information.
Proof Sketch
By the multivariate central limit theorem, $n^{-1/2} U_n(\theta_0) \xrightarrow{d} N(0, I(\theta_0))$ under $H_0$. The score at the constrained MLE is asymptotically equivalent, up to $o_p(\sqrt{n})$ terms, to the efficient score at the true parameter (the constrained MLE moves with the data, which projects out the nuisance directions). Standardizing by the Fisher information gives a quadratic form whose limit is $\chi^2_r$, where the degrees of freedom $r$ equal the number of restrictions imposed (the dimension of $\psi$).
Why It Matters
The score test is computed entirely from the null fit; no alternative-model fit is required. This makes it the test of choice when the alternative is expensive (e.g., when testing whether a regression coefficient should be added: fitting the simpler model is cheap, and the score test for adding a variable uses only that fit). The score test is also the basis of the Pearson Chi-squared test: the Pearson statistic is the score test for goodness of fit against the unrestricted multinomial.
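A sketch of the added-variable workflow just described, assuming statsmodels' `compare_lm_test`/`compare_lr_test` methods on OLS results; the simulated data and coefficients are illustrative only.

```python
# Score (LM) test for adding a regressor: only the null fit drives the
# score statistic (statsmodels API assumed; illustrative simulated data).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
x1, x2 = rng.normal(size=(2, 300))
y = 1.0 + 0.5 * x1 + 0.2 * x2 + rng.normal(size=300)

small = sm.OLS(y, sm.add_constant(np.column_stack([x1]))).fit()
big = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit()
lm_stat, lm_pval, df = big.compare_lm_test(small)   # score/LM test
lr_stat, lr_pval, _ = big.compare_lr_test(small)    # LRT, for comparison
print(lm_stat, lr_stat, df)  # close for n = 300; both ~ chi2_1 under H0
```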
Failure Mode
The score test is invariant under reparameterization of the parameter being tested, but not under reparameterization of the nuisance parameter. Because it uses only local (gradient) information at the null, it can have low power against alternatives far from the null. The asymptotic Chi-squared is also less accurate in small samples than the LRT for many models; the score test can have inflated or deflated Type I error for moderate $n$. Adjustments (Edgeworth expansions, Bartlett corrections) exist but are model-specific.
Asymptotic Equivalence
Asymptotic Equivalence of the Trinity
Statement
Under a regular parametric model and contiguous local alternatives $\psi_n = \psi_0 + h/\sqrt{n}$, the three statistics $\Lambda_n$, $W_n$, $S_n$ all converge in distribution to the same noncentral Chi-squared $\chi^2_r(\delta)$ with noncentrality $\delta = h^\top \tilde I_{\psi\psi} h$. Under the null ($h = 0$), the common limit is the central $\chi^2_r$. The three statistics agree to first order under the null: $\Lambda_n - W_n = o_p(1)$ and $\Lambda_n - S_n = o_p(1)$.
Intuition
All three are second-order approximations to the same local log-likelihood quadratic form. They differ in which side of the maximum they expand from (LRT compares both sides; Wald expands from the unrestricted MLE; score expands from the constrained MLE), but the second-order term is the same to leading order, so the asymptotic distributions agree.
Proof Sketch
Taylor expand the log-likelihood around $\hat\theta$ to second order. The likelihood-ratio statistic is the height drop between the unconstrained and constrained maxima, which is a quadratic form in $\sqrt{n}(\hat\psi - \psi_0)$ weighted by the appropriate Fisher information block. The Wald statistic is the same quadratic form with the information replaced by a consistent estimate (no asymptotic change). The score statistic is the quadratic form in the gradient at the constrained MLE; by Taylor expansion that gradient equals the gradient at the unrestricted MLE, which is zero, plus a curvature term proportional to $\hat\theta - \hat\theta_0$, so the score statistic reduces to the same quadratic form up to $o_p(1)$ corrections.
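A small simulation, assuming Bernoulli data, showing that the three statistics agree sample-by-sample (not just in distribution) once $n$ is large.

```python
# First-order agreement of the trinity under the null for Bernoulli data:
# the three statistics are nearly identical draw-by-draw for large n.
import numpy as np

rng = np.random.default_rng(3)
n, p0 = 2_000, 0.4
p_hat = rng.binomial(n, p0, size=5) / n
lrt = 2 * n * (p_hat * np.log(p_hat / p0)
               + (1 - p_hat) * np.log((1 - p_hat) / (1 - p0)))
wald = n * (p_hat - p0) ** 2 / (p_hat * (1 - p_hat))
score = n * (p_hat - p0) ** 2 / (p0 * (1 - p0))
print(np.c_[lrt, wald, score])   # rows nearly identical for large n
```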
Why It Matters
The equivalence is what justifies any of the three as "the" asymptotic test for the same null. In practice, choose by computational convenience. For small samples, the agreement breaks down, and the LRT typically has the best small-sample size properties; the Wald test has the worst (Hauck-Donner) when the true parameter is far from the null. Empirical studies (Mantel-Haenszel settings for binary data, mixed-effects models, generalized linear models) consistently favor the LRT or score test over Wald for moderate $n$.
Failure Mode
The equivalence requires regularity. At boundary nulls (e.g., the variance of a random effect being zero), the trinity diverges: the LRT has a mixture-of-Chi-squareds limit, the Wald test's normal approximation fails because the MLE piles up on the boundary, and the score test has a one-sided limit. None of the standard $p$-values are correct in that case.
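A Monte Carlo sketch of the boundary failure in the toy problem $N(\theta, 1)$ with $\theta \geq 0$ and $H_0: \theta = 0$, where the LRT is $n \max(\bar X, 0)^2$ and the correct null limit is the mixture $\tfrac{1}{2}\delta_0 + \tfrac{1}{2}\chi^2_1$.

```python
# Boundary null: LRT for H0: theta = 0 in N(theta, 1) with theta >= 0.
# The null distribution is a 50:50 mixture of a point mass at 0 and chi2_1.
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
n, reps = 100, 50_000
xbar = rng.normal(0, 1, size=(reps, n)).mean(axis=1)
lrt = n * np.maximum(xbar, 0.0) ** 2   # MLE is max(xbar, 0) on theta >= 0
print((lrt == 0).mean())               # ~ 0.5: the point mass at zero
print(np.quantile(lrt, 0.95), stats.chi2.ppf(0.90, 1))
# the correct 5% critical value is the chi2_1 0.90 quantile (~2.71), not 3.84
```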
Comparison Table
| Aspect | LRT (Wilks) | Wald | Score (Rao) |
|---|---|---|---|
| Quantity used | Both null and alternative fits | Unrestricted MLE + standard error | Null fit only |
| Computed from | $\ell_n(\hat\theta)$ and $\ell_n(\hat\theta_0)$ | $\hat\psi$ and $\widehat{\operatorname{Var}}(\hat\psi)$ | $U_n(\hat\theta_0)$ and $J_n(\hat\theta_0)$ |
| Reparameterization-invariant | Yes | No | For the parameter of interest only |
| Small-sample size accuracy | Best of the three (typically) | Worst when the MLE is far from the null (Hauck-Donner) | Middle |
| Asymptotic distribution under null | $\chi^2_r$ | $\chi^2_r$ | $\chi^2_r$ |
| Local power | All three equivalent | All three equivalent | All three equivalent |
| Preferred when | Both fits are cheap | You already have the unrestricted MLE | Only the null fit is available |
For a scalar null hypothesis ($r = 1$), the signed LRT statistic $\operatorname{sign}(\hat\psi - \psi_0)\sqrt{\Lambda_n}$ is approximately standard Normal. The Wald and score tests in scalar form are $z$-tests; their squares are the Chi-squared statistics in the table.
Worked Example: Bernoulli
Let $X_1, \dots, X_n \stackrel{\text{iid}}{\sim} \operatorname{Bernoulli}(p)$ and test $H_0: p = p_0$ versus $H_1: p \neq p_0$. Let $\hat p = \bar X_n$.
- LRT. The unrestricted log-likelihood at $\hat p$ is $\ell_n(\hat p) = n[\hat p \log \hat p + (1 - \hat p)\log(1 - \hat p)]$. The constrained log-likelihood at $p_0$ is $\ell_n(p_0) = n[\hat p \log p_0 + (1 - \hat p)\log(1 - p_0)]$. The LRT statistic is $\Lambda_n = 2n\left[\hat p \log\frac{\hat p}{p_0} + (1 - \hat p)\log\frac{1 - \hat p}{1 - p_0}\right]$.
- Wald. $\widehat{\operatorname{Var}}(\hat p) = \hat p(1 - \hat p)/n$, so $W_n = \frac{n(\hat p - p_0)^2}{\hat p(1 - \hat p)}$.
- Score. The score per observation at $p_0$ is $u(p_0) = \frac{x - p_0}{p_0(1 - p_0)}$, so $U_n(p_0) = \frac{n(\hat p - p_0)}{p_0(1 - p_0)}$. The Fisher information per observation at $p_0$ is $I(p_0) = \frac{1}{p_0(1 - p_0)}$. The score statistic is $S_n = \frac{U_n(p_0)^2}{n I(p_0)} = \frac{n(\hat p - p_0)^2}{p_0(1 - p_0)}$.
The Wald and score statistics differ only in the denominator: Wald uses $\hat p(1 - \hat p)$ (the estimated variance under the alternative) and score uses $p_0(1 - p_0)$ (the variance under the null). Both are asymptotically equivalent. The LRT uses the full Kullback-Leibler divergence between $\operatorname{Bernoulli}(\hat p)$ and $\operatorname{Bernoulli}(p_0)$, $\Lambda_n = 2n\,\mathrm{KL}(\hat p \,\|\, p_0)$, which agrees to second order with the others.
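Plugging in hypothetical numbers ($n = 100$, 62 successes, $p_0 = 0.5$) makes the finite-sample disagreement concrete; the three statistics are close but not identical.

```python
# The Bernoulli trinity for hypothetical data: n = 100, 62 successes.
import numpy as np
from scipy import stats

n, p_hat, p0 = 100, 0.62, 0.5
lrt = 2 * n * (p_hat * np.log(p_hat / p0)
               + (1 - p_hat) * np.log((1 - p_hat) / (1 - p0)))
wald = n * (p_hat - p0) ** 2 / (p_hat * (1 - p_hat))
score = n * (p_hat - p0) ** 2 / (p0 * (1 - p0))
for name, s in [("LRT", lrt), ("Wald", wald), ("score", score)]:
    print(f"{name}: {s:.3f}, p = {stats.chi2.sf(s, df=1):.4f}")
# LRT: 5.817, Wald: 6.112, score: 5.760 -- close but not identical
```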
For a 95% confidence interval for $p$, the Wald interval is $\hat p \pm 1.96\sqrt{\hat p(1 - \hat p)/n}$, which can extend below 0 or above 1 for $\hat p$ near the boundary. The score (Wilson) interval, obtained by inverting the score test, is $\frac{\hat p + z^2/2n}{1 + z^2/n} \pm \frac{z}{1 + z^2/n}\sqrt{\frac{\hat p(1 - \hat p)}{n} + \frac{z^2}{4n^2}}$ with $z = 1.96$, which always lies in $[0, 1]$ and has better small-sample coverage. The Wilson interval is the recommended default.
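A sketch comparing the Wald and Wilson intervals for hypothetical counts, assuming statsmodels' `proportion_confint`; the manual Wilson formula is included as a cross-check.

```python
# Wald vs. Wilson intervals for a hypothetical 2 successes in 20 trials.
import numpy as np
from scipy.stats import norm
from statsmodels.stats.proportion import proportion_confint

count, n = 2, 20
p_hat, z = count / n, norm.ppf(0.975)

wald_lo = p_hat - z * np.sqrt(p_hat * (1 - p_hat) / n)   # dips below 0
center = (p_hat + z**2 / (2 * n)) / (1 + z**2 / n)
half = (z / (1 + z**2 / n)) * np.sqrt(p_hat * (1 - p_hat) / n
                                      + z**2 / (4 * n**2))
print(wald_lo)                                        # about -0.03
print(center - half, center + half)                   # Wilson: ~ (0.028, 0.301)
print(proportion_confint(count, n, method="wilson"))  # should match
```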
Common Confusions
Wald p-values can be misleading when the MLE is far from the null
For models with bounded parameter spaces (logistic regression with fitted probabilities near 0 or 1, variance components near 0), the Wald statistic can paradoxically decrease as the evidence against the null becomes stronger. The Hauck-Donner effect appears in logistic regression with large effect sizes and is a known reason to prefer LRT-based or profile-likelihood-based tests in those settings.
The LRT factor 2 is the Wilks scaling, not a heuristic
$2[\ell_n(\hat\theta) - \ell_n(\hat\theta_0)]$ has a $\chi^2$ limit; $\ell_n(\hat\theta) - \ell_n(\hat\theta_0)$ does not. The factor of 2 falls out of the second-order Taylor expansion: the height drop in log-likelihood is half the squared distance in the standardized parameterization, so doubling recovers the squared-distance scaling that matches the Chi-squared.
Score test uses the null fit's Fisher information
The score test evaluates both the gradient and the Fisher information at the constrained MLE under the null, not at the unrestricted MLE. In the econometrics literature the same procedure is called the Lagrange-multiplier test, because the score at the constrained MLE equals the Lagrange multiplier of the constrained optimization. Variants that plug in a different consistent information estimate are asymptotically equivalent but distinct finite-sample procedures.
Pearson Chi-squared is a score test
The Pearson $\chi^2 = \sum_j (O_j - E_j)^2 / E_j$ for multinomial data is exactly the score statistic for testing the multinomial probabilities against the null specification. The likelihood-ratio version is the $G$-statistic $G = 2\sum_j O_j \log(O_j / E_j)$. Both have the same asymptotic Chi-squared null distribution ($\chi^2_{k-1}$ for $k$ cells with a fully specified null).
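A small check on hypothetical die-roll counts, assuming scipy's `chisquare` and `power_divergence` functions (the latter with `lambda_="log-likelihood"` gives the $G$-statistic).

```python
# Pearson chi2 (score-type) and G-statistic (LRT-type) for hypothetical
# die-roll counts; with no f_exp given, scipy assumes equal expected counts.
from scipy.stats import chisquare, power_divergence

observed = [18, 22, 16, 25, 20, 19]   # 120 rolls, H0: fair die
pearson = chisquare(observed)
g = power_divergence(observed, lambda_="log-likelihood")
print(pearson.statistic, pearson.pvalue)   # 2.5, ~0.78
print(g.statistic, g.pvalue)               # close to Pearson; both ~ chi2_5
```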
Exercises
Problem
For a Bernoulli sample with given $n$ and observed success count (hence a given $\hat p$), compute the LRT, Wald, and score statistics for $H_0: p = p_0$ and carry out each test at level 0.05.
Problem
For the Bernoulli example with small $n$ and $\hat p$ near the boundary, construct three 95% confidence intervals: Wald, score (Wilson), and LRT-based (profile-likelihood). Compare their widths and lower endpoints.
Problem
Show that the score test for $H_0: \beta_j = 0$ in a Normal linear regression model is asymptotically equivalent to the $t$-test (squared) on $\hat\beta_j$.
Problem
Construct a one-sided LRT for the variance of a Normal sample: $H_0: \sigma^2 = \sigma_0^2$ versus $H_1: \sigma^2 > \sigma_0^2$ with $\mu$ unknown. Derive the LRT statistic, identify its small-sample distribution under the null, and explain why the asymptotic result still applies.
References
Canonical:
- Casella and Berger, Statistical Inference (2002), Chapter 8 (Section 8.2 on LRT, Section 8.3 on score tests; Chapter 10 on asymptotic results).
- Lehmann and Romano, Testing Statistical Hypotheses (2005), Chapter 12 (asymptotic optimality of LRT and score tests).
- Bickel and Doksum, Mathematical Statistics, Volume I (2015), Chapter 6.
- van der Vaart, Asymptotic Statistics (1998), Chapter 16 (Wilks's theorem and the asymptotic equivalence of the trinity).
Foundational papers:
- Wilks, "The large-sample distribution of the likelihood ratio for testing composite hypotheses" (Annals of Mathematical Statistics, 1938).
- Wald, "Tests of statistical hypotheses concerning several parameters when the number of observations is large" (Transactions of the American Mathematical Society, 1943).
- Rao, "Large sample tests of statistical hypotheses concerning several parameters with applications to problems of estimation" (Proceedings of the Cambridge Philosophical Society, 1948).
Finite-sample comparisons and pitfalls:
- Hauck and Donner, "Wald's test as applied to hypotheses in logit analysis" (JASA, 1977), the Wald-test paradox.
- Self and Liang, "Asymptotic properties of maximum likelihood estimators and likelihood ratio tests under nonstandard conditions" (JASA, 1987), boundary nulls.