
Statistical Estimation

Likelihood-Ratio, Wald, and Score Tests

The three asymptotic tests built from the likelihood: the LRT compares maximized likelihoods, the Wald test compares the MLE to the null value using the inverse Fisher information, and the score test uses the gradient of the log-likelihood at the null. All three are asymptotically Chi-squared and equivalent under regularity; they disagree in finite samples, and the choice among them depends on the problem.


Why This Matters

Three asymptotic tests dominate parametric hypothesis testing: the likelihood-ratio test (LRT, Wilks), the Wald test, and the score test (Rao). All three are derived from the likelihood, all three are asymptotically Chi-squared under regularity, and all three are asymptotically equivalent under the null. They differ in finite samples and in which numerical quantity they require: maximizing the likelihood under the null and the alternative (LRT), under the alternative only (Wald), or under the null only (score).

The three statistics test the same hypothesis using different parts of the log-likelihood surface. The LRT compares the heights of the surface at the constrained and unconstrained maxima. The Wald test measures the horizontal distance from the unconstrained maximum to the null point, weighted by the local curvature. The score test measures the slope of the surface at the null point, weighted by the same curvature. Far from the null, the three statistics often disagree noticeably; close to the null they agree to leading order.

The choice among them is computational and structural:

  1. Use the LRT when you can fit both the null and the alternative model and want the most "correct" test in the small-sample regime. The LRT is invariant under reparameterization, which is its main theoretical advantage.
  2. Use the Wald test when you have the unrestricted MLE and its standard error. This is what most regression-output tables produce automatically.
  3. Use the score test when fitting the alternative is expensive but fitting the null is cheap. The score test is computed entirely from the null fit and is invariant under reparameterizations of the parameters being tested (but not under reparameterizations of nuisance parameters).

This page derives each, states the asymptotic Chi-squared distribution, and explains the equivalence.

Setup

Let $\theta\in\Theta\subset\mathbb{R}^p$ index a parametric family $p(\cdot;\theta)$. Partition $\theta = (\psi, \eta)$, where $\psi\in\mathbb{R}^q$ is the parameter of interest and $\eta\in\mathbb{R}^{p-q}$ is a nuisance parameter. Test
$$H_0: \psi = \psi_0 \qquad\text{versus}\qquad H_1: \psi\ne\psi_0.$$
Let $\ell_n(\theta) = \sum_{i=1}^n\log p(X_i;\theta)$ be the log-likelihood for an i.i.d. sample.

Three objects recur. The unrestricted MLE $\hat\theta_n = \arg\max_\theta\ell_n(\theta) = (\hat\psi_n, \hat\eta_n)$. The constrained MLE under $H_0$, $\tilde\theta_n = (\psi_0, \tilde\eta_n)$, where $\tilde\eta_n = \arg\max_\eta\ell_n(\psi_0, \eta)$. The score vector $U_n(\theta) = \partial\ell_n/\partial\theta$ and the observed Fisher information matrix $I_n(\theta) = -\partial^2\ell_n/\partial\theta\,\partial\theta^\top$ (or its expectation, the expected information).
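To make the setup concrete, here is a minimal Python sketch of these objects for a two-parameter Normal model with $\theta = (\mu, \sigma^2)$, testing $H_0: \mu = \mu_0$ with $\sigma^2$ as nuisance. The function names and data are illustrative, not from any particular library.

```python
import numpy as np

# Sketch: the setup objects for X_i ~ Normal(mu, s2), theta = (mu, s2),
# testing H0: mu = mu0 with s2 as a nuisance parameter.

def loglik(theta, x):
    mu, s2 = theta
    return -0.5 * len(x) * np.log(2 * np.pi * s2) - np.sum((x - mu) ** 2) / (2 * s2)

def score(theta, x):
    # U_n(theta): gradient of the log-likelihood in (mu, s2)
    mu, s2 = theta
    return np.array([
        np.sum(x - mu) / s2,
        -len(x) / (2 * s2) + np.sum((x - mu) ** 2) / (2 * s2 ** 2),
    ])

def observed_info(theta, x):
    # I_n(theta): minus the Hessian of the log-likelihood
    mu, s2 = theta
    n, S = len(x), np.sum((x - mu) ** 2)
    return np.array([
        [n / s2, np.sum(x - mu) / s2 ** 2],
        [np.sum(x - mu) / s2 ** 2, S / s2 ** 3 - n / (2 * s2 ** 2)],
    ])

def mles(x, mu0):
    # Unrestricted MLE (mu_hat, s2_hat) and constrained MLE (mu0, s2_tilde)
    mu_hat = x.mean()
    return (np.array([mu_hat, np.mean((x - mu_hat) ** 2)]),
            np.array([mu0, np.mean((x - mu0) ** 2)]))

x = np.random.default_rng(0).normal(1.0, 2.0, size=100)
theta_hat, theta_tilde = mles(x, mu0=0.0)
print(loglik(theta_hat, x) - loglik(theta_tilde, x))  # height drop the LRT doubles
print(score(theta_tilde, x))  # zero in the s2-direction, nonzero in mu
```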

The Likelihood-Ratio Test

Theorem

Wilks's Likelihood-Ratio Asymptotics

Statement

The likelihood-ratio statistic
$$\Lambda_n = -2\log\frac{\sup_{\theta\in\Theta_0}L_n(\theta)}{\sup_{\theta\in\Theta}L_n(\theta)} = 2\,[\ell_n(\hat\theta_n) - \ell_n(\tilde\theta_n)]$$
satisfies $\Lambda_n\to_d\chi^2_q$ under $H_0$ as $n\to\infty$, where $q$ is the number of restrictions imposed by $H_0$.

Intuition

$\Lambda_n$ is twice the height difference between the unconstrained and constrained maxima of the log-likelihood. Under the null, the constraint is true and the height drop is small; expand the log-likelihood to second order around the unrestricted MLE and the drop is a quadratic form in the constraint, asymptotically Chi-squared.

Proof Sketch

Taylor expand $\ell_n(\theta)$ around $\hat\theta_n$ to second order:
$$\ell_n(\theta)\approx\ell_n(\hat\theta_n) - \tfrac12\, n\,(\theta - \hat\theta_n)^\top I(\theta_0)\,(\theta - \hat\theta_n),$$
where $I(\theta_0)$ is the per-observation Fisher information. Evaluating at $\theta = \tilde\theta_n$ and noting that the constrained MLE differs from the unconstrained MLE by an $O_p(n^{-1/2})$ correction, the height drop takes the form
$$n\,(\hat\psi_n - \psi_0)^\top\bigl[I(\theta_0)^{-1}\bigr]_{\psi\psi}^{-1}(\hat\psi_n - \psi_0)$$
up to $O_p(n^{-1})$ errors. By MLE asymptotic normality, $\sqrt n\,(\hat\psi_n - \psi_0)\to_d\mathcal{N}(0, [I^{-1}]_{\psi\psi})$, so the quadratic form converges to a $\chi^2_q$.
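A quick Monte Carlo check of the theorem, under the same Normal model as the setup sketch (illustrative, not part of the formal content): plugging both MLEs into the Normal log-likelihood gives the closed form $\Lambda_n = n\log(\tilde\sigma^2/\hat\sigma^2)$.

```python
import numpy as np
from scipy import stats

# Monte Carlo check of Wilks's theorem: Normal(mu, s2) data,
# H0: mu = 0 with s2 a nuisance parameter, so q = 1.
rng = np.random.default_rng(0)
n, reps = 50, 20_000
x = rng.normal(loc=0.0, scale=2.0, size=(reps, n))  # H0 is true

s2_hat = x.var(axis=1)               # unrestricted MLE of s2 (divides by n)
s2_tilde = (x ** 2).mean(axis=1)     # constrained MLE of s2 at mu0 = 0
lam = n * np.log(s2_tilde / s2_hat)  # = 2 [l(theta_hat) - l(theta_tilde)]

crit = stats.chi2.ppf(0.95, df=1)
print("rejection rate at nominal 5%:", np.mean(lam > crit))  # close to 0.05
```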

Why It Matters

The LRT is the natural test from the Neyman-Pearson perspective: it generalizes the simple-versus-simple most-powerful test to composite hypotheses. Under regularity, Wilks's theorem guarantees the asymptotic Chi-squared reference, and the test is asymptotically locally most powerful in many parametric models. The factor $-2$ exists so that the leading-order quadratic form has the natural Chi-squared scaling.

Failure Mode

Wilks's theorem fails at parameter-space boundaries, where the asymptotic distribution becomes a mixture of Chi-squareds with different degrees of freedom (the Chernoff (1954) phenomenon); see Self and Liang (1987) for the general boundary theory. It also fails for non-regular models (identifiability failures, non-differentiable likelihoods), where the LRT typically has a non-standard limiting distribution that must be derived case by case.

The Wald Test

Theorem

Wald Test Asymptotics

Statement

The Wald statistic is
$$W_n = n\,(\hat\psi_n - \psi_0)^\top \hat\Sigma_n^{-1}(\hat\psi_n - \psi_0),$$
where $\hat\Sigma_n$ is a consistent estimator of the asymptotic covariance of $\sqrt n\,\hat\psi_n$ (typically the $\psi\psi$ block of the inverse observed or expected Fisher information at the unrestricted MLE). Under $H_0$, $W_n\to_d\chi^2_q$.

Intuition

$W_n$ standardizes the unrestricted MLE $\hat\psi_n$ by its estimated standard error and squares the result. It is the multivariate $z^2$ statistic: how many standard deviations is the MLE from the null value, squared and summed across components. Under the null and in large samples, the standardized MLE is approximately standard Normal, so its squared norm is approximately Chi-squared.

Proof Sketch

MLE asymptotic normality gives $\sqrt n\,(\hat\psi_n - \psi)\to_d\mathcal{N}(0, \Sigma)$ with $\Sigma = [I(\theta_0)^{-1}]_{\psi\psi}$. Under the null, $\psi = \psi_0$, so $\sqrt n\,(\hat\psi_n - \psi_0)\to_d\mathcal{N}(0, \Sigma)$ and the standardized statistic $n\,(\hat\psi_n - \psi_0)^\top\Sigma^{-1}(\hat\psi_n - \psi_0)\to_d\chi^2_q$. Replacing $\Sigma$ by a consistent estimator $\hat\Sigma_n$ preserves the limit by Slutsky's theorem.
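The recipe in code, as a sketch with illustrative names. Here `cov_hat` is the estimated covariance of $\hat\psi$ itself (i.e., $\hat\Sigma_n/n$), so the factor $n$ in the displayed formula is already absorbed.

```python
import numpy as np
from scipy import stats

def wald_test(psi_hat, psi0, cov_hat):
    """Wald statistic and p-value for H0: psi = psi0.

    cov_hat is the estimated covariance of psi_hat itself (Sigma_hat / n).
    """
    diff = np.atleast_1d(np.asarray(psi_hat, dtype=float) - psi0)
    W = float(diff @ np.linalg.solve(np.atleast_2d(cov_hat), diff))
    return W, stats.chi2.sf(W, df=diff.size)

# Scalar example: Normal mean, variance estimated at the MLE
x = np.random.default_rng(1).normal(0.3, 1.0, size=200)
print(wald_test(x.mean(), 0.0, x.var() / len(x)))
```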

Why It Matters

Most regression software (R's summary.lm, Python's statsmodels, Stata) reports Wald-style test statistics by default, because all that is needed is the MLE and its estimated covariance. The $p$-values associated with individual regression coefficients are Wald tests of each coefficient against zero. The Wald test is also the basis of the standard confidence interval $\hat\psi_n \pm 1.96\cdot\widehat{\mathrm{SE}}$ in the scalar case.

Failure Mode

The Wald test is not invariant under reparameterization. If you transform $\psi$ by a smooth nonlinear map $g$, the Wald statistic for $H_0: g(\psi) = g(\psi_0)$ can give a different $p$-value than the Wald statistic for $H_0: \psi = \psi_0$, even though the hypotheses are equivalent. A separate pathology is the Hauck-Donner phenomenon: when the unrestricted MLE is far from the null, the estimated standard error can grow faster than the estimate itself, so the Wald statistic shrinks toward zero, paradoxically reducing rejection. The LRT has neither problem.

The Score Test

Theorem

Score (Rao) Test Asymptotics

Statement

The score statistic is
$$S_n = U_n(\tilde\theta_n)^\top I_n(\tilde\theta_n)^{-1}\,U_n(\tilde\theta_n),$$
where $U_n(\tilde\theta_n)$ is the score vector evaluated at the constrained MLE and $I_n(\tilde\theta_n)$ is the total observed (or expected) Fisher information at the constrained MLE. Under $H_0$, $S_n\to_d\chi^2_q$.

Equivalently, $S_n = n\,\bar U^\top I(\tilde\theta_n)^{-1}\bar U$, where $\bar U = (1/n)\,U_n(\tilde\theta_n)$ is the average score per observation under the null and $I$ is the per-observation information.

Intuition

The score is the gradient of the log-likelihood. At the constrained MLE, the components of the score in the nuisance ($\eta$) directions are zero by construction; only the components in the constrained ($\psi$) directions can be nonzero. The score test asks: is the gradient in the directions we constrained meaningfully nonzero? If yes, the null point is not a good fit; reject. The statistic is the squared norm of the relevant score components, weighted by the inverse Fisher information.

Proof Sketch

By the multivariate central limit theorem, $(1/\sqrt n)\,U_n(\theta_0)\to_d\mathcal{N}(0, I(\theta_0))$ under $H_0$. The score at the constrained MLE is asymptotically equivalent to the score at $\theta_0$ up to terms of order $n^{-1/2}$ (the constrained MLE moves with the data). Standardizing by $I(\theta_0)^{-1}$ gives a quadratic form whose limit is $\chi^2_q$, where the degrees of freedom are the number of restrictions imposed (the dimension of $\psi$).
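A concrete instance, as a sketch: the score test for a Poisson mean with no nuisance parameter, where the constrained "fit" is just the hypothesized value and everything is closed-form. The numbers are illustrative.

```python
import numpy as np
from scipy import stats

def poisson_score_test(x, lam0):
    # Score test for H0: lam = lam0 in a Poisson model (no nuisance parameter,
    # so the constrained fit is lam0 itself).
    n = len(x)
    U = (x.sum() - n * lam0) / lam0   # score at the null
    info = n / lam0                   # expected Fisher information I_n(lam0)
    S = U ** 2 / info                 # = n (xbar - lam0)^2 / lam0
    return S, stats.chi2.sf(S, df=1)

x = np.random.default_rng(2).poisson(lam=3.4, size=80)
print(poisson_score_test(x, lam0=3.0))
```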

Why It Matters

The score test is computed entirely from the null fit; no alternative-model fit is required. This makes it the test of choice when the alternative is expensive (e.g., when testing whether a regression coefficient should be added: fitting the simpler model is cheap, and the score test for adding a variable uses only that fit). The score test is also the basis of the Pearson Chi-squared test: the Pearson statistic is the score test for goodness of fit against the unrestricted multinomial.

Failure Mode

The score test is invariant under reparameterization of the parameter being tested, but not under reparameterization of the nuisance parameter. It can have low power against alternatives far from the null, since it uses only the local slope of the likelihood at the null point. The asymptotic Chi-squared approximation is also less accurate in small samples than the LRT for many models; the score test can have inflated or deflated Type I error for moderate $n$. Adjustments (Edgeworth expansions, Bartlett-type corrections) exist but are model-specific.

Asymptotic Equivalence

Theorem

Asymptotic Equivalence of the Trinity

Statement

Under a regular parametric model and contiguous local alternatives $H_n: \psi = \psi_0 + h/\sqrt n$, the three statistics $\Lambda_n$, $W_n$, $S_n$ all converge in distribution to the same noncentral Chi-squared $\chi^2_q\bigl(h^\top[I^{-1}]_{\psi\psi}^{-1}h\bigr)$. Under the null ($h = 0$), the common limit is the central $\chi^2_q$. The three statistics agree to first order under the null: $|\Lambda_n - W_n| = O_p(n^{-1/2})$ and $|\Lambda_n - S_n| = O_p(n^{-1/2})$.

Intuition

All three are second-order approximations to the same local log-likelihood quadratic form. They differ in which side of the maximum they expand from (LRT compares both sides; Wald expands from the unrestricted MLE; score expands from the constrained MLE), but the second-order term is the same to leading order, so the asymptotic distributions agree.

Proof Sketch

Taylor expand the log-likelihood around $\theta_0$ to second order. The likelihood-ratio statistic is the height drop between the unconstrained and constrained maxima, a quadratic form in $\hat\psi_n - \psi_0$ weighted by the appropriate Fisher information block. The Wald statistic is the same quadratic form with the covariance replaced by a consistent estimate (no asymptotic change). The score statistic is the quadratic form in the gradient at the constrained MLE; by a Taylor expansion, that gradient equals the (zero) gradient at the unrestricted MLE plus an information-weighted multiple of $\hat\theta_n - \tilde\theta_n$, so the score statistic reduces to the same quadratic form up to lower-order corrections.
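A small simulation illustrating the first-order agreement, as a sketch under an assumed Poisson model where all three statistics are closed-form: the median gaps $|\Lambda_n - W_n|$ and $|\Lambda_n - S_n|$ shrink roughly like $n^{-1/2}$.

```python
import numpy as np

# Finite-sample disagreement of the trinity shrinks with n.
# Poisson model with H0: lam = lam0 true.
rng = np.random.default_rng(3)
lam0 = 2.0
for n in (25, 100, 400, 1600):
    xbar = rng.poisson(lam0, size=(5000, n)).mean(axis=1)
    lrt = 2 * n * (xbar * np.log(xbar / lam0) - xbar + lam0)
    wald = n * (xbar - lam0) ** 2 / xbar    # variance estimated at the MLE
    score = n * (xbar - lam0) ** 2 / lam0   # variance evaluated at the null
    print(n, np.median(np.abs(lrt - wald)), np.median(np.abs(lrt - score)))
```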

Why It Matters

The equivalence is what justifies any of the three as "the" asymptotic test for the same null. In practice, choose by computational convenience. For small samples, the agreement breaks down: the LRT typically has the best small-sample size properties, and the Wald test has the worst (the Hauck-Donner effect) when the true parameter is far from the null. Empirical studies (Mantel-Haenszel settings for binary data, mixed-effects models, generalized linear models) consistently favor the LRT or score test over Wald for moderate $n$.

Failure Mode

The equivalence requires regularity. At boundary nulls (e.g., testing whether the variance of a random effect is zero), the trinity diverges: the LRT has a mixture-of-Chi-squareds limit, the Wald statistic's normal approximation breaks down because the MLE piles up on the boundary, and the score test has a one-sided limit. None of the standard $p$-values is correct in that case.

Comparison Table

| Aspect | LRT (Wilks) | Wald | Score (Rao) |
| --- | --- | --- | --- |
| Quantity used | Both null and alternative fits | Unrestricted MLE + standard error | Null fit only |
| Computed from | $\ell(\hat\theta) - \ell(\tilde\theta)$ | $\hat\psi - \psi_0$ and $\hat\Sigma$ | $U(\tilde\theta)$ and $I(\tilde\theta)$ |
| Reparameterization-invariant | Yes | No | For parameter of interest only |
| Small-sample size accuracy | Best of the three (typically) | Worst when the MLE is far from the null (Hauck-Donner) | Middle |
| Asymptotic null distribution | $\chi^2_q$ | $\chi^2_q$ | $\chi^2_q$ |
| Local power | All three equivalent | All three equivalent | All three equivalent |
| Preferred when | Both fits are cheap | You already have the MLE | Only the null fit is available |

For a scalar null hypothesis ($q = 1$), the signed LRT statistic $\operatorname{sign}(\hat\psi - \psi_0)\sqrt{\Lambda_n}$ is approximately standard Normal. The Wald and score tests in scalar form are $z$-tests; their squares are the Chi-squared statistics in the table.

Worked Example: Bernoulli

Let $X_1,\dots,X_n\sim\operatorname{Bern}(p)$ and test $H_0: p = p_0$ versus $H_1: p\ne p_0$. Let $\hat p = \bar X_n$.

  • LRT. The unrestricted log-likelihood at $\hat p$ is $n[\hat p\log\hat p + (1-\hat p)\log(1-\hat p)]$. The constrained log-likelihood at $p_0$ is $n[\hat p\log p_0 + (1-\hat p)\log(1-p_0)]$. The LRT statistic is
$$\Lambda_n = 2n\left[\hat p\log\frac{\hat p}{p_0} + (1-\hat p)\log\frac{1-\hat p}{1-p_0}\right].$$
  • Wald. $\hat\sigma_W^2 = \hat p(1-\hat p)/n$, so $W_n = n(\hat p - p_0)^2/[\hat p(1-\hat p)]$.
  • Score. The score per observation at $p = p_0$ is $(X_i - p_0)/[p_0(1 - p_0)]$, so $U_n(p_0)/n = (\hat p - p_0)/[p_0(1-p_0)]$. The Fisher information per observation at $p_0$ is $1/[p_0(1-p_0)]$. The score statistic is $S_n = n(\hat p - p_0)^2/[p_0(1-p_0)]$.

The Wald and score statistics differ only in the denominator: Wald uses $\hat p(1-\hat p)$ (the estimated variance under the alternative) and score uses $p_0(1-p_0)$ (the variance under the null). The two are asymptotically equivalent. The LRT uses the full Kullback-Leibler divergence between $\hat p$ and $p_0$, which agrees with the others to second order; the sketch below compares the three numerically.
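A minimal numerical comparison (the inputs $n = 200$, $\hat p = 0.6$, $p_0 = 0.5$ are illustrative, not from the text):

```python
import numpy as np
from scipy import stats

def bernoulli_trinity(n, p_hat, p0):
    lrt = 2 * n * (p_hat * np.log(p_hat / p0)
                   + (1 - p_hat) * np.log((1 - p_hat) / (1 - p0)))
    wald = n * (p_hat - p0) ** 2 / (p_hat * (1 - p_hat))
    score = n * (p_hat - p0) ** 2 / (p0 * (1 - p0))
    return lrt, wald, score

for name, stat in zip(("LRT", "Wald", "score"), bernoulli_trinity(200, 0.60, 0.5)):
    print(f"{name}: {stat:.3f}, p = {stats.chi2.sf(stat, df=1):.4f}")
```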

For a 95% confidence interval for $p$, the Wald interval is $\hat p\pm 1.96\sqrt{\hat p(1-\hat p)/n}$, which can extend below 0 or above 1 for $\hat p$ near the boundary. The score (Wilson) interval, obtained by inverting the score test, is
$$\frac{\hat p + z^2/(2n)\pm z\sqrt{\hat p(1-\hat p)/n + z^2/(4n^2)}}{1 + z^2/n},$$
which always lies in $[0, 1]$ and has better small-sample coverage. The Wilson interval is the recommended default.
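The two intervals side by side: with $\hat p$ near the boundary, the Wald interval escapes $[0, 1]$ while the Wilson interval does not (the values $\hat p = 0.04$, $n = 25$ are illustrative).

```python
import numpy as np

def wald_interval(p_hat, n, z=1.96):
    half = z * np.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half, p_hat + half

def wilson_interval(p_hat, n, z=1.96):
    center = (p_hat + z ** 2 / (2 * n)) / (1 + z ** 2 / n)
    half = (z * np.sqrt(p_hat * (1 - p_hat) / n + z ** 2 / (4 * n ** 2))
            / (1 + z ** 2 / n))
    return center - half, center + half

print(wald_interval(0.04, 25))    # lower endpoint falls below 0
print(wilson_interval(0.04, 25))  # stays inside [0, 1]
```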

Common Confusions

Watch Out

Wald p-values can be misleading when the MLE is far from the null

For models with bounded parameter spaces (logistic regression with fitted probabilities near 0 or 1, variance components near 0), the Wald statistic can paradoxically decrease as the estimated effect becomes more extreme. The Hauck-Donner effect appears in logistic regression with large effect sizes and is a known reason to prefer LRT-based or profile-likelihood-based tests in those settings.
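A minimal sketch of the effect in a two-arm binomial comparison, where the closed-form log-odds-ratio MLE and its standard error stand in for a full logistic fit (all numbers are illustrative): as the second arm's success count grows toward its maximum, the Wald $z$ first rises and then falls, while the LRT keeps increasing.

```python
import numpy as np

# Two-arm binomial (50 per arm, arm A fixed at 25/50). beta is the log odds
# ratio; its Wald SE is sqrt(1/a + 1/b + 1/c + 1/d) from the 2x2 table.
def binom_ll(y, n, p):
    return y * np.log(p) + (n - y) * np.log(1 - p)

nA = nB = 50
yA = 25
for s in (35, 40, 45, 47, 48, 49):
    beta = np.log((s / (nB - s)) / (yA / (nA - yA)))
    se = np.sqrt(1 / s + 1 / (nB - s) + 1 / yA + 1 / (nA - yA))
    pooled = (yA + s) / (nA + nB)
    lrt = 2 * (binom_ll(yA, nA, yA / nA) + binom_ll(s, nB, s / nB)
               - binom_ll(yA, nA, pooled) - binom_ll(s, nB, pooled))
    print(f"s={s}: Wald z = {beta / se:.2f}, LRT = {lrt:.1f}")
```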

Watch Out

The LRT factor 2 is the Wilks scaling, not a heuristic

$-2\log\Lambda$ has a $\chi^2_q$ limit; $-\log\Lambda$ does not. The factor of 2 falls out of the second-order Taylor expansion: the height drop in log-likelihood is half the squared distance in the standardized parameterization, so doubling recovers the squared-distance scaling that matches the Chi-squared.

Watch Out

Score test uses the null fit's Fisher information

The score test evaluates both the gradient and the Fisher information at the constrained MLE under the null, not at the unrestricted MLE. The Lagrange-multiplier test of econometrics is the same procedure under a different name; variants that evaluate the information at a different consistent estimate are asymptotically equivalent but numerically distinct.

Watch Out

Pearson Chi-squared is a score test

The Pearson $\chi^2 = \sum(O - E)^2/E$ for multinomial data is exactly the score statistic for testing the multinomial probabilities against the null specification. The likelihood-ratio version is the $G$-statistic $2\sum O\log(O/E)$. Both have $\chi^2_{k-1}$ as their asymptotic null distribution (for $k$ cells and a fully specified null).
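Both statistics are available in SciPy via scipy.stats.power_divergence; a minimal sketch with illustrative counts:

```python
import numpy as np
from scipy.stats import power_divergence

observed = np.array([18, 30, 24, 28])
expected = np.full(4, observed.sum() / 4)  # null: equal cell probabilities

# lambda_="pearson" gives sum((O - E)^2 / E); lambda_="log-likelihood"
# gives the G-statistic 2 * sum(O * log(O / E)). Both use df = k - 1 = 3.
print(power_divergence(observed, expected, lambda_="pearson"))
print(power_divergence(observed, expected, lambda_="log-likelihood"))
```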

Exercises

ExerciseCore

Problem

For a Bernoulli sample with $n = 100$ and $\hat p = 0.42$, compute the LRT, Wald, and score statistics for $H_0: p = 0.5$ at level 0.05.

ExerciseCore

Problem

For the Bernoulli example with $n = 100$ and $\hat p = 0.42$, construct three 95% confidence intervals: Wald, score (Wilson), and LRT-based (profile likelihood). Compare their widths and lower endpoints.

ExerciseAdvanced

Problem

Show that the score test for $H_0: \beta_j = 0$ in a Normal linear regression model is asymptotically equivalent to the (squared) $t$-test on $\hat\beta_j$.

ExerciseAdvanced

Problem

Construct a one-sided LRT for the variance of a Normal sample: $H_0: \sigma^2 = \sigma_0^2$ versus $H_1: \sigma^2 > \sigma_0^2$ with $\mu$ unknown. Derive the LRT statistic, identify its small-sample distribution under the null, and explain how the one-sided alternative changes the asymptotic reference distribution relative to the two-sided $\chi^2_1$.

References

Canonical:

  • Casella and Berger, Statistical Inference (2002), Chapter 8 (Section 8.2 on LRT, Section 8.3 on score tests; Chapter 10 on asymptotic results).
  • Lehmann and Romano, Testing Statistical Hypotheses (2005), Chapter 12 (asymptotic optimality of LRT and score tests).
  • Bickel and Doksum, Mathematical Statistics, Volume I (2015), Chapter 6.
  • van der Vaart, Asymptotic Statistics (1998), Chapter 16 (Wilks's theorem and the asymptotic equivalence of the trinity).

Foundational papers:

  • Wilks, "The large-sample distribution of the likelihood ratio for testing composite hypotheses" (Annals of Mathematical Statistics, 1938).
  • Wald, "Tests of statistical hypotheses concerning several parameters when the number of observations is large" (Transactions of the American Mathematical Society, 1943).
  • Rao, "Large sample tests of statistical hypotheses concerning several parameters with applications to problems of estimation" (Proceedings of the Cambridge Philosophical Society, 1948).

Finite-sample comparisons and pitfalls:

  • Hauck and Donner, "Wald's test as applied to hypotheses in logit analysis" (JASA, 1977), the Wald-test paradox.
  • Self and Liang, "Asymptotic properties of maximum likelihood estimators and likelihood ratio tests under nonstandard conditions" (JASA, 1987), boundary nulls.
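  • Chernoff, "On the distribution of the likelihood ratio" (Annals of Mathematical Statistics, 1954), boundary and one-sided LRT limits.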

Last reviewed: May 11, 2026
