
Statistical Foundations

Neyman-Pearson and Hypothesis Testing Theory

The likelihood ratio test is the most powerful test for simple hypotheses (Neyman-Pearson lemma), UMP tests extend this to one-sided composites, and the power function characterizes a test's behavior across the parameter space.


Why This Matters

Hypothesis testing is the formal framework for making binary decisions from data: is the drug effective? Is the model better than the baseline? The Neyman-Pearson lemma answers a precise optimization question: among all tests with Type I error at most $\alpha$, which test has the highest probability of correctly rejecting a false null hypothesis?

The answer is the likelihood ratio test. This result is the foundation for understanding power analysis, sample size calculations, and the deep connection between hypothesis testing and binary classification.

Formal Setup

Definition

Hypothesis Test

A hypothesis test for $H_0: \theta = \theta_0$ versus $H_1: \theta = \theta_1$ is a function $\phi: \mathcal{X} \to [0, 1]$, where $\phi(x)$ is the probability of rejecting $H_0$ given data $x$. The size (Type I error rate) is $\alpha(\phi) = \mathbb{E}_{\theta_0}[\phi(X)]$. The power against $\theta_1$ is $\beta(\phi) = \mathbb{E}_{\theta_1}[\phi(X)]$.

A test $\phi$ has level $\alpha$ if $\mathbb{E}_{\theta_0}[\phi(X)] \leq \alpha$.

Definition

Power Function

The power function of a test $\phi$ is:

$$\beta_\phi(\theta) = \mathbb{E}_\theta[\phi(X)] = P_\theta(\text{reject } H_0)$$

For a good test, $\beta_\phi(\theta_0) \leq \alpha$ (controlled Type I error) and $\beta_\phi(\theta)$ is large for $\theta$ far from $\theta_0$ (high power against alternatives).
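For a concrete instance, here is a sketch of the power function of the one-sided z-test for a normal mean with known variance (the parameter defaults are illustrative, not from the text):

```python
from scipy.stats import norm

def power_one_sided_z(theta, theta0=0.0, sigma=1.0, n=25, alpha=0.05):
    """beta(theta) for the test that rejects H0: theta <= theta0
    when Xbar > theta0 + sigma * z_alpha / sqrt(n)."""
    z_alpha = norm.ppf(1 - alpha)
    shift = (theta - theta0) * n**0.5 / sigma
    return 1 - norm.cdf(z_alpha - shift)

# beta(theta0) equals the size alpha, and beta rises toward 1 as theta grows.
print(round(power_one_sided_z(0.0), 3))   # = alpha = 0.05
print(round(power_one_sided_z(0.5), 3))
```

Evaluating the power function at the null gives back the size, which is exactly the "controlled Type I error" condition above.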

Main Theorems

Lemma

Neyman-Pearson Lemma

Statement

For testing $H_0: \theta = \theta_0$ vs $H_1: \theta = \theta_1$ at level $\alpha$, the test that rejects when the likelihood ratio exceeds a threshold:

$$\phi^*(x) = \begin{cases} 1 & \text{if } \frac{f_1(x)}{f_0(x)} > k \\ \gamma & \text{if } \frac{f_1(x)}{f_0(x)} = k \\ 0 & \text{if } \frac{f_1(x)}{f_0(x)} < k \end{cases}$$

where $k$ and $\gamma$ are chosen so that $\mathbb{E}_{\theta_0}[\phi^*(X)] = \alpha$, is the most powerful level-$\alpha$ test. That is, for any other test $\phi$ with $\mathbb{E}_{\theta_0}[\phi(X)] \leq \alpha$:

$$\mathbb{E}_{\theta_1}[\phi^*(X)] \geq \mathbb{E}_{\theta_1}[\phi(X)]$$

Intuition

The likelihood ratio $f_1(x)/f_0(x)$ measures how much more likely the data $x$ is under $H_1$ than under $H_0$. Rejecting when this ratio is large is the optimal strategy for distinguishing the two hypotheses: data points where $H_1$ is much more likely than $H_0$ provide the strongest evidence against $H_0$, so the test allocates its rejection budget to those points first.
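A quick Monte Carlo sketch of the lemma in action (a hypothetical setup: $H_0: \mathcal{N}(0,1)$ vs $H_1: \mathcal{N}(1,1)$ with $n = 25$ i.i.d. observations, where the likelihood ratio is monotone in the sample mean):

```python
import numpy as np
from scipy.stats import norm

# Simple vs simple: H0: N(0,1) vs H1: N(1,1), n i.i.d. observations.
# The likelihood ratio exp(n*xbar - n/2) is increasing in xbar, so the
# Neyman-Pearson test rejects when xbar exceeds a size-alpha threshold.
n, alpha, reps = 25, 0.05, 100_000
threshold = norm.ppf(1 - alpha) / np.sqrt(n)  # sigma = 1

rng = np.random.default_rng(0)
null_means = rng.normal(0.0, 1.0, size=(reps, n)).mean(axis=1)
alt_means = rng.normal(1.0, 1.0, size=(reps, n)).mean(axis=1)

size = (null_means > threshold).mean()   # Type I error rate, ~alpha
power = (alt_means > threshold).mean()   # probability of rejecting a false H0
print(size, power)
```

With this well-separated alternative the empirical size sits near 0.05 while the power is close to 1.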

Proof Sketch

Let $\phi$ be any level-$\alpha$ test. We need to show $\mathbb{E}_1[\phi^*] \geq \mathbb{E}_1[\phi]$. Write $\mathbb{E}_1[\phi^* - \phi] = \int (\phi^* - \phi)(f_1 - k f_0)\, d\mu + k \int (\phi^* - \phi) f_0\, d\mu$. By construction of $\phi^*$: when $f_1 > k f_0$, $\phi^* = 1 \geq \phi$; when $f_1 < k f_0$, $\phi^* = 0 \leq \phi$. So $(\phi^* - \phi)(f_1 - k f_0) \geq 0$ everywhere. The second integral equals $k(\alpha - \mathbb{E}_0[\phi]) \geq 0$ since $\phi$ has level at most $\alpha$. Both terms are nonnegative, so $\mathbb{E}_1[\phi^*] \geq \mathbb{E}_1[\phi]$.

Why It Matters

The Neyman-Pearson lemma is one of the cleanest optimality results in statistics. It says the optimal test statistic is the likelihood ratio, period. Every commonly used test (t-test, z-test, chi-squared test) can be understood as a likelihood ratio test for a specific distributional assumption.

Failure Mode

The lemma applies only to simple hypotheses (point null vs. point alternative). For composite hypotheses (e.g., $H_0: \theta \leq \theta_0$), the most powerful test depends on which $\theta_1$ you want power against, and a uniformly most powerful test may not exist.

Uniformly Most Powerful Tests

Theorem

UMP Tests via Monotone Likelihood Ratio

Statement

If the family $\{f_\theta\}$ has a monotone likelihood ratio in $T(X)$ (i.e., $f_{\theta_1}(x)/f_{\theta_0}(x)$ is nondecreasing in $T(x)$ for $\theta_1 > \theta_0$), then for testing $H_0: \theta \leq \theta_0$ versus $H_1: \theta > \theta_0$, the test that rejects for large $T(X)$:

$$\phi^*(x) = \begin{cases} 1 & \text{if } T(x) > c \\ \gamma & \text{if } T(x) = c \\ 0 & \text{if } T(x) < c \end{cases}$$

where $c$ and $\gamma$ give size $\alpha$, is uniformly most powerful (UMP): it has the highest power against every $\theta_1 > \theta_0$ simultaneously.

Intuition

When the likelihood ratio is monotone in $T(X)$, the Neyman-Pearson test for any specific $\theta_1 > \theta_0$ always rejects for large $T(X)$. Since the test does not depend on which $\theta_1$ we target, it is simultaneously most powerful against all alternatives on one side. Exponential families always have a monotone likelihood ratio in their natural sufficient statistic.

Proof Sketch

For any $\theta_1 > \theta_0$, the Neyman-Pearson test rejects when $f_{\theta_1}/f_{\theta_0} > k$. By the monotone likelihood ratio property, this is equivalent to $T(x) > c(\theta_1)$. But the size constraint $\mathbb{E}_{\theta_0}[\phi(X)] = \alpha$ determines the threshold uniquely, so $c(\theta_1) = c$ for all $\theta_1$; the same test is most powerful against every $\theta_1 > \theta_0$. Finally, MLR implies the power function is nondecreasing in $\theta$, so size $\alpha$ at $\theta_0$ also controls the Type I error over the whole composite null $\theta \leq \theta_0$.
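The key step — that every one-sided alternative induces the same rejection region — can be checked numerically. A small sketch (assuming the $\mathcal{N}(\theta, 1)$ family, with illustrative sample sizes):

```python
import numpy as np

# For the N(theta, 1) family, the log likelihood ratio of theta1 vs theta0 is
# (theta1 - theta0) * sum(x) - n * (theta1**2 - theta0**2) / 2,
# strictly increasing in T(x) = sum(x) whenever theta1 > theta0 (the MLR property).
def log_lr(x, theta1, theta0=0.0):
    n = len(x)
    return (theta1 - theta0) * np.sum(x) - n * (theta1**2 - theta0**2) / 2

rng = np.random.default_rng(1)
samples = [rng.normal(size=10) for _ in range(200)]

# Every alternative theta1 > theta0 orders the samples identically, so the
# size-alpha rejection region {T(x) > c} does not depend on theta1.
order_a = np.argsort([log_lr(s, 0.5) for s in samples])
order_b = np.argsort([log_lr(s, 2.0) for s in samples])
print(np.array_equal(order_a, order_b))  # True
```

Because both alternatives sort samples in the same order, thresholding at the size-$\alpha$ quantile yields one and the same test.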

Why It Matters

UMP tests exist only in restricted settings (one-parameter families with one-sided alternatives). For two-sided alternatives or multiparameter families, UMP tests typically do not exist, and one must settle for locally most powerful or likelihood ratio tests.

Failure Mode

For two-sided alternatives ($H_1: \theta \neq \theta_0$), no UMP test exists in general. The Neyman-Pearson test for $\theta_1 > \theta_0$ differs from the test for $\theta_1 < \theta_0$. Common practice uses the two-sided likelihood ratio test, which is not UMP but is unbiased.

Connection to Binary Classification

Hypothesis testing and binary classification solve the same problem: given an observation $x$, decide between two classes. The Neyman-Pearson lemma says the optimal decision boundary is a level set of the likelihood ratio $f_1(x)/f_0(x)$. This is equivalent to the Bayes-optimal classifier when the class priors are adjusted to match the significance level.

Specifically: the ROC curve of the likelihood ratio classifier dominates the ROC curve of any other classifier. Every point on the ROC curve corresponds to a Neyman-Pearson test at a different level $\alpha$.
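A simulation sketch of this dominance (hypothetical setup: class 0 is $\mathcal{N}(0,1)$ and class 1 is $\mathcal{N}(0,4)$, so the likelihood ratio is monotone in $x^2$ while the raw value $x$ is an uninformative score):

```python
import numpy as np

rng = np.random.default_rng(2)
m = 50_000
x0 = rng.normal(0.0, 1.0, m)   # class 0: N(0, 1)
x1 = rng.normal(0.0, 2.0, m)   # class 1: N(0, 4); LR is monotone in x**2

def auc(scores0, scores1):
    # Mann-Whitney estimate of P(score1 > score0) = area under the ROC curve.
    n0, n1 = len(scores0), len(scores1)
    ranks = np.concatenate([scores0, scores1]).argsort().argsort() + 1
    r1 = ranks[n0:].sum()
    return (r1 - n1 * (n1 + 1) / 2) / (n0 * n1)

auc_lr = auc(x0**2, x1**2)   # score equivalent to the likelihood ratio
auc_raw = auc(x0, x1)        # naive score: x itself carries no information here
print(round(auc_lr, 3), round(auc_raw, 3))
```

The likelihood-ratio score achieves AUC near $\tfrac{2}{\pi}\arctan 2 \approx 0.70$, while thresholding raw $x$ gives AUC $\approx 0.5$: its ROC curve sits on the diagonal, strictly below the likelihood ratio's.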

Common Confusions

Watch Out

Power is not 1 minus the p-value

The p-value is a random variable computed from data. Power is a fixed property of the test design, computed before seeing data. Power is $P_{\theta_1}(\text{reject } H_0)$ for a specific alternative $\theta_1$; the p-value is $P_{\theta_0}(T \geq T_{\text{obs}})$. They measure different things.
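To make the distinction concrete, here is a hypothetical simulation (one-sided z-test, $\sigma = 1$, $n = 25$): the p-value fluctuates from sample to sample, while the rejection rate under a fixed alternative is the single number we call power.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)
n, alpha, mu1, reps = 25, 0.05, 0.5, 20_000

# One-sided z-test p-value: P_0(Xbar >= observed xbar), with sigma = 1.
def p_values(mu):
    xbar = rng.normal(mu, 1.0, size=(reps, n)).mean(axis=1)
    return 1 - norm.cdf(np.sqrt(n) * xbar)

p_null, p_alt = p_values(0.0), p_values(mu1)

# Under H0 the p-value is Uniform(0,1), so P(p < alpha) = alpha.
# Under H1 the rejection frequency is the power at mu1.
print((p_null < alpha).mean())
print((p_alt < alpha).mean())
```

The first rate sits near $\alpha = 0.05$; the second matches the closed-form power $\Phi(\sqrt{25} \cdot 0.5 - 1.645) \approx 0.80$.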

Watch Out

A test can be most powerful and still have low power

The Neyman-Pearson lemma says the likelihood ratio test is the best among all level-$\alpha$ tests. It does not say the power is high. If the sample size is small or $\theta_1$ is close to $\theta_0$, even the most powerful test may have low power. "Most powerful" is a relative statement, not an absolute one.
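A quick closed-form illustration (one-sided z-test for a $\mathcal{N}(\theta, 1)$ mean; the specific numbers are illustrative): the test is most powerful at every sample size, yet its power can still be far below 1.

```python
from math import sqrt
from scipy.stats import norm

# Power of the most powerful (one-sided z) test of H0: theta = 0 at level alpha,
# against alternative theta1, with unit variance and n observations.
def power(n, theta1, alpha=0.05):
    return 1 - norm.cdf(norm.ppf(1 - alpha) - sqrt(n) * theta1)

# Optimality does not guarantee high power: with n = 5 and theta1 = 0.2,
# even the best level-0.05 test rejects a false null only ~12% of the time.
print(round(power(5, 0.2), 3))
print(round(power(100, 0.2), 3))
```

No level-0.05 test can beat these numbers, but the first one is still far too low to be useful, which is exactly why sample size calculations matter.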

Canonical Examples

Example

Testing a Gaussian mean

Let $X_1, \ldots, X_n \sim \mathcal{N}(\mu, \sigma^2)$ with known $\sigma^2$. Test $H_0: \mu = 0$ vs $H_1: \mu = 1$. The likelihood ratio is $\exp(n\bar{X}/\sigma^2 - n/(2\sigma^2))$, which is monotone increasing in $\bar{X}$. The Neyman-Pearson test rejects when $\bar{X} > \sigma z_\alpha / \sqrt{n}$, where $z_\alpha$ is the upper-$\alpha$ standard normal quantile. Power at $\mu = 1$ is $\Phi(\sqrt{n}/\sigma - z_\alpha)$. For $n = 25$ and $\sigma = 2$, power is $\Phi(2.5 - 1.645) \approx \Phi(0.855) \approx 0.80$.
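The arithmetic in this example can be checked directly (a sketch using scipy's normal quantile and CDF):

```python
from math import sqrt
from scipy.stats import norm

# Example parameters: n = 25, sigma = 2, H0: mu = 0 vs H1: mu = 1, alpha = 0.05.
n, sigma, alpha = 25, 2.0, 0.05
z_alpha = norm.ppf(1 - alpha)             # ~1.645
threshold = sigma * z_alpha / sqrt(n)     # reject when Xbar exceeds this
power = norm.cdf(sqrt(n) / sigma - z_alpha)
print(round(threshold, 3), round(power, 3))  # 0.658 0.804
```

So the test rejects when $\bar{X} > 0.658$, and the power at $\mu = 1$ is about 0.80, as stated.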

Key Takeaways

  • The Neyman-Pearson lemma: the likelihood ratio test is the most powerful test for simple hypotheses
  • UMP tests exist for one-sided alternatives in exponential families via the monotone likelihood ratio
  • The power function $\beta_\phi(\theta)$ characterizes a test across the entire parameter space
  • The ROC curve of the likelihood ratio classifier dominates all other classifiers
  • UMP tests do not exist for two-sided alternatives in general

Exercises

ExerciseCore

Problem

Let $X \sim \text{Bernoulli}(p)$ with a single observation. For testing $H_0: p = 0.5$ vs $H_1: p = 0.8$ at level $\alpha = 0.5$, write down the Neyman-Pearson test and compute its power.

ExerciseAdvanced

Problem

Prove that for testing $H_0: \mu = 0$ vs $H_1: \mu \neq 0$ with $X \sim \mathcal{N}(\mu, 1)$ (a single observation), no UMP level-$\alpha$ test exists.

References

Canonical:

  • Lehmann & Romano, Testing Statistical Hypotheses (3rd ed., 2005), Chapters 3-4
  • Casella & Berger, Statistical Inference (2nd ed., 2002), Chapter 8

Current:

  • Wasserman, All of Statistics (2004), Chapter 10
  • van der Vaart, Asymptotic Statistics (1998), Chapters 2-8
  • Keener, Theoretical Statistics (2010), Chapters 3-8

Last reviewed: April 2026
