Statistical Foundations
Neyman-Pearson and Hypothesis Testing Theory
The likelihood ratio test is the most powerful test for simple hypotheses (the Neyman-Pearson lemma); UMP tests extend this optimality to one-sided composite alternatives; and the power function characterizes a test's behavior across the entire parameter space.
Why This Matters
Hypothesis testing is the formal framework for making binary decisions from data: is the drug effective? Is the model better than the baseline? The Neyman-Pearson lemma answers a precise optimization question: among all tests with Type I error at most $\alpha$, which test has the highest probability of correctly rejecting a false null hypothesis?
The answer is the likelihood ratio test. This result is the foundation for understanding power analysis, sample size calculations, and the deep connection between hypothesis testing and binary classification.
Formal Setup
Hypothesis Test
A hypothesis test for $H_0: \theta \in \Theta_0$ versus $H_1: \theta \in \Theta_1$ is a function $\phi: \mathcal{X} \to [0, 1]$, where $\phi(x)$ is the probability of rejecting $H_0$ given data $x$. The size (Type I error rate) is $\sup_{\theta \in \Theta_0} E_\theta[\phi(X)]$. The power against $\theta_1 \in \Theta_1$ is $E_{\theta_1}[\phi(X)]$.
A test has level $\alpha$ if its size is at most $\alpha$.
Power Function
The power function of a test $\phi$ is:

$$\beta(\theta) = E_\theta[\phi(X)] = P_\theta(\text{reject } H_0).$$

For a good test, $\beta(\theta_0) \le \alpha$ (controlled Type I error) and $\beta(\theta)$ is large for $\theta$ far from $\theta_0$ (high power against alternatives).
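As a concrete sketch (an illustrative example of my own, not from the text): the power function of a one-sided z-test with a single $N(\theta, 1)$ observation can be evaluated directly from the normal CDF, using only the Python standard library.

```python
from statistics import NormalDist

# One-sided z-test from a single observation X ~ N(theta, 1):
# reject H0: theta <= 0 when X > z_{0.95}.  (Illustrative example.)
N = NormalDist()
c = N.inv_cdf(0.95)  # critical value, ~1.645

# beta(theta) = P_theta(X > c), evaluated on a small grid of alternatives
beta = {theta: 1 - NormalDist(mu=theta).cdf(c) for theta in [0.0, 0.5, 1.0, 2.0]}

print({t: round(b, 3) for t, b in beta.items()})
```

At $\theta = 0$ the power equals the size $\alpha = 0.05$, and it rises monotonically as $\theta$ moves away from the null.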
Main Theorems
Neyman-Pearson Lemma
Statement
For testing $H_0: \theta = \theta_0$ vs $H_1: \theta = \theta_1$ at level $\alpha$, the test that rejects when the likelihood ratio exceeds a threshold:

$$\phi(x) = \begin{cases} 1 & \text{if } \Lambda(x) > k \\ \gamma & \text{if } \Lambda(x) = k \\ 0 & \text{if } \Lambda(x) < k \end{cases}, \qquad \Lambda(x) = \frac{p_{\theta_1}(x)}{p_{\theta_0}(x)},$$

where $k$ and $\gamma \in [0, 1]$ are chosen so that $E_{\theta_0}[\phi(X)] = \alpha$, is the most powerful level-$\alpha$ test. That is, for any other test $\phi'$ with $E_{\theta_0}[\phi'(X)] \le \alpha$:

$$E_{\theta_1}[\phi(X)] \ge E_{\theta_1}[\phi'(X)].$$
Intuition
The likelihood ratio measures how much more likely the data is under $H_1$ than under $H_0$. Rejecting when this ratio is large is the optimal strategy for distinguishing the two hypotheses. Data points where $H_1$ is much more likely than $H_0$ provide the strongest evidence against $H_0$, so the test allocates its rejection budget to these points first.
Proof Sketch
Let $\phi'$ be any level-$\alpha$ test. We need to show $E_{\theta_1}[\phi(X)] \ge E_{\theta_1}[\phi'(X)]$. Write

$$\int (\phi - \phi')\, p_{\theta_1}\, dx = \int (\phi - \phi')(p_{\theta_1} - k\, p_{\theta_0})\, dx + k \int (\phi - \phi')\, p_{\theta_0}\, dx.$$

By construction of $\phi$: when $p_{\theta_1}(x) > k\, p_{\theta_0}(x)$, $\phi(x) = 1 \ge \phi'(x)$; when $p_{\theta_1}(x) < k\, p_{\theta_0}(x)$, $\phi(x) = 0 \le \phi'(x)$. So the first integrand $(\phi - \phi')(p_{\theta_1} - k\, p_{\theta_0}) \ge 0$ everywhere. The second integral equals $k(\alpha - E_{\theta_0}[\phi'(X)]) \ge 0$ since $\phi'$ has level $\alpha$. Both terms are nonnegative, so $E_{\theta_1}[\phi(X)] \ge E_{\theta_1}[\phi'(X)]$.
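The optimality claim can also be checked numerically. The sketch below (an illustrative setup I chose, using the standard library's `statistics.NormalDist`) compares the Neyman-Pearson test for $H_0: N(0,1)$ vs $H_1: N(1,1)$ against a competing level-$\alpha$ test that ignores the likelihood ratio:

```python
from statistics import NormalDist

# Simple vs simple: H0: X ~ N(0,1) vs H1: X ~ N(1,1), one observation.
# The likelihood ratio exp(x - 1/2) is increasing in x, so the NP test
# rejects for large x.
N = NormalDist()  # standard normal
alpha = 0.05

# NP test: reject when x > c, with c chosen so P_{H0}(X > c) = alpha.
c_np = N.inv_cdf(1 - alpha)
power_np = 1 - NormalDist(mu=1).cdf(c_np)  # P_{H1}(X > c)

# A competing level-alpha test that ignores the likelihood ratio:
# reject when |x| > c2, with c2 chosen so P_{H0}(|X| > c2) = alpha.
c2 = N.inv_cdf(1 - alpha / 2)
power_other = (1 - NormalDist(mu=1).cdf(c2)) + NormalDist(mu=1).cdf(-c2)

print(f"NP power: {power_np:.3f}, competing test power: {power_other:.3f}")
```

As the lemma predicts, the likelihood ratio test's power strictly exceeds the competitor's at the same size.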
Why It Matters
The Neyman-Pearson lemma is one of the cleanest optimality results in statistics. It says the optimal test statistic is the likelihood ratio, period. Every commonly used test (t-test, z-test, chi-squared test) can be understood as a likelihood ratio test for a specific distributional assumption.
Failure Mode
The lemma applies only to simple hypotheses (point null vs. point alternative). For composite alternatives ($H_1: \theta \ne \theta_0$), the most powerful test depends on which alternative $\theta_1$ you want power against, and a uniformly most powerful test may not exist.
Uniformly Most Powerful Tests
UMP Tests via Monotone Likelihood Ratio
Statement
If the family $\{p_\theta\}$ has a monotone likelihood ratio in a statistic $T(x)$ (i.e., $p_{\theta_2}(x)/p_{\theta_1}(x)$ is nondecreasing in $T(x)$ for $\theta_1 < \theta_2$), then for testing $H_0: \theta \le \theta_0$ versus $H_1: \theta > \theta_0$, the test that rejects for large $T(x)$:

$$\phi(x) = \begin{cases} 1 & \text{if } T(x) > c \\ \gamma & \text{if } T(x) = c \\ 0 & \text{if } T(x) < c \end{cases}$$

where $c$ and $\gamma$ give size $\alpha$ (i.e., $E_{\theta_0}[\phi(X)] = \alpha$), is uniformly most powerful (UMP). It has the highest power against every $\theta_1 > \theta_0$ simultaneously.
Intuition
When the likelihood ratio is monotone in $T(x)$, the Neyman-Pearson test for any specific alternative $\theta_1 > \theta_0$ always rejects for large $T(x)$. Since the rejection region does not depend on which $\theta_1$ we target, the test is simultaneously most powerful against all alternatives on one side. Exponential families always have a monotone likelihood ratio in their natural sufficient statistic.
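As a quick numerical illustration (my own example, using the Poisson family), one can verify the monotone likelihood ratio property directly: for $\lambda_2 > \lambda_1$ the ratio $p_{\lambda_2}(k)/p_{\lambda_1}(k) = (\lambda_2/\lambda_1)^k e^{\lambda_1 - \lambda_2}$ is increasing in $k$.

```python
from math import exp, factorial

def pois_pmf(k, lam):
    """Poisson pmf P(X = k) for rate lam."""
    return lam ** k * exp(-lam) / factorial(k)

# MLR check: for lam2 > lam1, the ratio p_{lam2}(k) / p_{lam1}(k)
# should be strictly increasing in k.
lam1, lam2 = 2.0, 5.0
ratios = [pois_pmf(k, lam2) / pois_pmf(k, lam1) for k in range(20)]
assert all(b > a for a, b in zip(ratios, ratios[1:]))
print("MLR holds; consecutive ratio factor:", round(ratios[1] / ratios[0], 3))
```

Each step multiplies the ratio by the constant factor $\lambda_2/\lambda_1 = 2.5$, which is exactly why rejecting for large $k$ is optimal against every $\lambda > \lambda_1$.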
Proof Sketch
For any $\theta_1 > \theta_0$, the Neyman-Pearson test rejects when $p_{\theta_1}(x)/p_{\theta_0}(x) > k$. By the monotone likelihood ratio property, this is equivalent to rejecting when $T(x) > c$ for some $c$. But the size constraint $E_{\theta_0}[\phi(X)] = \alpha$ determines $c$ uniquely, independently of $\theta_1$, so the rejection region is the same for all $\theta_1 > \theta_0$. The same test is therefore most powerful against every $\theta_1 > \theta_0$.
Why It Matters
UMP tests exist only in restricted settings (one-parameter families with one-sided alternatives). For two-sided alternatives or multiparameter families, UMP tests typically do not exist, and one must settle for locally most powerful or likelihood ratio tests.
Failure Mode
For two-sided alternatives ($H_1: \theta \ne \theta_0$), no UMP test exists in general. The Neyman-Pearson test for $\theta_1 > \theta_0$ differs from the test for $\theta_1 < \theta_0$. Common practice uses the two-sided likelihood ratio test, which is not UMP but is unbiased.
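A small sketch of why no single test works for both sides (illustrative values I chose, single $N(\mu, 1)$ observation): the one-sided test aimed at $\mu > 0$ has almost no power at $\mu = -1$, while the test aimed at $\mu < 0$ does.

```python
from statistics import NormalDist

# Single observation X ~ N(mu, 1), alpha = 0.05.  (Illustrative values.)
N = NormalDist()
c = N.inv_cdf(0.95)  # ~1.645

# NP test aimed at mu > 0 rejects when x > c; its power at mu = -1:
power_right_at_minus1 = 1 - NormalDist(mu=-1).cdf(c)
# NP test aimed at mu < 0 rejects when x < -c; its power at mu = -1:
power_left_at_minus1 = NormalDist(mu=-1).cdf(-c)

print(round(power_right_at_minus1, 4), round(power_left_at_minus1, 4))
```

Since each one-sided test is nearly blind to alternatives on the opposite side, no single test can be most powerful against both simultaneously.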
Connection to Binary Classification
Hypothesis testing and binary classification solve the same problem: given an observation $x$, decide between two classes. The Neyman-Pearson lemma says the optimal decision boundary is a level set of the likelihood ratio $p_1(x)/p_0(x)$. This is equivalent to the Bayes-optimal classifier when the class priors are adjusted to match the significance level.
Specifically: the ROC curve of the likelihood ratio classifier dominates the ROC curve of any other classifier. Every point on the ROC curve corresponds to a Neyman-Pearson test at a different level $\alpha$.
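The ROC dominance can be seen in simulation. In the sketch below (an illustrative setup I chose, not from the text), the null is $N(0, 1)$ and the alternative $N(0.5, 2)$; because the variances differ, the likelihood ratio is not monotone in $x$, so scoring by the raw observation genuinely differs from scoring by the likelihood ratio, and the latter achieves a higher AUC.

```python
import bisect
import random
from math import exp, log, pi, sqrt

random.seed(0)

def pdf(x, mu, sigma):
    """Normal density."""
    return exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * sqrt(2 * pi))

# Class 0 (null): N(0, 1).  Class 1 (alternative): N(0.5, 2).
# The log-likelihood ratio is quadratic in x, so thresholding raw x
# is NOT the same as thresholding the likelihood ratio.
x0 = [random.gauss(0.0, 1.0) for _ in range(20000)]
x1 = [random.gauss(0.5, 2.0) for _ in range(20000)]

def llr(x):
    return log(pdf(x, 0.5, 2.0)) - log(pdf(x, 0.0, 1.0))

def auc(score):
    """AUC = P(score of a random class-1 point exceeds a class-0 point)."""
    s0 = sorted(score(x) for x in x0)
    wins = sum(bisect.bisect_left(s0, score(x)) for x in x1)
    return wins / (len(x0) * len(x1))

auc_lr, auc_raw = auc(llr), auc(lambda x: x)
print(round(auc_lr, 3), round(auc_raw, 3))
```

The raw-$x$ score misranks the left tail (a very negative $x$ is actually evidence for the high-variance alternative), which is exactly the information the likelihood ratio recovers.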
Common Confusions
Power is not 1 minus the p-value
The p-value is a random variable computed from the data. Power is a fixed property of the test design, computed before seeing data: power is $\beta(\theta_1) = P_{\theta_1}(\text{reject } H_0)$ for a specific alternative $\theta_1$. The p-value is $P_{\theta_0}(T(X) \ge t_{\text{obs}})$, the null probability of a test statistic at least as extreme as the one observed. They measure different things.
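The distinction can be made concrete (an illustrative simulation, not from the text): under $H_0$ the one-sided p-value is itself approximately Uniform(0, 1) across repeated experiments, while the power of the design is a single deterministic number.

```python
import random
from statistics import NormalDist

random.seed(1)
N = NormalDist()

# Under H0: X ~ N(0, 1), the one-sided p-value P(Z >= x) is a random
# variable -- approximately Uniform(0, 1) over repeated experiments.
pvals = [1 - N.cdf(random.gauss(0.0, 1.0)) for _ in range(10000)]
mean_p = sum(pvals) / len(pvals)

# Power, by contrast, is fixed at design time (here: alpha=0.05, mu1=1, n=1).
alpha, mu1 = 0.05, 1.0
power = 1 - NormalDist(mu=mu1).cdf(N.inv_cdf(1 - alpha))
print(round(mean_p, 3), round(power, 3))
```

The simulated p-values average about 1/2 and fluctuate from run to run; the power is the same number every time.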
A test can be most powerful and still have low power
The Neyman-Pearson lemma says the likelihood ratio test is the best among all level-$\alpha$ tests. It does not say the power is high. If the sample size is small or $\theta_1$ is close to $\theta_0$, even the most powerful test may have low power. "Most powerful" is a relative statement, not an absolute one.
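A minimal numeric sketch (illustrative values I chose): for a small effect and a small sample, the optimal one-sided z-test still has low power.

```python
from statistics import NormalDist

# Optimal one-sided z-test, alpha = 0.05, for H0: mu = 0 vs H1: mu = 0.1
# with n = 10 observations of N(mu, 1).  (Illustrative values.)
n, alpha, mu1 = 10, 0.05, 0.1
N = NormalDist()
power = N.cdf(n ** 0.5 * mu1 - N.inv_cdf(1 - alpha))
print(round(power, 3))  # low, even though no level-0.05 test can do better
```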
Canonical Examples
Testing a Gaussian mean
Let $X_1, \dots, X_n \sim N(\mu, \sigma^2)$ with known $\sigma^2$. Test $H_0: \mu = \mu_0$ vs $H_1: \mu > \mu_0$. The likelihood ratio is $\Lambda(x) = \exp\!\left(\frac{n(\mu_1 - \mu_0)}{\sigma^2}\left(\bar{X} - \frac{\mu_0 + \mu_1}{2}\right)\right)$, which is monotone increasing in $\bar{X}$. The Neyman-Pearson test rejects when $\bar{X} > \mu_0 + z_\alpha \sigma/\sqrt{n}$, where $z_\alpha$ is the upper-$\alpha$ standard normal quantile. Power at $\mu_1 > \mu_0$ is $\Phi\!\left(\frac{\sqrt{n}(\mu_1 - \mu_0)}{\sigma} - z_\alpha\right)$; plugging in a concrete $n$, $\alpha$, and effect size $\mu_1 - \mu_0$ gives the achieved power.
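The power formula translates directly into code; the sketch below (standard library only, with illustrative parameter values of my own) evaluates it.

```python
from math import sqrt
from statistics import NormalDist

def power_one_sided_z(n, sigma, alpha, mu0, mu1):
    """Power of the level-alpha one-sided z-test of H0: mu = mu0 at mu1 > mu0."""
    z_alpha = NormalDist().inv_cdf(1 - alpha)
    # Reject when xbar > mu0 + z_alpha * sigma / sqrt(n); power at mu1:
    return NormalDist().cdf(sqrt(n) * (mu1 - mu0) / sigma - z_alpha)

# Illustrative values (not from the text): n=25, sigma=1, alpha=0.05, shift 0.5
print(round(power_one_sided_z(n=25, sigma=1.0, alpha=0.05, mu0=0.0, mu1=0.5), 3))
```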
Key Takeaways
- The Neyman-Pearson lemma: the likelihood ratio test is the most powerful test for simple hypotheses
- UMP tests exist for one-sided alternatives in exponential families via the monotone likelihood ratio
- The power function characterizes a test across the entire parameter space
- The ROC curve of the likelihood ratio classifier dominates that of any other classifier
- UMP tests do not exist for two-sided alternatives in general
Exercises
Problem
Let $X \sim N(\theta, 1)$ with a single observation. For testing $H_0: \theta = 0$ vs $H_1: \theta = 1$ at level $\alpha$, write down the Neyman-Pearson test and compute its power.
Problem
Prove that for testing $H_0: \theta = \theta_0$ vs $H_1: \theta \ne \theta_0$ with $X \sim N(\theta, 1)$ (single observation), no UMP level-$\alpha$ test exists.
References
Canonical:
- Lehmann & Romano, Testing Statistical Hypotheses (3rd ed., 2005), Chapters 3-4
- Casella & Berger, Statistical Inference (2nd ed., 2002), Chapter 8
Current:
- Wasserman, All of Statistics (2004), Chapter 10
- van der Vaart, Asymptotic Statistics (1998), Chapters 2-8
- Keener, Theoretical Statistics (2010), Chapters 3-8
Next Topics
- Hypothesis testing for ML: multiple testing, A/B testing, and model comparison
- Bootstrap methods: nonparametric alternatives to parametric tests
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Common Probability Distributions (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
- Maximum Likelihood Estimation (Layer 0B)
- Differentiation in $\mathbb{R}^n$ (Layer 0A)