Statistical Foundations
Detection Theory
Binary hypothesis testing, the Neyman-Pearson lemma (likelihood ratio tests are most powerful), ROC curves, Bayesian detection, and sequential testing. Classification IS detection theory. ROC/AUC comes directly from here.
Why This Matters
Every binary classifier is a detector. When your spam filter decides whether an email is spam or not-spam, it is performing binary hypothesis testing. When you plot an ROC curve and compute AUC, you are using the exact framework developed by Neyman, Pearson, and Wald in the 1930s-40s.
Detection theory gives you the optimal answer to the classification question: given a probabilistic model of the data (see common probability distributions), what is the best possible classifier, and how do you build it? The answer is the likelihood ratio test, and the Neyman-Pearson lemma proves it is optimal. Everything in ML classification, from logistic regression to neural networks, is trying to approximate this optimal detector.
Mental Model
You observe data and must decide between two hypotheses: $H_0$ (nothing interesting is happening) and $H_1$ (something is happening). You will make errors either way: false alarms (saying $H_1$ when $H_0$ is true) and misses (saying $H_0$ when $H_1$ is true). Detection theory asks: what is the best tradeoff between these two types of errors?
The answer is remarkably clean: compute the likelihood ratio and compare it to a threshold. Every optimal detector has this form.
Formal Setup and Notation
Binary Hypothesis Testing
Given an observation $x$, decide between:

$$H_0: x \sim p_0(x) \qquad \text{vs} \qquad H_1: x \sim p_1(x)$$

where $p_0$ and $p_1$ are known probability densities (or mass functions).
A decision rule (detector) $\delta$ maps observations to decisions, $\delta(x) \in \{0, 1\}$. $\delta(x) = 1$ means "decide $H_1$."
Error Probabilities
Probability of false alarm (Type I error, false positive rate): $P_{FA} = P(\delta(x) = 1 \mid H_0)$

Probability of miss (Type II error, false negative rate): $P_M = P(\delta(x) = 0 \mid H_1)$

Probability of detection (power, true positive rate, sensitivity): $P_D = 1 - P_M = P(\delta(x) = 1 \mid H_1)$
Likelihood Ratio
The likelihood ratio for observation $x$ is:

$$\Lambda(x) = \frac{p_1(x)}{p_0(x)}$$

A likelihood ratio test (LRT) with threshold $\eta$ decides $H_1$ if $\Lambda(x) > \eta$ and $H_0$ if $\Lambda(x) < \eta$.
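A minimal numeric sketch of an LRT, assuming two unit-variance Gaussian classes with means 0 and 1 (the function names here are illustrative, not from any library):

```python
import math

def gaussian_pdf(x, mu, sigma=1.0):
    """Density of N(mu, sigma^2) evaluated at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

def likelihood_ratio(x, mu0=0.0, mu1=1.0, sigma=1.0):
    """Lambda(x) = p1(x) / p0(x)."""
    return gaussian_pdf(x, mu1, sigma) / gaussian_pdf(x, mu0, sigma)

def lrt_decide(x, eta=1.0):
    """Decide H1 (return 1) iff Lambda(x) > eta, else H0 (return 0)."""
    return 1 if likelihood_ratio(x) > eta else 0

# With eta = 1 and equal variances, the decision boundary is the
# midpoint x = 0.5 between the two means.
```

With $\eta = 1$ the rule reduces to picking the nearer mean; raising $\eta$ trades away detections for fewer false alarms.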
Main Theorems
Neyman-Pearson Lemma
Statement
For testing $H_0$ vs $H_1$, among all decision rules with false alarm probability $P_{FA} \le \alpha$, the likelihood ratio test

$$\delta(x) = 1 \iff \Lambda(x) > \eta,$$

where $\eta$ is chosen so that $P_{FA} = \alpha$, achieves the maximum probability of detection $P_D$. No other test with the same false alarm constraint can have higher power.
Intuition
The likelihood ratio measures how much more likely the observation is under $H_1$ than under $H_0$. It is natural to decide $H_1$ when this ratio is large. The Neyman-Pearson lemma says this intuition is not just natural but provably optimal. You cannot do better.
Proof Sketch
Let $\delta$ be the LRT with threshold $\eta$ and $\delta'$ be any other test with $P_{FA}(\delta') \le P_{FA}(\delta) = \alpha$. Consider the difference in detection probabilities:

$$P_D(\delta) - P_D(\delta') = \int \left[\delta(x) - \delta'(x)\right] p_1(x)\, dx$$

In the region where $\delta(x) = 1$ but $\delta'(x) = 0$, we have $\Lambda(x) > \eta$, so $p_1(x) > \eta\, p_0(x)$. In the region where $\delta(x) = 0$ but $\delta'(x) = 1$, we have $\Lambda(x) \le \eta$, so $p_1(x) \le \eta\, p_0(x)$. Using the constraint that both tests satisfy $P_{FA} \le \alpha$, the positive contribution dominates, giving

$$P_D(\delta) - P_D(\delta') \ge \eta\left[P_{FA}(\delta) - P_{FA}(\delta')\right] \ge 0.$$
Why It Matters
The Neyman-Pearson lemma is the theoretical foundation of all classification metrics. It tells us that the ROC curve of the LRT dominates the ROC curve of every other test. In ML terms: if you know the true class-conditional densities, the optimal classifier is the LRT, and every ML model is trying to learn an approximation of it.
Failure Mode
The lemma assumes you know $p_0$ and $p_1$ exactly. In ML, you never know the true distributions. You learn an approximation from data. The gap between the optimal LRT and a learned classifier is the price of not knowing the true distributions.
ROC Curves
Receiver Operating Characteristic (ROC) Curve
The ROC curve plots $P_D$ (true positive rate) against $P_{FA}$ (false positive rate) as the threshold $\eta$ varies from $0$ to $\infty$.
The area under the ROC curve (AUC) equals the probability that a randomly chosen positive example has a higher score than a randomly chosen negative example:

$$\text{AUC} = P(S_1 > S_0)$$

where $S_1$ is the score of a random positive example and $S_0$ is the score of a random negative example.
Properties of ROC curves:
- A random classifier gives the diagonal line from $(0,0)$ to $(1,1)$
- A perfect classifier gives the point $(0,1)$
- The Neyman-Pearson lemma guarantees the LRT produces the highest ROC curve
- ROC curves are invariant to monotone transformations of the score (you can use $\log \Lambda(x)$ instead of $\Lambda(x)$)
- AUC $= 0.5$ for random guessing, AUC $= 1$ for perfect classification
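The rank interpretation of AUC and its invariance to monotone transformations can both be checked numerically; this is an illustrative sketch with toy scores, not a library API:

```python
import math

def auc_rank(pos_scores, neg_scores):
    """Estimate AUC = P(score of random positive > score of random negative),
    counting ties as 1/2, by brute-force pairwise comparison."""
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos_scores for n in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))

pos = [2.0, 3.0, 1.5]   # scores of positive examples (toy data)
neg = [1.0, 2.5, 0.5]   # scores of negative examples

# A monotone transformation of the scores (log, on positive values)
# preserves the ranking, and hence the AUC.
assert auc_rank(pos, neg) == auc_rank([math.log(s) for s in pos],
                                      [math.log(s) for s in neg])
```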
Bayesian Detection
The Bayesian approach connects detection theory to Bayesian estimation by incorporating prior information and costs.
Bayesian Decision Rule
When prior probabilities $\pi_0 = P(H_0)$ and $\pi_1 = P(H_1)$ and costs $C_{ij}$ (cost of deciding $H_i$ when $H_j$ is true) are known, the Bayes-optimal detector minimizes the expected cost (Bayes risk). It is again a likelihood ratio test:

$$\text{decide } H_1 \iff \Lambda(x) > \frac{(C_{10} - C_{00})\,\pi_0}{(C_{01} - C_{11})\,\pi_1}$$

For equal error costs ($C_{10} = C_{01} = 1$, $C_{00} = C_{11} = 0$), this simplifies to the MAP rule: decide $H_1$ if $P(H_1 \mid x) > P(H_0 \mid x)$, i.e., $\Lambda(x) > \pi_0/\pi_1$.
The key difference between Neyman-Pearson and Bayesian approaches:
- Neyman-Pearson: fix the false alarm rate, maximize detection probability. No priors needed.
- Bayesian: minimize expected cost given priors and cost structure. Requires knowing (or estimating) priors and costs.
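The Bayes-optimal LRT threshold can be computed directly from priors and costs; a sketch assuming the cost convention $C_{ij}$ = cost of deciding $H_i$ when $H_j$ is true (function name illustrative):

```python
def bayes_threshold(pi0, pi1, c10=1.0, c01=1.0, c00=0.0, c11=0.0):
    """Bayes-optimal LRT threshold eta = (C10 - C00) * pi0 / ((C01 - C11) * pi1),
    where C_ij is the cost of deciding H_i when H_j is true."""
    return (c10 - c00) * pi0 / ((c01 - c11) * pi1)

# Equal priors, unit error costs: eta = 1 (maximum likelihood).
# A rare H1 (pi1 = 0.1) raises the bar for declaring a detection.
```

With $\pi_1 = 0.1$ and unit costs the threshold rises to 9: rare events require proportionally stronger evidence.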
Sequential Detection
Wald's Sequential Probability Ratio Test
Statement
The Sequential Probability Ratio Test (SPRT) observes samples one at a time and computes the cumulative log-likelihood ratio:

$$S_n = \sum_{i=1}^{n} \log \Lambda(x_i) = \sum_{i=1}^{n} \log \frac{p_1(x_i)}{p_0(x_i)}$$

The test stops and decides $H_1$ when $S_n \ge b$, decides $H_0$ when $S_n \le a$, and continues sampling otherwise. With thresholds $a \approx \log\frac{\beta}{1-\alpha}$ and $b \approx \log\frac{1-\beta}{\alpha}$ (Wald's approximations), the SPRT achieves error probabilities approximately $\alpha$ (false alarm) and $\beta$ (miss).

Among all sequential tests with error probabilities at most $\alpha$ and $\beta$, the SPRT minimizes the expected number of samples under both hypotheses.
Intuition
Instead of collecting a fixed number of samples and then deciding, the SPRT accumulates evidence and stops as soon as it has enough. Easy cases (strong evidence for one hypothesis) terminate quickly. Hard cases (ambiguous evidence) require more samples. On average, the SPRT needs fewer samples than any fixed-sample-size test with the same error guarantees.
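The accumulate-and-stop loop can be sketched in a few lines, using Wald's boundary approximations $a = \log\frac{\beta}{1-\alpha}$ and $b = \log\frac{1-\beta}{\alpha}$ (the function names and the Gaussian example are illustrative):

```python
import math

def sprt(next_sample, log_lr, alpha=0.05, beta=0.05, max_n=100_000):
    """Wald's SPRT: accumulate log-likelihood ratios until a boundary is hit.
    Returns (decision, samples_used); decision is 1 for H1, 0 for H0,
    or None if max_n is reached without crossing a boundary."""
    a = math.log(beta / (1 - alpha))   # lower boundary -> decide H0
    b = math.log((1 - beta) / alpha)   # upper boundary -> decide H1
    s = 0.0
    for n in range(1, max_n + 1):
        s += log_lr(next_sample())
        if s >= b:
            return 1, n
        if s <= a:
            return 0, n
    return None, max_n

# Unit-variance Gaussians with means 0 (H0) and 1 (H1):
# log Lambda(x) = x - 1/2. A stream of strong H1 evidence stops quickly.
decision, n = sprt(lambda: 1.0, lambda x: x - 0.5)
```

Each sample here adds 0.5 to the log-likelihood ratio, so the upper boundary $\log 19 \approx 2.94$ is crossed after six samples: strong evidence terminates fast, exactly as the intuition above describes.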
Proof Sketch
Wald's proof uses the optional stopping theorem for the likelihood ratio martingale. Under , the likelihood ratio has expectation 1 at any stopping time. The threshold structure ensures the error bounds hold. Optimality (minimizing expected sample size) follows from the fact that the SPRT boundaries are the tightest boundaries consistent with the error constraints.
Why It Matters
The SPRT is used in A/B testing (sequential experimentation), clinical trials (early stopping for efficacy or futility), quality control, and anomaly detection. In ML, it provides the theoretical basis for sequential model evaluation: stop testing as soon as you have enough evidence that one model is better than another.
Failure Mode
The SPRT is optimal for simple hypotheses. When hypotheses are composite (e.g., testing $\theta = \theta_0$ vs $\theta > \theta_0$ for unknown $\theta$), the simple SPRT does not directly apply, and you need generalized sequential tests.
Connection to ML Classification
The connection between detection theory and ML classification is direct:
| Detection Theory | ML Classification |
|---|---|
| $H_0$ vs $H_1$ | Negative vs Positive class |
| Likelihood ratio test | Optimal Bayes classifier |
| $P_{FA}$ (false alarm) | False positive rate (1 $-$ specificity) |
| $P_D$ (detection) | True positive rate (sensitivity/recall) |
| ROC curve | ROC curve (same thing) |
| Neyman-Pearson | Threshold selection at fixed FPR |
| Bayesian detection | Classification with class priors and costs |
Logistic regression directly estimates the log posterior odds, which equal the log-likelihood ratio shifted by the log prior odds:

$$\log \frac{P(y=1 \mid x)}{P(y=0 \mid x)} = \log \Lambda(x) + \log \frac{\pi_1}{\pi_0} = w^\top x + b$$

Neural network classifiers with softmax output learn approximations to the posterior $P(y \mid x)$, from which the likelihood ratio can be recovered.
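Recovering the likelihood ratio from a learned posterior is one line of Bayes' rule; a hypothetical helper to make the inversion explicit:

```python
def likelihood_ratio_from_posterior(p1_given_x, pi1):
    """Invert Bayes' rule: P(H1|x)/P(H0|x) = Lambda(x) * pi1/pi0, so
    Lambda(x) = [p/(1-p)] * [(1-pi1)/pi1] for p = P(H1|x)."""
    posterior_odds = p1_given_x / (1.0 - p1_given_x)
    prior_odds = pi1 / (1.0 - pi1)
    return posterior_odds / prior_odds

# With equal priors, a posterior of 0.75 corresponds to Lambda(x) = 3:
# the observation is three times more likely under H1 than under H0.
```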
Common Confusions
ROC vs precision-recall curves
ROC curves plot true positive rate vs false positive rate. Precision-recall curves plot precision vs recall. ROC curves are preferred when classes are balanced; precision-recall curves are more informative when the positive class is rare (because precision is sensitive to class imbalance while the false positive rate is not).
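The imbalance effect is easy to quantify: precision depends on prevalence while the ROC operating point does not. A sketch (toy numbers, illustrative function name):

```python
def precision(tpr, fpr, prevalence):
    """Precision = TP / (TP + FP), expressed via the ROC operating point
    (tpr, fpr) and the fraction of positives in the population."""
    tp = tpr * prevalence
    fp = fpr * (1.0 - prevalence)
    return tp / (tp + fp)

# The same ROC point (TPR = 0.9, FPR = 0.1) yields very different precision:
# balanced classes (prevalence 0.5)  -> precision 0.90
# rare positives (prevalence 0.01)   -> precision ~0.08
```

This is why a seemingly excellent ROC curve can hide a classifier whose positive predictions are mostly false alarms when the positive class is rare.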
The Neyman-Pearson lemma does not tell you which alpha to use
The lemma says: for any fixed $\alpha$, the LRT is optimal. But choosing $\alpha$ is a separate decision that depends on the application. In medical screening, you want low $P_M$ (do not miss disease), so you accept higher $P_{FA}$. In criminal justice, you want low $P_{FA}$ (do not convict the innocent), so you accept higher $P_M$.
Summary
- Binary classification is binary hypothesis testing
- The likelihood ratio test is the optimal detector (Neyman-Pearson lemma)
- ROC curves arise from varying the LRT threshold
- AUC = probability of ranking a positive higher than a negative
- Bayesian detection incorporates priors and costs into the threshold
- The SPRT is the optimal sequential test: stop as soon as you have enough evidence
- Every ML classifier is trying to approximate the likelihood ratio
Exercises
Problem
Two classes have Gaussian distributions: $p_0 = \mathcal{N}(\mu_0, \sigma^2)$ and $p_1 = \mathcal{N}(\mu_1, \sigma^2)$ with $\mu_1 > \mu_0$. Derive the likelihood ratio test and find the threshold that achieves a given false alarm probability $\alpha$.
Problem
Prove that $\text{AUC} = P(S_1 > S_0)$, where $S_1$ and $S_0$ are independent scores drawn from the positive-class and negative-class score distributions, respectively.
Problem
In an A/B test, you want to determine whether a new model has higher accuracy than the baseline. How would you apply the SPRT to enable early stopping? What are the practical considerations?
References
Canonical:
- Kay, Fundamentals of Statistical Signal Processing, Vol. II: Detection Theory (1998)
- Van Trees, Detection, Estimation, and Modulation Theory, Part I (2001)
Current:
- Poor, An Introduction to Signal Detection and Estimation (1994)
- Fawcett, "An Introduction to ROC Analysis" (2006)
- Casella & Berger, Statistical Inference (2002), Chapters 5-10
- Lehmann & Casella, Theory of Point Estimation (1998), Chapters 1-6
Next Topics
Detection theory connects to:
- Calibration and uncertainty: ensuring classifier probabilities match empirical frequencies
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Hypothesis Testing for ML (Layer 2)
- Bayesian Estimation (Layer 0B)
- Maximum Likelihood Estimation (Layer 0B)
- Common Probability Distributions (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
- Differentiation in R^n (Layer 0A)