Statistical Foundations
Detection Theory
Binary hypothesis testing, the Neyman-Pearson lemma (likelihood ratio tests are most powerful), ROC curves, Bayesian detection, and sequential testing. Classification IS detection theory. ROC/AUC comes directly from here.
Why This Matters
Every binary classifier is a detector. When your spam filter decides whether an email is spam or not-spam, it is performing binary hypothesis testing. When you plot an ROC curve and compute AUC, you are using the exact framework developed by Neyman, Pearson, and Wald in the 1930s-40s.
Detection theory gives you the optimal answer to the classification question: given a probabilistic model of the data (see common probability distributions), what is the best possible classifier, and how do you build it? The answer is the likelihood ratio test, and the Neyman-Pearson lemma proves it is optimal. Everything in ML classification, from logistic regression to neural networks, is trying to approximate this optimal detector.
Mental Model
You observe data and must decide between two hypotheses: $H_0$ (nothing interesting is happening) and $H_1$ (something is happening). You will make errors either way: false alarms (saying $H_1$ when $H_0$ is true) and misses (saying $H_0$ when $H_1$ is true). Detection theory asks: what is the best tradeoff between these two types of errors?
The answer is remarkably clean: compute the likelihood ratio and compare it to a threshold. Every optimal detector has this form.
Formal Setup and Notation
Binary Hypothesis Testing
Given an observation $x$, decide between:

$$H_0: x \sim p_0(x) \qquad \text{vs} \qquad H_1: x \sim p_1(x)$$

where $p_0$ and $p_1$ are known probability densities (or mass functions).
A decision rule (detector) $\delta$ maps observations to decisions, $\delta(x) \in \{0, 1\}$. $\delta(x) = 1$ means "decide $H_1$."
Error Probabilities
Probability of false alarm (Type I error, false positive rate): $P_{FA} = P(\delta(x) = 1 \mid H_0)$

Probability of miss (Type II error, false negative rate): $P_M = P(\delta(x) = 0 \mid H_1)$

Probability of detection (power, true positive rate, sensitivity): $P_D = 1 - P_M = P(\delta(x) = 1 \mid H_1)$
Likelihood Ratio
The likelihood ratio for observation $x$ is:

$$\Lambda(x) = \frac{p_1(x)}{p_0(x)}$$

A likelihood ratio test (LRT) with threshold $\eta$ decides $H_1$ if $\Lambda(x) > \eta$ and $H_0$ if $\Lambda(x) < \eta$.
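A minimal numeric sketch of an LRT, assuming two unit-variance Gaussian classes with means 0 and 1 (the function names here are illustrative, not from any library):

```python
import math

def gaussian_pdf(x, mu, sigma=1.0):
    """Density of N(mu, sigma^2) evaluated at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

def likelihood_ratio(x, mu0=0.0, mu1=1.0, sigma=1.0):
    """Lambda(x) = p1(x) / p0(x)."""
    return gaussian_pdf(x, mu1, sigma) / gaussian_pdf(x, mu0, sigma)

def lrt_decide(x, eta=1.0):
    """Decide H1 (return 1) iff Lambda(x) > eta, else H0 (return 0)."""
    return 1 if likelihood_ratio(x) > eta else 0

# With eta = 1 and equal variances, the decision boundary is the
# midpoint x = 0.5 between the two means.
```

With $\eta = 1$ the rule reduces to picking the nearer mean; raising $\eta$ trades away detections for fewer false alarms.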
Main Theorems
Neyman-Pearson Lemma
Statement
For testing $H_0$ vs $H_1$, among all decision rules with false alarm probability $P_{FA} \le \alpha$, the likelihood ratio test

$$\delta(x) = 1 \iff \Lambda(x) > \eta,$$

where $\eta$ is chosen so that $P_{FA} = \alpha$, achieves the maximum probability of detection $P_D$. No other test with the same false alarm constraint can have higher power.
Intuition
The likelihood ratio measures how much more likely the observation is under $H_1$ than under $H_0$. It is natural to decide $H_1$ when this ratio is large. The Neyman-Pearson lemma says this intuition is not just natural but provably optimal. You cannot do better.
Proof Sketch
Let $\delta$ be the LRT with threshold $\eta$ and $\delta'$ be any other test with $P_{FA}(\delta') \le P_{FA}(\delta) = \alpha$. Consider the difference in detection probabilities:

$$P_D(\delta) - P_D(\delta') = \int \left[\delta(x) - \delta'(x)\right] p_1(x)\, dx$$

In the region where $\delta(x) = 1$ but $\delta'(x) = 0$, we have $\Lambda(x) > \eta$, so $p_1(x) > \eta\, p_0(x)$. In the region where $\delta(x) = 0$ but $\delta'(x) = 1$, we have $\Lambda(x) \le \eta$, so $p_1(x) \le \eta\, p_0(x)$. Using the constraint that both tests satisfy $P_{FA} \le \alpha$, the positive contribution dominates, giving

$$P_D(\delta) - P_D(\delta') \ge \eta\left[P_{FA}(\delta) - P_{FA}(\delta')\right] \ge 0.$$
Why It Matters
The Neyman-Pearson lemma is the theoretical foundation of all classification metrics. It tells us that the ROC curve of the LRT dominates the ROC curve of every other test. In ML terms: if you know the true class-conditional densities, the optimal classifier is the LRT, and every ML model is trying to learn an approximation of it.
Failure Mode
The lemma assumes you know $p_0$ and $p_1$ exactly. In ML, you never know the true distributions. You learn an approximation from data. The gap between the optimal LRT and a learned classifier is the price of not knowing the true distributions.
ROC Curves
Receiver Operating Characteristic (ROC) Curve
The ROC curve plots $P_D$ (true positive rate) against $P_{FA}$ (false positive rate) as the threshold $\eta$ varies from $0$ to $\infty$.
The area under the ROC curve (AUC) equals the probability that a randomly chosen positive example has a higher score than a randomly chosen negative example:

$$\text{AUC} = P(S_1 > S_0)$$

where $S_1$ is the score of a random positive example and $S_0$ is the score of a random negative example.
Properties of ROC curves:
- A random classifier gives the diagonal line from $(0,0)$ to $(1,1)$
- A perfect classifier gives the point $(0,1)$
- The Neyman-Pearson lemma guarantees the LRT produces the highest ROC curve
- ROC curves are invariant to monotone transformations of the score (you can use $\log \Lambda(x)$ instead of $\Lambda(x)$)
- AUC $= 0.5$ for random guessing, AUC $= 1$ for perfect classification
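The rank interpretation of AUC and its invariance to monotone transformations can both be checked numerically; this is an illustrative sketch with toy scores, not a library API:

```python
import math

def auc_rank(pos_scores, neg_scores):
    """Estimate AUC = P(score of random positive > score of random negative),
    counting ties as 1/2, by brute-force pairwise comparison."""
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos_scores for n in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))

pos = [2.0, 3.0, 1.5]   # scores of positive examples (toy data)
neg = [1.0, 2.5, 0.5]   # scores of negative examples

# A monotone transformation of the scores (log, on positive values)
# preserves the ranking, and hence the AUC.
assert auc_rank(pos, neg) == auc_rank([math.log(s) for s in pos],
                                      [math.log(s) for s in neg])
```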
Bayesian Detection
The Bayesian approach connects detection theory to Bayesian estimation by incorporating prior information and costs.
Bayesian Decision Rule
When prior probabilities $\pi_0 = P(H_0)$ and $\pi_1 = P(H_1)$ and costs $C_{ij}$ (cost of deciding $H_i$ when $H_j$ is true) are known, the Bayes-optimal detector minimizes the expected cost (Bayes risk). It is again a likelihood ratio test:

$$\text{decide } H_1 \iff \Lambda(x) > \frac{(C_{10} - C_{00})\,\pi_0}{(C_{01} - C_{11})\,\pi_1}$$

For equal error costs ($C_{10} = C_{01} = 1$, $C_{00} = C_{11} = 0$), this simplifies to the MAP rule: decide $H_1$ if $P(H_1 \mid x) > P(H_0 \mid x)$, i.e., $\Lambda(x) > \pi_0/\pi_1$.
The key difference between Neyman-Pearson and Bayesian approaches:
- Neyman-Pearson: fix the false alarm rate, maximize detection probability. No priors needed.
- Bayesian: minimize expected cost given priors and cost structure. Requires knowing (or estimating) priors and costs.
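The Bayes-optimal LRT threshold can be computed directly from priors and costs; a sketch assuming the cost convention $C_{ij}$ = cost of deciding $H_i$ when $H_j$ is true (function name illustrative):

```python
def bayes_threshold(pi0, pi1, c10=1.0, c01=1.0, c00=0.0, c11=0.0):
    """Bayes-optimal LRT threshold eta = (C10 - C00) * pi0 / ((C01 - C11) * pi1),
    where C_ij is the cost of deciding H_i when H_j is true."""
    return (c10 - c00) * pi0 / ((c01 - c11) * pi1)

# Equal priors, unit error costs: eta = 1 (maximum likelihood).
# A rare H1 (pi1 = 0.1) raises the bar for declaring a detection.
```

With $\pi_1 = 0.1$ and unit costs the threshold rises to 9: rare events require proportionally stronger evidence.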
Sequential Detection
Wald's Sequential Probability Ratio Test
Statement
The Sequential Probability Ratio Test (SPRT) observes samples one at a time and computes the cumulative log-likelihood ratio:

$$S_n = \sum_{i=1}^{n} \log \Lambda(x_i) = \sum_{i=1}^{n} \log \frac{p_1(x_i)}{p_0(x_i)}$$

The test stops and decides $H_1$ when $S_n \ge b$, decides $H_0$ when $S_n \le a$, and continues sampling otherwise. With thresholds $a \approx \log\frac{\beta}{1-\alpha}$ and $b \approx \log\frac{1-\beta}{\alpha}$ (Wald's approximations), the SPRT achieves error probabilities approximately $\alpha$ (false alarm) and $\beta$ (miss).

Among all sequential tests with error probabilities at most $\alpha$ and $\beta$, the SPRT minimizes the expected number of samples under both hypotheses.
Intuition
Instead of collecting a fixed number of samples and then deciding, the SPRT accumulates evidence and stops as soon as it has enough. Easy cases (strong evidence for one hypothesis) terminate quickly. Hard cases (ambiguous evidence) require more samples. On average, the SPRT needs fewer samples than any fixed-sample-size test with the same error guarantees.
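The accumulate-and-stop loop can be sketched in a few lines, using Wald's boundary approximations $a = \log\frac{\beta}{1-\alpha}$ and $b = \log\frac{1-\beta}{\alpha}$ (the function names and the Gaussian example are illustrative):

```python
import math

def sprt(next_sample, log_lr, alpha=0.05, beta=0.05, max_n=100_000):
    """Wald's SPRT: accumulate log-likelihood ratios until a boundary is hit.
    Returns (decision, samples_used); decision is 1 for H1, 0 for H0,
    or None if max_n is reached without crossing a boundary."""
    a = math.log(beta / (1 - alpha))   # lower boundary -> decide H0
    b = math.log((1 - beta) / alpha)   # upper boundary -> decide H1
    s = 0.0
    for n in range(1, max_n + 1):
        s += log_lr(next_sample())
        if s >= b:
            return 1, n
        if s <= a:
            return 0, n
    return None, max_n

# Unit-variance Gaussians with means 0 (H0) and 1 (H1):
# log Lambda(x) = x - 1/2. A stream of strong H1 evidence stops quickly.
decision, n = sprt(lambda: 1.0, lambda x: x - 0.5)
```

Each sample here adds 0.5 to the log-likelihood ratio, so the upper boundary $\log 19 \approx 2.94$ is crossed after six samples: strong evidence terminates fast, exactly as the intuition above describes.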
Proof Sketch
Wald's proof uses the optional stopping theorem for the likelihood ratio martingale. Under , the likelihood ratio has expectation 1 at any stopping time. The threshold structure ensures the error bounds hold. Optimality (minimizing expected sample size) follows from the fact that the SPRT boundaries are the tightest boundaries consistent with the error constraints.
Why It Matters
The SPRT is used in A/B testing (sequential experimentation), clinical trials (early stopping for efficacy or futility), quality control, and anomaly detection. In ML, it provides the theoretical basis for sequential model evaluation: stop testing as soon as you have enough evidence that one model is better than another.
Failure Mode
The SPRT is optimal for simple hypotheses. When hypotheses are composite (e.g., testing $\theta = \theta_0$ vs $\theta > \theta_0$ for unknown $\theta$), the simple SPRT does not directly apply, and you need generalized sequential tests.
Connection to ML Classification
The connection between detection theory and ML classification is direct:
| Detection Theory | ML Classification |
|---|---|
| $H_0$ vs $H_1$ | Negative vs Positive class |
| Likelihood ratio test | Optimal Bayes classifier |
| $P_{FA}$ (false alarm) | False positive rate (1 $-$ specificity) |
| $P_D$ (detection) | True positive rate (sensitivity/recall) |
| ROC curve | ROC curve (same thing) |
| Neyman-Pearson | Threshold selection at fixed FPR |
| Bayesian detection | Classification with class priors and costs |
Logistic regression directly estimates the log posterior odds, which equal the log-likelihood ratio shifted by the log prior odds:

$$\log \frac{P(y=1 \mid x)}{P(y=0 \mid x)} = \log \Lambda(x) + \log \frac{\pi_1}{\pi_0} = w^\top x + b$$

Neural network classifiers with softmax output learn approximations to the posterior $P(y \mid x)$, from which the likelihood ratio can be recovered.
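Recovering the likelihood ratio from a learned posterior is one line of Bayes' rule; a hypothetical helper to make the inversion explicit:

```python
def likelihood_ratio_from_posterior(p1_given_x, pi1):
    """Invert Bayes' rule: P(H1|x)/P(H0|x) = Lambda(x) * pi1/pi0, so
    Lambda(x) = [p/(1-p)] * [(1-pi1)/pi1] for p = P(H1|x)."""
    posterior_odds = p1_given_x / (1.0 - p1_given_x)
    prior_odds = pi1 / (1.0 - pi1)
    return posterior_odds / prior_odds

# With equal priors, a posterior of 0.75 corresponds to Lambda(x) = 3:
# the observation is three times more likely under H1 than under H0.
```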
Common Confusions
ROC vs precision-recall curves
ROC curves plot true positive rate vs false positive rate. Precision-recall curves plot precision vs recall. ROC curves are preferred when classes are balanced; precision-recall curves are more informative when the positive class is rare (because precision is sensitive to class imbalance while the false positive rate is not).
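The imbalance effect is easy to quantify: precision depends on prevalence while the ROC operating point does not. A sketch (toy numbers, illustrative function name):

```python
def precision(tpr, fpr, prevalence):
    """Precision = TP / (TP + FP), expressed via the ROC operating point
    (tpr, fpr) and the fraction of positives in the population."""
    tp = tpr * prevalence
    fp = fpr * (1.0 - prevalence)
    return tp / (tp + fp)

# The same ROC point (TPR = 0.9, FPR = 0.1) yields very different precision:
# balanced classes (prevalence 0.5)  -> precision 0.90
# rare positives (prevalence 0.01)   -> precision ~0.08
```

This is why a seemingly excellent ROC curve can hide a classifier whose positive predictions are mostly false alarms when the positive class is rare.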
The Neyman-Pearson lemma does not tell you which alpha to use
The lemma says: for any fixed $\alpha$, the LRT is optimal. But choosing $\alpha$ is a separate decision that depends on the application. In medical screening, you want low $P_M$ (do not miss disease), so you accept higher $P_{FA}$. In criminal justice, you want low $P_{FA}$ (do not convict the innocent), so you accept higher $P_M$.
Summary
- Binary classification is binary hypothesis testing
- The likelihood ratio test is the optimal detector (Neyman-Pearson lemma)
- ROC curves arise from varying the LRT threshold
- AUC = probability of ranking a positive higher than a negative
- Bayesian detection incorporates priors and costs into the threshold
- The SPRT is the optimal sequential test: stop as soon as you have enough evidence
- Every ML classifier is trying to approximate the likelihood ratio
Exercises
Problem
Two classes have Gaussian distributions: $p_0 = \mathcal{N}(\mu_0, \sigma^2)$ and $p_1 = \mathcal{N}(\mu_1, \sigma^2)$ with $\mu_1 > \mu_0$. Derive the likelihood ratio test and find the threshold that achieves a given false alarm probability $\alpha$.
Problem
Prove that $\text{AUC} = P(S_1 > S_0)$, where $S_1$ and $S_0$ are independent scores drawn from the positive-class and negative-class score distributions, respectively.
Problem
In an A/B test, you want to determine whether a new model has higher accuracy than the baseline. How would you apply the SPRT to enable early stopping? What are the practical considerations?
References
Canonical:
- Kay, Fundamentals of Statistical Signal Processing, Vol. II: Detection Theory (1998)
- Van Trees, Detection, Estimation, and Modulation Theory, Part I (2001)
Current:
- Poor, An Introduction to Signal Detection and Estimation (1994)
- Fawcett, "An Introduction to ROC Analysis" (2006)
- Casella & Berger, Statistical Inference (2002), Chapters 5-10
- Lehmann & Casella, Theory of Point Estimation (1998), Chapters 1-6
Next Topics
Detection theory connects to:
- Calibration and uncertainty: ensuring classifier probabilities match empirical frequencies
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Hypothesis Testing for ML (Layer 2)
- Bayesian Estimation (Layer 0B)
- Maximum Likelihood Estimation (Layer 0B)
- Common Probability Distributions (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
- Differentiation in R^n (Layer 0A)