Applied Math
Signal Detection Theory
The mathematical framework for binary decisions under noise. ROC curves, d-prime, likelihood ratios, the Neyman-Pearson lemma connection, and why SDT is the foundation of both psychophysics and ML classification evaluation.
Why This Matters
Every binary classifier faces the same problem: given a noisy observation, decide whether it came from the "signal" class or the "noise" class. Signal detection theory (SDT) provides the complete mathematical framework for this decision. ROC curves, AUC, sensitivity ($d'$), and the likelihood ratio test all originate here. SDT was developed in the 1950s as part of detection theory for radar operators deciding whether a blip on the screen was an enemy aircraft or noise. The same mathematics now governs medical diagnosis, spam filtering, and every ML classification metric in confusion matrices. Understanding SDT clarifies why the ROC curve works, what AUC actually measures, and when precision-recall is preferable.
The Basic Model
An observer receives a scalar observation $x$ drawn from one of two distributions:
- Noise alone ($N$, signal absent): $x \sim f_0(x)$
- Signal plus noise ($S$, signal present): $x \sim f_1(x)$
The observer sets a criterion $c$ and responds "signal present" if $x > c$, "signal absent" otherwise.
The Four Outcomes
Given a binary decision and ground truth, there are four possible outcomes:
| | Signal present ($S$) | Signal absent ($N$) |
|---|---|---|
| Respond "yes" | Hit (true positive) | False alarm (false positive) |
| Respond "no" | Miss (false negative) | Correct rejection (true negative) |
The hit rate is $H = P(x > c \mid S)$. The false alarm rate is $F = P(x > c \mid N)$. These are the fundamental operating characteristics.
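The four outcomes and the two rates can be tallied directly from a simulation. A minimal sketch, assuming equal priors, $\mu_N = 0$, $\mu_S = 1$, $\sigma = 1$, and a criterion of $c = 0.5$ (all illustrative choices, not from the text):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50_000
# Ground truth: signal present on roughly half the trials (assumed equal priors).
present = rng.random(n) < 0.5
# Observation x: drawn from N(1, 1) when signal is present, N(0, 1) otherwise.
x = np.where(present, rng.normal(1.0, 1.0, n), rng.normal(0.0, 1.0, n))
say_yes = x > 0.5                     # decision rule: respond "yes" iff x > c

hits = np.sum(say_yes & present)
misses = np.sum(~say_yes & present)
false_alarms = np.sum(say_yes & ~present)
correct_rejections = np.sum(~say_yes & ~present)

hit_rate = hits / (hits + misses)                              # H = P("yes" | S)
fa_rate = false_alarms / (false_alarms + correct_rejections)   # F = P("yes" | N)
```

Note that $H$ and $F$ condition on the ground truth (row sums of the table's columns), which is what makes them comparable across datasets with different base rates.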
Sensitivity: $d'$ (d-prime)
d-prime
When both distributions are Gaussian with equal variance $\sigma^2$:
$$x \mid N \sim \mathcal{N}(\mu_N, \sigma^2), \qquad x \mid S \sim \mathcal{N}(\mu_S, \sigma^2),$$
the sensitivity is the standardized distance between the means:
$$d' = \frac{\mu_S - \mu_N}{\sigma}.$$
$d' = 0$ means the signal and noise distributions are identical (chance performance). Larger $d'$ means the signal is more discriminable from noise. $d'$ is independent of the criterion $c$, measuring the observer's ability to discriminate regardless of bias.
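In practice $d'$ is estimated from the observed rates as $d' = z(H) - z(F)$, where $z = \Phi^{-1}$. A minimal sketch, with the hit and false alarm rates ($H = 0.84$, $F = 0.30$) as assumed example measurements:

```python
from scipy.stats import norm

# Assumed example: hit and false alarm rates measured at some criterion.
H, F = 0.84, 0.30

z = norm.ppf                  # inverse standard normal CDF, Phi^{-1}
d_prime = z(H) - z(F)         # standardized separation of the two distributions

# Chance performance (H == F) gives d' = 0 regardless of where the criterion sits.
d_chance = z(0.6) - z(0.6)
```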
Criterion and Bias
The criterion $c$ is the threshold on the observation axis. The observer responds "signal" when $x > c$. The criterion determines the tradeoff between hits and false alarms:
- Liberal criterion (low $c$): high hit rate, high false alarm rate
- Conservative criterion (high $c$): low hit rate, low false alarm rate
The bias $\beta$ is defined as the likelihood ratio at the criterion:
$$\beta = \frac{f_1(c)}{f_0(c)}.$$
An unbiased observer sets $\beta = 1$ (criterion at the intersection of the two distributions). $\beta > 1$ is conservative, $\beta < 1$ is liberal.
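The bias can be evaluated numerically for any criterion placement. A minimal sketch, assuming example parameters $\mu_N = 0$, $\mu_S = 1.5$, $\sigma = 1$ (not from the text):

```python
from scipy.stats import norm

mu_N, mu_S, sigma = 0.0, 1.5, 1.0       # assumed example parameters

def beta(c):
    # Bias: likelihood ratio f1(c) / f0(c) evaluated at the criterion.
    return norm.pdf(c, mu_S, sigma) / norm.pdf(c, mu_N, sigma)

b_mid = beta((mu_N + mu_S) / 2)          # = 1: unbiased (densities intersect here)
b_conservative = beta(2.5)               # > 1: criterion pushed above the midpoint
b_liberal = beta(0.0)                    # < 1: criterion pulled below the midpoint
```

For equal-variance Gaussians the two densities cross exactly at the midpoint of the means, which is why $\beta = 1$ lands there.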
The Likelihood Ratio
Likelihood Ratio
The likelihood ratio for observation $x$ is:
$$\Lambda(x) = \frac{f_1(x)}{f_0(x)}.$$
This is the ratio of the probability of observing $x$ under signal-present versus signal-absent. The likelihood ratio is a sufficient statistic for the binary decision: all information relevant to the decision is captured by $\Lambda(x)$.
For the equal-variance Gaussian case:
$$\log \Lambda(x) = \frac{\mu_S - \mu_N}{\sigma^2}\, x - \frac{\mu_S^2 - \mu_N^2}{2\sigma^2}.$$
Since $\Lambda(x)$ is a monotone increasing function of $x$ in this case, thresholding $x$ is equivalent to thresholding $\Lambda(x)$. In general (non-Gaussian, unequal variance), the optimal decision rule thresholds $\Lambda(x)$, not $x$ directly.
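The linearity (and hence monotonicity) of the Gaussian log-likelihood ratio can be checked numerically. A minimal sketch, with $\mu_N = 0$, $\mu_S = 1.5$, $\sigma = 1$ as assumed example parameters:

```python
import numpy as np
from scipy.stats import norm

mu_N, mu_S, sigma = 0.0, 1.5, 1.0       # assumed example parameters

xs = np.linspace(-3.0, 5.0, 9)
# log Lambda(x) computed directly from the two class-conditional densities.
log_lr = norm.logpdf(xs, mu_S, sigma) - norm.logpdf(xs, mu_N, sigma)

# Closed form: log Lambda(x) is linear in x, so thresholding x = thresholding Lambda.
slope = (mu_S - mu_N) / sigma**2
intercept = -(mu_S**2 - mu_N**2) / (2 * sigma**2)
```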
The Neyman-Pearson Lemma
Neyman-Pearson Lemma
Statement
Among all decision rules with false alarm rate at most $\alpha$, the rule that maximizes the hit rate (power) is the likelihood ratio test: respond "signal" if
$$\Lambda(x) = \frac{f_1(x)}{f_0(x)} > \eta,$$
where $\eta$ is chosen so that $P(\Lambda(x) > \eta \mid N) = \alpha$.
No other test with the same false alarm rate can achieve a higher hit rate.
Intuition
The likelihood ratio ranks observations by how much more likely they are under $S$ than under $N$. If you can only afford a false alarm rate of $\alpha$, you should spend your "budget" on the observations most indicative of $S$. The likelihood ratio test does exactly this: it rejects the noise hypothesis precisely for the observations with the strongest evidence for $S$.
Proof Sketch
Let $\phi^*$ be the likelihood ratio test with threshold $\eta$ achieving false alarm rate exactly $\alpha$, and let $\phi$ be any other test with false alarm rate at most $\alpha$. Consider the difference in power:
$$\int \big(\phi^*(x) - \phi(x)\big)\, f_1(x)\, dx.$$
On the region where $\phi^*(x) = 1$ (i.e., $f_1(x) > \eta f_0(x)$), we have $\phi^*(x) - \phi(x) \ge 0$, so $\big(\phi^*(x) - \phi(x)\big)\big(f_1(x) - \eta f_0(x)\big) \ge 0$. On the region where $\phi^*(x) = 0$, we have $f_1(x) \le \eta f_0(x)$ and $\phi^*(x) - \phi(x) \le 0$, so again $\big(\phi^*(x) - \phi(x)\big)\big(f_1(x) - \eta f_0(x)\big) \ge 0$. Integrating:
$$\int \big(\phi^* - \phi\big)\, f_1\, dx \;\ge\; \eta \int \big(\phi^* - \phi\big)\, f_0\, dx \;\ge\; 0.$$
The last inequality holds because $\phi$ has false alarm rate at most $\alpha$, so $\int \phi\, f_0\, dx \le \alpha = \int \phi^*\, f_0\, dx$. Therefore the power of $\phi^*$ is at least the power of $\phi$.
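The lemma's claim can be illustrated numerically by pitting the likelihood ratio test against a different rule with the same false alarm rate. A minimal sketch, assuming an equal-variance Gaussian example with $\mu_N = 0$, $\mu_S = 1.5$, $\sigma = 1$, and $\alpha = 0.10$ (all illustrative choices); the competing rule responds "signal" when $|x|$ is large, which wastes part of the false-alarm budget on the lower tail:

```python
from scipy.stats import norm

mu_N, mu_S, sigma = 0.0, 1.5, 1.0    # assumed equal-variance Gaussian example
alpha = 0.10                          # false alarm budget

# LRT: the likelihood ratio is monotone in x here, so it is a one-sided threshold.
c_lrt = norm.ppf(1 - alpha, loc=mu_N, scale=sigma)    # P(X_N > c_lrt) = alpha
power_lrt = 1 - norm.cdf(c_lrt, loc=mu_S, scale=sigma)

# Competing rule with the same false alarm rate: respond "signal" when |x| > c2.
c2 = norm.ppf(1 - alpha / 2, loc=mu_N, scale=sigma)   # each tail gets alpha/2
fa_other = (1 - norm.cdf(c2, mu_N, sigma)) + norm.cdf(-c2, mu_N, sigma)
power_other = (1 - norm.cdf(c2, mu_S, sigma)) + norm.cdf(-c2, mu_S, sigma)
```

Both rules spend the same false alarm rate, but the one-sided LRT achieves strictly higher power, as the lemma guarantees.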
Why It Matters
The Neyman-Pearson lemma is the theoretical foundation of ROC analysis. Each point on the ROC curve corresponds to a specific threshold on the likelihood ratio. The ROC curve traces out the optimal tradeoff between hit rate and false alarm rate. Any decision rule that does not use the likelihood ratio is suboptimal: it achieves a point below the ROC curve. In ML, a classifier's predicted probability (when well-calibrated) approximates the likelihood ratio, and thresholding it traces the ROC curve.
Failure Mode
The lemma requires known, fully specified distributions and (simple hypotheses). When the distributions have unknown parameters (composite hypotheses), the likelihood ratio test is no longer uniformly most powerful. In ML, the true class-conditional distributions are unknown, so ROC curves are estimated empirically. The lemma guarantees optimality in the idealized setting; in practice, the quality of the ROC depends on how well the classifier's scores approximate the true likelihood ratio.
ROC Curves
Receiver Operating Characteristic (ROC) Curve
The ROC curve plots the hit rate $H$ (true positive rate) against the false alarm rate $F$ (false positive rate) as the criterion $c$ varies from $+\infty$ to $-\infty$:
- $x$-axis: false alarm rate $F$
- $y$-axis: hit rate $H$
A perfect discriminator has an ROC curve passing through $(0, 1)$ (zero false alarms, 100% hit rate). A random guesser lies on the diagonal from $(0, 0)$ to $(1, 1)$.
For the equal-variance Gaussian model, the ROC curve has a closed-form parameterization. Let $\Phi$ denote the standard normal CDF and $\Phi^{-1}$ its inverse. If the false alarm rate is $F$, then:
$$H = \Phi\!\big(d' + \Phi^{-1}(F)\big).$$
The ROC curve bows toward the upper-left corner. The larger $d'$, the more the curve bows, indicating better discrimination.
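Evaluating the binormal parameterization on a grid and integrating numerically also recovers the Gaussian AUC formula $\Phi(d'/\sqrt{2})$ stated below. A minimal sketch, assuming $d' = 1$ as an example value:

```python
import numpy as np
from scipy.stats import norm

d_prime = 1.0                                 # assumed sensitivity for the example

# Sweep the false alarm rate and apply H = Phi(d' + Phi^{-1}(F)).
F = np.linspace(1e-6, 1 - 1e-6, 10_001)
H = norm.cdf(d_prime + norm.ppf(F))

# Trapezoidal area under the curve vs. the closed form Phi(d' / sqrt(2)).
auc_numeric = np.sum((H[1:] + H[:-1]) / 2 * np.diff(F))
auc_closed = norm.cdf(d_prime / np.sqrt(2))
```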
Area Under the ROC Curve (AUC)
The AUC is the integral of the ROC curve over $F \in [0, 1]$:
$$\mathrm{AUC} = \int_0^1 H(F)\, dF.$$
AUC has a probabilistic interpretation: it equals the probability that a randomly chosen signal observation scores higher than a randomly chosen noise observation:
$$\mathrm{AUC} = P(X_S > X_N).$$
For the equal-variance Gaussian model: $\mathrm{AUC} = \Phi\!\big(d'/\sqrt{2}\big)$.
AUC as Concordance Probability
Statement
Let $X_S \sim f_1$ and $X_N \sim f_0$ be independent draws from the signal and noise distributions respectively. Then:
$$\mathrm{AUC} = P(X_S > X_N).$$
This equals the Wilcoxon-Mann-Whitney statistic normalized to $[0, 1]$.
Intuition
AUC measures how well the scoring function separates the two classes. If every signal observation scores higher than every noise observation, $\mathrm{AUC} = 1$. If scores are completely random with respect to class, $\mathrm{AUC} = 0.5$. AUC is the probability that the classifier correctly ranks a random signal-noise pair.
Proof Sketch
Write $\mathrm{AUC} = \int_0^1 H\, dF$. By change of variables with $F(c) = P(X_N > c)$ (so $dF = -f_0(c)\, dc$):
$$\mathrm{AUC} = \int_{-\infty}^{\infty} P(X_S > c)\, f_0(c)\, dc = P(X_S > X_N).$$
The equivalence to the Mann-Whitney statistic follows from the fact that the fraction of concordant signal-noise pairs is the empirical estimate of $P(X_S > X_N)$ computed over all pairs.
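The pairwise (Mann-Whitney) form of the AUC is a one-liner over all signal-noise pairs. A minimal sketch with assumed toy scores, purely for illustration:

```python
import numpy as np

def auc_concordance(pos_scores, neg_scores):
    # AUC as the fraction of concordant (positive, negative) pairs; ties count 1/2.
    pos = np.asarray(pos_scores, dtype=float)[:, None]
    neg = np.asarray(neg_scores, dtype=float)[None, :]
    return float(np.mean((pos > neg) + 0.5 * (pos == neg)))

# Assumed toy scores: 20 of the 25 pairs are concordant, so AUC = 0.8.
auc = auc_concordance([0.9, 0.8, 0.7, 0.55, 0.3],
                      [0.6, 0.5, 0.4, 0.35, 0.2])
```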
Why It Matters
This interpretation explains why AUC is threshold-independent: it averages over all possible operating points. In ML, AUC is the standard metric when the cost of false positives versus false negatives is unknown or when different deployment scenarios require different thresholds. AUC of 0.5 means the model is no better than random; AUC of 1.0 means perfect separation.
Failure Mode
AUC weights all thresholds equally, including regions of very high false alarm rate that are irrelevant in practice. If you care only about performance at low false positive rates (common in medical screening, fraud detection), the partial AUC over the relevant FPR range is more informative. Precision-recall curves are preferred when the negative class vastly outnumbers the positive class, because AUC can be misleadingly high when most predictions are correct simply by predicting the majority class.
Origins: Psychophysics
SDT was formalized by Green and Swets (1966) for psychophysics experiments. A classic example: an observer listens to intervals of noise and must decide whether a faint tone was present. The observer's internal representation is a noisy scalar, and measures perceptual sensitivity independent of the observer's willingness to say "yes." Before SDT, psychophysics conflated sensitivity with bias. Two observers with the same perceptual ability but different response biases (one cautious, one trigger-happy) would appear to have different detection thresholds. SDT separated these two factors cleanly.
Connection to ML Classification
Modern ML evaluation is a direct descendant of SDT:
- A classifier's predicted probability or score plays the role of the observation
- The ROC curve is constructed by sweeping the classification threshold
- AUC measures overall discrimination ability, analogous to a nonparametric measure of $d'$
- Precision and recall are SDT concepts restricted to the positive predictions
- The Neyman-Pearson lemma explains why the likelihood ratio (or a monotone transform of it, like calibrated probabilities) is the optimal scoring function
The key insight: a well-calibrated classifier with $P(S \mid x)$ as its score function is computing the posterior, which is a monotone function of the likelihood ratio when the prior is fixed. Thresholding this posterior at different values traces the ROC curve.
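The monotone link between posterior and likelihood ratio is just Bayes' rule in odds form: posterior odds = likelihood ratio times prior odds. A minimal sketch, assuming an equal-variance Gaussian example with $\mu_N = 0$, $\mu_S = 1.5$, and a fixed prior $P(S) = 0.2$ (illustrative values):

```python
import numpy as np
from scipy.stats import norm

mu_N, mu_S, prior_S = 0.0, 1.5, 0.2     # assumed example; fixed prior P(S)

xs = np.linspace(-3.0, 5.0, 101)
lr = norm.pdf(xs, mu_S, 1.0) / norm.pdf(xs, mu_N, 1.0)   # likelihood ratio
prior_odds = prior_S / (1.0 - prior_S)

# Bayes' rule in odds form: posterior odds = Lambda(x) * prior odds.
posterior = lr * prior_odds / (lr * prior_odds + 1.0)    # P(S | x)
```

Because the posterior increases whenever the likelihood ratio does, sweeping a threshold on either one visits the same sequence of (F, H) operating points.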
Common Confusions
d-prime requires equal-variance Gaussian assumption
$d'$ is computed as $d' = z(H) - z(F)$ only when both distributions are Gaussian with the same variance. If the variances differ, the ROC curve on normal-normal axes (zROC) is a straight line with slope $\sigma_N / \sigma_S$, and a single $d'$ no longer fully characterizes performance. In ML, the equal-variance assumption rarely holds, so AUC (which is nonparametric) is preferred over $d'$.
High AUC does not mean the classifier is useful at your operating point
AUC averages over all thresholds. A classifier with AUC = 0.95 might perform poorly at the specific false positive rate your application requires. Always examine the ROC curve (or precision-recall curve) at the operating point relevant to your deployment, not just the aggregate AUC.
ROC curves can be misleading with class imbalance
With extreme class imbalance (e.g., 1% positive, 99% negative), a classifier can achieve high AUC by ranking well without achieving useful precision. A model that assigns slightly higher scores to the rare positive class achieves good AUC, but when you threshold to get high recall, precision may be very low. Use precision-recall curves for imbalanced problems.
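The imbalance failure can be made concrete with closed-form SDT quantities. A minimal sketch, assuming $d' = 2$ (a strong ranker, AUC above 0.9) but only 1% positives, with $\mu_N = 0$, $\mu_S = d'$, $\sigma = 1$ (illustrative parameters):

```python
import numpy as np
from scipy.stats import norm

d_prime, prior_S = 2.0, 0.01     # assumed: good ranking, but only 1% positives

auc = norm.cdf(d_prime / np.sqrt(2))          # high AUC: looks excellent

# Operating point chosen for 90% recall (hit rate).
c = d_prime - norm.ppf(0.90)
recall = 1 - norm.cdf(c - d_prime)            # = 0.90 by construction
fa_rate = 1 - norm.cdf(c)
# Precision mixes in the base rate, so rare positives drag it down hard.
precision = prior_S * recall / (prior_S * recall + (1 - prior_S) * fa_rate)
```

Despite an AUC above 0.9, precision at 90% recall lands in the low single digits: almost every alarm is false, exactly the situation a precision-recall curve exposes and an ROC curve hides.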
Exercises
Problem
Two Gaussian distributions have means $\mu_N$ and $\mu_S > \mu_N$ and common standard deviation $\sigma$. Compute $d'$. If the criterion is set at $c = (\mu_N + \mu_S)/2$ (midpoint), compute the hit rate and false alarm rate in terms of $d'$.
Problem
A binary classifier assigns scores to 5 positive and 5 negative examples. Using the concordance interpretation, compute the AUC as the fraction of the 25 positive-negative pairs in which the positive example outscores the negative one (count ties as 1/2).
Problem
Show that for the equal-variance Gaussian model, $\mathrm{AUC} = \Phi\!\big(d'/\sqrt{2}\big)$. Start from the concordance definition $\mathrm{AUC} = P(X_S > X_N)$, where $X_S \sim \mathcal{N}(\mu_S, \sigma^2)$ and $X_N \sim \mathcal{N}(\mu_N, \sigma^2)$.
Problem
An observer in a psychophysics experiment has sensitivity $d'$ and sets a criterion with bias $\beta > 1$ (conservative). For the equal-variance Gaussian model with $\mu_N = 0$ and $\sigma = 1$, find the criterion location $c$ in terms of $d'$ and $\beta$, then express the hit rate and false alarm rate.
References
Canonical:
- Green & Swets, Signal Detection Theory and Psychophysics (1966), Chapters 1-4
- Macmillan & Creelman, Detection Theory: A User's Guide (2nd ed., 2004), Chapters 1-3
Current:
- Fawcett, "An Introduction to ROC Analysis" (Pattern Recognition Letters, 2006)
- Saito & Rehmsmeier, "The Precision-Recall Plot Is More Informative than the ROC Plot" (PLOS ONE, 2015)
- Hand, "Measuring Classifier Performance: A Coherent Alternative to the Area Under the ROC Curve" (Machine Learning, 2009)
- Wickens, Elementary Signal Detection Theory (2002), Chapters 2-5
Next Topics
- Confusion matrices and classification metrics: the full taxonomy of classification evaluation metrics, directly built on SDT concepts
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Common Probability Distributions (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
- Hypothesis Testing for ML (Layer 2)