Beta. Content is under active construction and has not been peer-reviewed. Report errors on GitHub.


Signal Detection Theory

The mathematical framework for binary decisions under noise. ROC curves, d-prime, likelihood ratios, the Neyman-Pearson lemma connection, and why SDT is the foundation of both psychophysics and ML classification evaluation.


Why This Matters

Every binary classifier faces the same problem: given a noisy observation, decide whether it came from the "signal" class or the "noise" class. Signal detection theory (SDT) provides the complete mathematical framework for this decision. ROC curves, AUC, sensitivity (d'), and the likelihood ratio test all originate here. SDT was developed in the 1950s as part of detection theory for radar operators deciding whether a blip on the screen was an enemy aircraft or noise. The same mathematics now governs medical diagnosis, spam filtering, and every ML classification metric in confusion matrices. Understanding SDT clarifies why the ROC curve works, what AUC actually measures, and when precision-recall is preferable.

The Basic Model

An observer receives a scalar observation x drawn from one of two distributions:

  • Noise alone (H_0, signal absent): x ~ f_0(x)
  • Signal plus noise (H_1, signal present): x ~ f_1(x)

The observer sets a criterion c and responds "signal present" if x > c, "signal absent" otherwise.

Definition

The Four Outcomes

Given a binary decision and ground truth, there are four possible outcomes:

| | Signal present (H_1) | Signal absent (H_0) |
| --- | --- | --- |
| Respond "yes" | Hit (true positive) | False alarm (false positive) |
| Respond "no" | Miss (false negative) | Correct rejection (true negative) |

The hit rate is P(respond yes | H_1). The false alarm rate is P(respond yes | H_0). These are the fundamental operating characteristics.
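The two rates above can be tallied directly from labels and binary decisions. A minimal sketch; the function name and example data are illustrative, not from the text.

```python
def sdt_outcomes(truth, decision):
    """truth[i] is 1 if signal present (H_1); decision[i] is 1 for 'yes'."""
    hits = sum(1 for t, d in zip(truth, decision) if t == 1 and d == 1)
    misses = sum(1 for t, d in zip(truth, decision) if t == 1 and d == 0)
    false_alarms = sum(1 for t, d in zip(truth, decision) if t == 0 and d == 1)
    correct_rejections = sum(1 for t, d in zip(truth, decision) if t == 0 and d == 0)
    return {
        "hit_rate": hits / (hits + misses),                        # P(yes | H_1)
        "false_alarm_rate": false_alarms / (false_alarms + correct_rejections),  # P(yes | H_0)
    }

truth    = [1, 1, 1, 1, 0, 0, 0, 0]
decision = [1, 1, 1, 0, 1, 0, 0, 0]
rates = sdt_outcomes(truth, decision)  # hit rate 3/4, false alarm rate 1/4
```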

Sensitivity: d' (d-prime)

Definition

d-prime

When both distributions are Gaussian with equal variance σ²:

f_0(x) = \mathcal{N}(\mu_0, \sigma^2), \quad f_1(x) = \mathcal{N}(\mu_1, \sigma^2)

the sensitivity d' is the standardized distance between the means:

d' = \frac{\mu_1 - \mu_0}{\sigma}

d' = 0 means the signal and noise distributions are identical (chance performance). Larger d' means the signal is more discriminable from noise. d' is independent of the criterion c, measuring the observer's ability to discriminate regardless of bias.
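The criterion-independence of d' can be checked numerically: for any criterion c, the equal-variance model gives d' = z(hit rate) − z(false alarm rate), where z is the inverse standard normal CDF. A sketch using the stdlib `statistics.NormalDist`; parameter values are illustrative.

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal: Z.cdf is Phi, Z.inv_cdf is its inverse

mu0, mu1, sigma = 0.0, 1.5, 1.0
d_prime = (mu1 - mu0) / sigma  # the definition

# For any criterion c, hit and false alarm rates are tail probabilities:
c = 0.5
hit_rate = 1 - Z.cdf((c - mu1) / sigma)  # P(x > c | H_1)
fa_rate  = 1 - Z.cdf((c - mu0) / sigma)  # P(x > c | H_0)

# d' recovered from the rates, independent of where c sits:
d_prime_from_rates = Z.inv_cdf(hit_rate) - Z.inv_cdf(fa_rate)
assert abs(d_prime - d_prime_from_rates) < 1e-7
```

Moving c shifts both rates, but their z-transformed difference stays fixed at d'.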

Definition

Criterion and Bias

The criterion c is the threshold on the observation axis. The observer responds "signal" when x > c. The criterion determines the tradeoff between hits and false alarms:

  • Liberal criterion (low c): high hit rate, high false alarm rate
  • Conservative criterion (high c): low hit rate, low false alarm rate

The bias β is defined as the likelihood ratio at the criterion:

\beta = \frac{f_1(c)}{f_0(c)}

An unbiased observer sets β = 1 (criterion at the intersection of the two distributions). β > 1 is conservative, β < 1 is liberal.

The Likelihood Ratio

Definition

Likelihood Ratio

The likelihood ratio for observation x is:

\Lambda(x) = \frac{f_1(x)}{f_0(x)}

This is the ratio of the probability of observing x under signal-present versus signal-absent. The likelihood ratio is a sufficient statistic for the binary decision: all information relevant to the decision is captured by Λ(x).

For the equal-variance Gaussian case:

\Lambda(x) = \frac{f_1(x)}{f_0(x)} = \exp\left(\frac{\mu_1 - \mu_0}{\sigma^2}\left(x - \frac{\mu_1 + \mu_0}{2}\right)\right)

Since Λ(x) is a monotone function of x in this case, thresholding x is equivalent to thresholding Λ(x). In general (non-Gaussian, unequal variance), the optimal decision rule thresholds Λ(x), not x directly.
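The closed form can be verified against the direct density ratio, and its monotonicity in x checked numerically. A sketch with illustrative parameters.

```python
from math import exp
from statistics import NormalDist

mu0, mu1, sigma = 0.0, 2.0, 1.0
f0 = NormalDist(mu0, sigma)
f1 = NormalDist(mu1, sigma)

def lr_direct(x):
    """Likelihood ratio as a ratio of densities."""
    return f1.pdf(x) / f0.pdf(x)

def lr_closed_form(x):
    """Closed form for the equal-variance Gaussian case."""
    return exp((mu1 - mu0) / sigma**2 * (x - (mu1 + mu0) / 2))

for x in [-1.0, 0.0, 0.7, 2.5]:
    assert abs(lr_direct(x) - lr_closed_form(x)) < 1e-9

# Monotone in x: thresholding x is equivalent to thresholding Lambda(x).
assert lr_closed_form(1.0) < lr_closed_form(1.5) < lr_closed_form(2.0)
```

At the midpoint x = (μ_0 + μ_1)/2 the ratio equals 1, matching the unbiased criterion above.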

The Neyman-Pearson Lemma

Lemma

Neyman-Pearson Lemma

Statement

Among all decision rules with false alarm rate at most α, the rule that maximizes the hit rate (power) is the likelihood ratio test: respond "signal" if

\Lambda(x) = \frac{f_1(x)}{f_0(x)} > \eta

where η is chosen so that P(Λ(x) > η | H_0) = α.

No other test with the same false alarm rate can achieve a higher hit rate.

Intuition

The likelihood ratio Λ(x) ranks observations by how much more likely they are under H_1 than under H_0. If you can only afford a false alarm rate of α, you should spend your "budget" on the observations most indicative of H_1. The likelihood ratio test does exactly this: it rejects H_0 for the observations with the strongest evidence for H_1.

Proof Sketch

Let ϕ* be the likelihood ratio test with threshold η achieving false alarm rate α, and let ϕ be any other test with false alarm rate at most α. Consider the difference in power:

\int (\phi^*(x) - \phi(x)) f_1(x) \, dx

On the region where ϕ* = 1 (i.e., Λ(x) > η), we have f_1(x) > η f_0(x), so (ϕ* − ϕ) f_1 ≥ η (ϕ* − ϕ) f_0. On the region where ϕ* = 0, we have f_1(x) ≤ η f_0(x) and ϕ* − ϕ ≤ 0, so again (ϕ* − ϕ) f_1 ≥ η (ϕ* − ϕ) f_0. Integrating:

\int (\phi^* - \phi) f_1 \, dx \geq \eta \int (\phi^* - \phi) f_0 \, dx \geq 0

The last inequality holds because ϕ has false alarm rate at most α = ∫ ϕ* f_0 dx, so ∫ (ϕ* − ϕ) f_0 dx ≥ 0. Therefore the power of ϕ* is at least the power of ϕ.

Why It Matters

The Neyman-Pearson lemma is the theoretical foundation of ROC analysis. Each point on the ROC curve corresponds to a specific threshold η on the likelihood ratio. The ROC curve traces out the optimal tradeoff between hit rate and false alarm rate. Any decision rule that does not use the likelihood ratio is suboptimal: it achieves a point below the ROC curve. In ML, a well-calibrated classifier's predicted probability is a monotone transform of the likelihood ratio, and thresholding it traces out the same ROC curve.

Failure Mode

The lemma requires known, fully specified distributions f_0 and f_1 (simple hypotheses). When the distributions have unknown parameters (composite hypotheses), the likelihood ratio test is no longer uniformly most powerful. In ML, the true class-conditional distributions are unknown, so ROC curves are estimated empirically. The lemma guarantees optimality in the idealized setting; in practice, the quality of the ROC depends on how well the classifier's scores approximate the true likelihood ratio.

ROC Curves

Definition

Receiver Operating Characteristic (ROC) Curve

The ROC curve plots the hit rate (true positive rate) against the false alarm rate (false positive rate) as the criterion c varies from +∞ to −∞:

  • x-axis: false alarm rate = P(respond yes | H_0) = \int_c^{\infty} f_0(x) \, dx
  • y-axis: hit rate = P(respond yes | H_1) = \int_c^{\infty} f_1(x) \, dx

A perfect discriminator has an ROC curve passing through (0, 1) (zero false alarms, 100% hit rate). A random guesser lies on the diagonal from (0, 0) to (1, 1).

For the equal-variance Gaussian model, the ROC curve has a closed-form parameterization. Let Φ denote the standard normal CDF and Φ⁻¹ its inverse. If the false alarm rate is α = 1 − Φ((c − μ_0)/σ), then:

\text{hit rate} = 1 - \Phi\left(\Phi^{-1}(1 - \alpha) - d'\right)

The ROC curve bows toward the upper-left corner. The larger d', the more the curve bows, indicating better discrimination.
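The binormal parameterization is one line of code, and its two limiting behaviors are easy to confirm: d' = 0 gives the chance diagonal, and larger d' lifts the curve. A sketch; parameter values are illustrative.

```python
from statistics import NormalDist

Z = NormalDist()

def binormal_roc(alpha, d_prime):
    """Hit rate at false alarm rate alpha, equal-variance Gaussian model."""
    return 1 - Z.cdf(Z.inv_cdf(1 - alpha) - d_prime)

for alpha in [0.05, 0.2, 0.5, 0.8]:
    # d' = 0: the curve collapses onto the chance diagonal (hit rate = alpha)
    assert abs(binormal_roc(alpha, 0.0) - alpha) < 1e-9
    # larger d' bows the curve upward at every operating point
    assert binormal_roc(alpha, 2.0) > binormal_roc(alpha, 1.0) > alpha
```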

Definition

Area Under the ROC Curve (AUC)

The AUC is the integral of the ROC curve over [0, 1]:

\text{AUC} = \int_0^1 \text{TPR}(\text{FPR}) \, d(\text{FPR})

AUC has a probabilistic interpretation: it equals the probability that a randomly chosen signal observation scores higher than a randomly chosen noise observation:

\text{AUC} = P(X_1 > X_0) \quad \text{where } X_1 \sim f_1, \; X_0 \sim f_0

For the equal-variance Gaussian model: AUC = Φ(d'/√2).
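The closed form AUC = Φ(d'/√2) can be checked by Monte Carlo estimation of P(X_1 > X_0). A sketch; the seed and sample size are illustrative choices.

```python
import random
from math import sqrt
from statistics import NormalDist

random.seed(0)
mu0, mu1, sigma = 0.0, 1.0, 1.0
d_prime = (mu1 - mu0) / sigma

# Monte Carlo estimate of P(X1 > X0) over independent signal/noise draws
n = 200_000
wins = sum(random.gauss(mu1, sigma) > random.gauss(mu0, sigma) for _ in range(n))
auc_mc = wins / n

auc_closed = NormalDist().cdf(d_prime / sqrt(2))  # Phi(d'/sqrt(2))
assert abs(auc_mc - auc_closed) < 0.01
```

The √2 appears because X_1 − X_0 has variance 2σ²; the derivation is the subject of the advanced exercise below.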

Proposition

AUC as Concordance Probability

Statement

Let X_1 ~ f_1 and X_0 ~ f_0 be independent draws from the signal and noise distributions respectively. Then:

\text{AUC} = P(X_1 > X_0)

This equals the Wilcoxon-Mann-Whitney U statistic normalized to [0, 1].

Intuition

AUC measures how well the scoring function separates the two classes. If every signal observation scores higher than every noise observation, AUC = 1. If scores are completely random with respect to class, AUC = 0.5. AUC is the probability that the classifier correctly ranks a random signal-noise pair.

Proof Sketch

Write AUC = ∫_0^1 TPR(α) dα. Change variables with α = P(X_0 > c), so dα = −f_0(c) dc and c runs from +∞ to −∞ as α runs from 0 to 1:

\text{AUC} = \int_{-\infty}^{\infty} P(X_1 > c) f_0(c) \, dc = P(X_1 > X_0)

The equivalence to the Mann-Whitney U statistic follows from the fact that U/(n_0 n_1) is the empirical estimate of P(X_1 > X_0) computed over all pairs.

Why It Matters

This interpretation explains why AUC is threshold-independent: it averages over all possible operating points. In ML, AUC is the standard metric when the cost of false positives versus false negatives is unknown or when different deployment scenarios require different thresholds. AUC of 0.5 means the model is no better than random; AUC of 1.0 means perfect separation.
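The concordance interpretation gives a direct way to compute an empirical AUC: count the fraction of signal-noise score pairs ranked correctly, with ties counting half. A minimal sketch; the scores are illustrative.

```python
def auc_concordance(pos_scores, neg_scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos_scores
        for n in neg_scores
    )
    return wins / (len(pos_scores) * len(neg_scores))

pos = [0.95, 0.7, 0.65, 0.3]
neg = [0.6, 0.5, 0.25, 0.1]
auc = auc_concordance(pos, neg)  # 14 of 16 pairs concordant -> 0.875
```

This is exactly the Mann-Whitney U divided by n_0·n_1; no explicit thresholds or ROC construction is needed.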

Failure Mode

AUC weights all thresholds equally, including regions of very high false alarm rate that are irrelevant in practice. If you care only about performance at low false positive rates (common in medical screening, fraud detection), the partial AUC over the relevant FPR range is more informative. Precision-recall curves are preferred when the negative class vastly outnumbers the positive class, because AUC can be misleadingly high when most predictions are correct simply by predicting the majority class.

Origins: Psychophysics

SDT was formalized by Green and Swets (1966) for psychophysics experiments. A classic example: an observer listens to intervals of noise and must decide whether a faint tone was present. The observer's internal representation is a noisy scalar, and d' measures perceptual sensitivity independent of the observer's willingness to say "yes." Before SDT, psychophysics conflated sensitivity with bias. Two observers with the same perceptual ability but different response biases (one cautious, one trigger-happy) would appear to have different detection thresholds. SDT separated these two factors cleanly.

Connection to ML Classification

Modern ML evaluation is a direct descendant of SDT:

  • A classifier's predicted probability or score plays the role of the observation x
  • The ROC curve is constructed by sweeping the classification threshold
  • AUC measures overall discrimination ability, analogous to a nonparametric measure of d'
  • Precision and recall are SDT concepts restricted to the positive predictions
  • The Neyman-Pearson lemma explains why the likelihood ratio (or a monotone transform of it, like calibrated probabilities) is the optimal scoring function

The key insight: a well-calibrated classifier with P(class = 1 | x) as its score function is computing the posterior, which is a monotone function of the likelihood ratio when the prior is fixed. Thresholding this posterior at different values traces the ROC curve.
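Sweeping the threshold over a score set and integrating the resulting empirical ROC by the trapezoid rule recovers the same AUC as the pairwise concordance count (when there are no tied scores). A sketch; scores are illustrative.

```python
def roc_points(pos_scores, neg_scores):
    """Empirical ROC: (FPR, TPR) points swept from the highest threshold down."""
    thresholds = sorted(set(pos_scores) | set(neg_scores), reverse=True)
    pts = [(0.0, 0.0)]
    for t in thresholds:  # lowering the threshold admits more "yes" responses
        tpr = sum(s >= t for s in pos_scores) / len(pos_scores)
        fpr = sum(s >= t for s in neg_scores) / len(neg_scores)
        pts.append((fpr, tpr))
    return pts

def auc_trapezoid(pts):
    """Integrate TPR over FPR by the trapezoid rule."""
    return sum(
        (x2 - x1) * (y1 + y2) / 2 for (x1, y1), (x2, y2) in zip(pts, pts[1:])
    )

pos = [0.95, 0.7, 0.65, 0.3]
neg = [0.6, 0.5, 0.25, 0.1]
auc = auc_trapezoid(roc_points(pos, neg))  # 0.875, same as the pairwise count
```

Libraries such as scikit-learn implement the same construction; the point here is that the ROC sweep and the concordance count are two views of one quantity.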

Common Confusions

Watch Out

d-prime requires equal-variance Gaussian assumption

d' is defined as (μ_1 − μ_0)/σ only when both distributions are Gaussian with the same variance. If the variances differ, the ROC curve on normal-normal axes (zROC) is a straight line with slope σ_0/σ_1 ≠ 1, and d' no longer fully characterizes performance. In ML, the equal-variance assumption rarely holds, so AUC (which is nonparametric) is preferred over d'.

Watch Out

High AUC does not mean the classifier is useful at your operating point

AUC averages over all thresholds. A classifier with AUC = 0.95 might perform poorly at the specific false positive rate your application requires. Always examine the ROC curve (or precision-recall curve) at the operating point relevant to your deployment, not just the aggregate AUC.

Watch Out

ROC curves can be misleading with class imbalance

With extreme class imbalance (e.g., 1% positive, 99% negative), a classifier can achieve high AUC by ranking well without achieving useful precision. A model that assigns slightly higher scores to the rare positive class achieves good AUC, but when you threshold to get high recall, precision may be very low. Use precision-recall curves for imbalanced problems.
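A deterministic toy example makes this failure mode concrete: a classifier that ranks well overall (high AUC) can still deliver poor precision at the threshold needed for full recall. The scores below are illustrative, not from the text.

```python
pos = [0.8] * 10                 # 10 rare positives
neg = [0.1] * 950 + [0.9] * 40   # 990 negatives; a few scored very high

# AUC via the concordance count: the ranking looks excellent (~0.96)
auc = sum(
    1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg
) / (len(pos) * len(neg))

# Threshold at the lowest score achieving 100% recall
t = 0.8
tp = sum(s >= t for s in pos)    # 10
fp = sum(s >= t for s in neg)    # 40
precision = tp / (tp + fp)       # 0.2: four of five flagged items are wrong
```

Only 40 of 990 negatives outrank the positives, so AUC barely suffers, but those 40 dominate the predicted-positive set; a precision-recall curve exposes this immediately.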

Exercises

ExerciseCore

Problem

Two Gaussian distributions have μ_0 = 0, μ_1 = 2, and σ = 1. Compute d'. If the criterion is set at c = 1 (midpoint), compute the hit rate and false alarm rate.

ExerciseCore

Problem

A binary classifier assigns scores to 5 positive and 5 negative examples. The positive scores are {0.9, 0.8, 0.6, 0.55, 0.4} and the negative scores are {0.7, 0.5, 0.35, 0.2, 0.1}. Compute the AUC using the concordance interpretation.

ExerciseAdvanced

Problem

Show that for the equal-variance Gaussian model, AUC = Φ(d'/√2). Start from the concordance definition AUC = P(X_1 > X_0) where X_1 ~ N(μ_1, σ²) and X_0 ~ N(μ_0, σ²).

ExerciseAdvanced

Problem

An observer in a psychophysics experiment has d' = 1.5 and sets a criterion corresponding to β = 2 (conservative). For the equal-variance Gaussian model with σ = 1, find the criterion location c, the hit rate, and the false alarm rate.

References

Canonical:

  • Green & Swets, Signal Detection Theory and Psychophysics (1966), Chapters 1-4
  • Macmillan & Creelman, Detection Theory: A User's Guide (2nd ed., 2004), Chapters 1-3

Current:

  • Fawcett, "An Introduction to ROC Analysis" (Pattern Recognition Letters, 2006)
  • Saito & Rehmsmeier, "The Precision-Recall Plot Is More Informative than the ROC Plot" (PLOS ONE, 2015)
  • Hand, "Measuring Classifier Performance: A Coherent Alternative to the Area Under the ROC Curve" (Machine Learning, 2009)
  • Wickens, Elementary Signal Detection Theory (2002), Chapters 2-5


Last reviewed: April 2026
