Methodology
ROC Curve and AUC
Theory of the receiver operating characteristic curve. AUC as a Wilcoxon-Mann-Whitney probability, cost-weighted operating-point selection, the ROC convex hull, the Neyman-Pearson connection, and partial AUC for restricted FPR regions.
Why This Matters
The ROC curve is a threshold-free summary of how well a continuous score separates two classes. The basic definitions (TPR, FPR, AUC) are covered on the confusion matrices and classification metrics page. This page is the theory layer: what AUC actually equals, how to pick an operating point, when ROC analysis breaks down, and how the ROC frontier connects to the Neyman-Pearson lemma in detection theory.
ROC curves for likelihood-ratio detectors. The Neyman-Pearson optimum is the upper envelope at each false-alarm rate $\alpha$.
The diagram shows a Gaussian mean-shift detector. The smooth ROC interior is what you get when the score distributions on the two classes are continuous and overlap; the diagonal is what you get from random guessing.
Setup
Let $x$ be an input, $y \in \{0, 1\}$ a binary label, and $s(x) \in \mathbb{R}$ a continuous score (higher = more likely positive). For a threshold $t$, the predicted label is $\hat{y} = \mathbf{1}[s(x) \ge t]$. Define

$$\mathrm{TPR}(t) = P(s(X) \ge t \mid Y = 1), \qquad \mathrm{FPR}(t) = P(s(X) \ge t \mid Y = 0).$$

The ROC curve is the parametric path $t \mapsto (\mathrm{FPR}(t), \mathrm{TPR}(t))$ as $t$ sweeps from $+\infty$ (origin) to $-\infty$ (top-right corner). AUC is the area under this curve, $\mathrm{AUC} = \int_0^1 \mathrm{TPR}\, d\mathrm{FPR}$.
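A minimal numpy sketch makes the sweep concrete (the helper name `roc_points` and the descending threshold order are illustrative choices, not a standard API):

```python
import numpy as np

def roc_points(scores, labels):
    """Empirical ROC path: sweep every distinct score as a threshold."""
    thresholds = np.unique(scores)[::-1]   # high to low, so FPR increases
    pos, neg = labels == 1, labels == 0
    tpr = np.array([(scores[pos] >= t).mean() for t in thresholds])
    fpr = np.array([(scores[neg] >= t).mean() for t in thresholds])
    # Prepend the origin: a threshold above every score predicts no positives.
    return np.concatenate([[0.0], fpr]), np.concatenate([[0.0], tpr])
```

Tied scores share a threshold, so the path takes a diagonal step through a tie rather than an axis-aligned one.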
AUC as a Ranking Probability
AUC equals the Wilcoxon-Mann-Whitney probability
Statement
Let $S^+ \sim F_1$ and $S^- \sim F_0$ be independent draws from the two class-conditional score distributions. Then

$$\mathrm{AUC} = P(S^+ > S^-).$$

When ties occur with positive probability, the standard tie-aware version is

$$\mathrm{AUC} = P(S^+ > S^-) + \tfrac{1}{2}\, P(S^+ = S^-),$$
which is what trapezoidal-rule integration of the empirical ROC computes.
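The equivalence is easy to check numerically. A sketch, assuming the `roc_points` helper from the Setup section is in scope (the synthetic Gaussian scores are arbitrary test data):

```python
import numpy as np

def auc_pair_count(scores_pos, scores_neg):
    """Tie-aware WMW statistic: P(S+ > S-) + 0.5 * P(S+ = S-) over all pairs."""
    diff = scores_pos[:, None] - scores_neg[None, :]
    return (diff > 0).mean() + 0.5 * (diff == 0).mean()

rng = np.random.default_rng(0)
scores_pos = rng.normal(1.0, 1.0, 300).round(1)   # rounding forces some ties
scores_neg = rng.normal(0.0, 1.0, 300).round(1)

scores = np.concatenate([scores_pos, scores_neg])
labels = np.concatenate([np.ones(300), np.zeros(300)])
fpr, tpr = roc_points(scores, labels)

print(auc_pair_count(scores_pos, scores_neg))  # ~0.76 for a unit mean shift
print(np.trapz(tpr, fpr))                      # identical, per Bamber (1975)
```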
Intuition
AUC measures ranking quality: how often the classifier orders a random positive above a random negative. AUC = 0.5 is random guessing; AUC = 1 is perfect separation. Calibration is a separate property; AUC is invariant to any monotone transformation of the score.
Proof Sketch
Let $F_0$ be the CDF of negative scores with density $f_0$, so $\mathrm{FPR}(t) = 1 - F_0(t)$ and $d\mathrm{FPR} = -f_0(t)\, dt$. Substitute into the area integral:

$$\mathrm{AUC} = \int_0^1 \mathrm{TPR}\, d\mathrm{FPR} = \int_{-\infty}^{\infty} P(S^+ > t)\, f_0(t)\, dt.$$

The right side is $P(S^+ > S^-)$ by Fubini: condition on $S^- = t$ and integrate over the negative-score density. The tie-aware version follows from replacing the strict inequality with the symmetric form $\mathbf{1}[S^+ > S^-] + \tfrac{1}{2}\,\mathbf{1}[S^+ = S^-]$.
Why It Matters
The Wilcoxon-Mann-Whitney connection identifies AUC with a U-statistic, giving distribution-free standard errors (Hanley and McNeil 1982) and a clean way to test whether two AUCs differ (DeLong et al. 1988). It also shows why AUC is invariant to monotone score transformations: the ranking is what matters.
Failure Mode
The AUC is a marginal-rank summary. It can be high while the precision at every useful operating point is low (under heavy class imbalance), and it can hide localized failures because it averages over all thresholds. It is also silent on calibration: the WMW probability certifies the ranking, not the scores themselves, so a well-ranked score can still be badly miscalibrated.
Cost-Weighted Operating Points
A trained classifier provides scores; the operating point is the threshold that turns scores into predictions. Choosing it depends on costs. Let $c_{\mathrm{FN}}$ be the cost of a false negative, $c_{\mathrm{FP}}$ the cost of a false positive, and $\pi = P(Y = 1)$ the class prior.
The expected loss at threshold $t$ is

$$L(t) = c_{\mathrm{FN}}\, \pi\, \bigl(1 - \mathrm{TPR}(t)\bigr) + c_{\mathrm{FP}}\, (1 - \pi)\, \mathrm{FPR}(t).$$

Geometrically, level sets of the loss are parallel lines in ROC space, called iso-cost lines, with slope

$$s = \frac{c_{\mathrm{FP}}\,(1 - \pi)}{c_{\mathrm{FN}}\, \pi}.$$

The Bayes-optimal operating point is the ROC point at which an iso-cost line of this slope is tangent to the ROC curve (or to its convex hull). When costs are symmetric and classes are balanced, $s = 1$ and the tangent point sits where the ROC curve bends most sharply.
The 0.5 threshold is universally optimal under 0-1 loss
The 0.5 threshold is optimal only when costs are symmetric and the classifier outputs calibrated posterior probabilities $P(Y = 1 \mid X = x)$. Asymmetric costs shift the threshold to

$$t^* = \frac{c_{\mathrm{FP}}}{c_{\mathrm{FP}} + c_{\mathrm{FN}}}$$

on the calibrated posterior; uncalibrated scores require a separate threshold sweep on validation data.
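Both routes fit in a few lines; a sketch with illustrative cost values (the function names are our own):

```python
import numpy as np

def bayes_threshold(c_fn, c_fp):
    """Optimal cutoff on a *calibrated* posterior P(Y=1 | x)."""
    return c_fp / (c_fp + c_fn)

def best_threshold_by_sweep(scores, labels, c_fn, c_fp):
    """Empirical cost minimization for uncalibrated scores, on validation data."""
    def cost(t):
        pred = scores >= t
        fn = np.sum((labels == 1) & ~pred)   # missed positives
        fp = np.sum((labels == 0) & pred)    # false alarms
        return c_fn * fn + c_fp * fp
    return min(np.unique(scores), key=cost)

# A false negative 10x as costly as a false positive pushes the cutoff down:
print(bayes_threshold(c_fn=10.0, c_fp=1.0))   # 1/11, roughly 0.09
```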
Convex Hull and Randomization
A ROC curve produced by a single classifier with one continuous score is not always concave. Suppose the curve dips below the chord joining two of its operating points $(\mathrm{FPR}_1, \mathrm{TPR}_1)$ and $(\mathrm{FPR}_2, \mathrm{TPR}_2)$. Any ROC point on that chord is achievable by randomizing between the two corresponding classifiers in the appropriate proportion.
Formally, with probability $\lambda \in [0, 1]$ use the classifier at $(\mathrm{FPR}_1, \mathrm{TPR}_1)$ and with probability $1 - \lambda$ use the classifier at $(\mathrm{FPR}_2, \mathrm{TPR}_2)$. The combined classifier achieves

$$(\mathrm{FPR}, \mathrm{TPR}) = \bigl(\lambda\, \mathrm{FPR}_1 + (1 - \lambda)\, \mathrm{FPR}_2,\ \lambda\, \mathrm{TPR}_1 + (1 - \lambda)\, \mathrm{TPR}_2\bigr).$$
This means the ROC convex hull is the achievable frontier, not the raw ROC curve. Any iso-cost optimal operating point lies on the hull, possibly via randomization between two deterministic operating points.
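Computing the hull from empirical ROC points is a standard upper-convex-hull problem; a monotone-chain sketch (one possible implementation, assuming the input includes the (0, 0) and (1, 1) endpoints):

```python
def roc_convex_hull(fpr, tpr):
    """Upper convex hull of ROC points: the achievable frontier."""
    hull = []
    for p in sorted(zip(fpr, tpr)):
        # Pop the last hull point while it lies on or below the chord
        # from the point before it to the new point p.
        while len(hull) >= 2:
            (x0, y0), (x1, y1) = hull[-2], hull[-1]
            if (x1 - x0) * (p[1] - y0) >= (y1 - y0) * (p[0] - x0):
                hull.pop()
            else:
                break
        hull.append(p)
    return hull

# The dip at (0.4, 0.5) falls below the chord from (0.2, 0.6) to (1, 1):
print(roc_convex_hull([0.0, 0.2, 0.4, 1.0], [0.0, 0.6, 0.5, 1.0]))
# [(0.0, 0.0), (0.2, 0.6), (1.0, 1.0)]
```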
Connection to the Neyman-Pearson Lemma
Detection theory reaches the same frontier from a different direction. The Neyman-Pearson lemma says the most powerful test of $H_0$ versus $H_1$ at significance level $\alpha$ is the likelihood-ratio test:

$$\text{reject } H_0 \iff \Lambda(x) = \frac{p(x \mid H_1)}{p(x \mid H_0)} > \eta,$$

with $\eta$ chosen so that $P(\Lambda(X) > \eta \mid H_0) = \alpha$. As $\alpha$ varies, the family of NP-optimal tests traces out a curve in $(\alpha, \text{power})$ space, and that curve is the ROC convex hull of the underlying problem.
So the ROC convex hull characterizes the best achievable trade-off between Type I and Type II error for the given class-conditional distributions, regardless of which classifier you use. Any classifier strictly inside the hull is dominated; any classifier on the hull is admissible.
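For the Gaussian mean-shift detector from the diagram, the NP frontier has a closed form; a sketch with an assumed unit-variance shift $\mu$ (scipy's `norm` supplies the Gaussian CDF and quantile):

```python
from scipy.stats import norm

def np_power(alpha, mu=1.0):
    """NP-optimal power for H0: N(0,1) vs H1: N(mu,1) at false-alarm rate alpha.

    The likelihood ratio is monotone in x, so the NP test thresholds x itself.
    """
    eta = norm.ppf(1.0 - alpha)        # threshold with Type I error alpha
    return 1.0 - norm.cdf(eta - mu)    # detection probability under H1

for a in (0.01, 0.05, 0.10):
    print(a, round(np_power(a), 3))
```

Because the likelihood ratio is continuous here, this ROC is already concave and coincides with its hull.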
Partial AUC
Many real applications care only about the low-FPR region of the ROC curve: medical screening cannot tolerate FPR > 0.10, fraud detection cannot tolerate FPR > 0.01. The partial AUC restricts the integration:

$$\mathrm{pAUC}(\alpha_0) = \int_0^{\alpha_0} \mathrm{TPR}\, d\mathrm{FPR}.$$

The standardized partial AUC divides by $\alpha_0$, the maximum achievable area on $[0, \alpha_0]$, giving a number in $[0, 1]$ comparable to ordinary AUC. Two classifiers with identical full AUC can have very different partial-AUC scores: one might dominate at low FPR while the other catches up at high FPR. When deployment constraints fix the operating regime, partial AUC is the honest summary.
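A sketch of the restricted integral, assuming (fpr, tpr) arrays sorted by increasing FPR that start at the origin; the TPR at the cutoff is linearly interpolated so the region ends exactly at $\alpha_0$:

```python
import numpy as np

def partial_auc(fpr, tpr, alpha0, standardize=True):
    """Area under the ROC restricted to FPR in [0, alpha0]."""
    fpr, tpr = np.asarray(fpr), np.asarray(tpr)
    tpr_cut = np.interp(alpha0, fpr, tpr)          # TPR at the FPR cutoff
    keep = fpr < alpha0
    f = np.concatenate([fpr[keep], [alpha0]])
    t = np.concatenate([tpr[keep], [tpr_cut]])
    area = np.trapz(t, f)
    return area / alpha0 if standardize else area

# On the ROC from the numeric example below, restricted to FPR <= 0.1:
fpr = [0.0, 0.0, 0.2, 0.2, 0.6, 1.0]
tpr = [0.0, 0.4, 0.6, 0.8, 1.0, 1.0]
print(partial_auc(fpr, tpr, alpha0=0.1))   # 0.45, despite full AUC = 0.86
```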
ROC vs Precision-Recall Under Imbalance
Under heavy class imbalance, ROC analysis can mislead. The mechanism: FPR normalizes by the (huge) negative class, while precision normalizes by predicted-positive count, which is dominated by false positives even when FPR is small.
A worked example illustrates the gap. Suppose prevalence is $\pi = 0.001$ (one positive per thousand examples) and the classifier achieves TPR = 0.95, FPR = 0.05 at some threshold. On a test set of 10,000 examples (10 positives, 9,990 negatives):
| Quantity | Value |
|---|---|
| True positives | 0.95 × 10 = 9.5 (expected) |
| False positives | 0.05 × 9,990 = 499.5 |
| Predicted positives | 9.5 + 499.5 = 509 |
| Recall (TPR) | 0.95 |
| Precision | 9.5 / 509 ≈ 0.019 |
The classifier looks excellent on the ROC curve but precision is below 2%. The ROC AUC can sit near 0.95 across the same threshold sweep while every useful operating point has miserable precision. Under heavy imbalance, the precision-recall view (see confusion matrices and classification metrics) is the honest summary.
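The arithmetic is worth scripting once, because the prevalence term is exactly where intuition slips (counts are the expected values from the table):

```python
# Expected confusion counts at TPR = 0.95, FPR = 0.05, prevalence 0.001
n = 10_000
n_pos = int(n * 0.001)    # 10 positives
n_neg = n - n_pos         # 9,990 negatives

tp = 0.95 * n_pos         # 9.5   expected true positives
fp = 0.05 * n_neg         # 499.5 expected false positives

precision = tp / (tp + fp)
print(f"precision = {precision:.4f}")   # ~0.0187: under 2% despite TPR = 0.95
```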
Numeric Example: Computing AUC by Trapezoidal Rule
A binary classifier produces scores on a small test set with 5 positives and 5 negatives. Sweeping the threshold yields the following ROC table:
| Threshold | TPR | FPR |
|---|---|---|
| 1.0 | 0.0 | 0.0 |
| 0.9 | 0.4 | 0.0 |
| 0.7 | 0.6 | 0.2 |
| 0.5 | 0.8 | 0.2 |
| 0.3 | 1.0 | 0.6 |
| 0.0 | 1.0 | 1.0 |
The trapezoidal-rule estimate of the area is

$$\mathrm{AUC} \approx 0.2 \cdot \frac{0.4 + 0.6}{2} + 0.4 \cdot \frac{0.8 + 1.0}{2} + 0.4 \cdot \frac{1.0 + 1.0}{2} = 0.10 + 0.36 + 0.40 = 0.86.$$

(The vertical segments at FPR = 0 and FPR = 0.2 have zero width and contribute nothing.) The same number is recovered by counting positive-negative pairs in which the positive scores higher. With 5 positives and 5 negatives there are 25 such pairs; the trapezoidal area equals the fraction of pairs the positive clearly wins, plus half the fraction left ambiguous by the diagonal segments: $21.5 / 25 = 0.86$. A finer threshold sweep that resolves every pair yields the exact tie-aware U-statistic value.
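The check is a single `np.trapz` call over the table's columns, ordered by increasing FPR:

```python
import numpy as np

fpr = np.array([0.0, 0.0, 0.2, 0.2, 0.6, 1.0])
tpr = np.array([0.0, 0.4, 0.6, 0.8, 1.0, 1.0])
print(np.trapz(tpr, fpr))   # 0.86
```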
Common Confusions
ROC and PR curves are interchangeable summaries
They diverge sharply under class imbalance. A model with ROC AUC = 0.95 can have PR AUC near 0.10 when prevalence is 0.001. Use ROC when classes are balanced or the operating regime is unknown; use PR when positives are rare and you care about precision at deployable recall levels.
High AUC means well-calibrated probabilities
AUC measures ranking only; it is invariant to monotone score transformations. A perfectly ranking model whose scores are squashed by a sigmoid becomes badly miscalibrated without changing AUC at all. Calibration must be measured separately, e.g. with reliability diagrams or proper scoring rules.
The ROC curve is always concave
The empirical ROC of a single classifier on a finite sample need not be concave. The achievable frontier is the convex hull, and intermediate points on the hull are reached by randomizing between adjacent deterministic operating points.
Exercises
Problem
A spam detector outputs scores on a test set with 4 spam and 4 non-spam examples. Sorted by score (highest first), the labels are: spam, spam, non-spam, spam, non-spam, non-spam, spam, non-spam. Compute AUC by counting the fraction of (spam, non-spam) pairs in which the spam scores higher.
Problem
Let two classifiers $A$ and $B$ have the same ROC AUC on a test set, but $A$ dominates on the FPR interval $[0, 0.1]$ while $B$ dominates on $(0.1, 1]$. Argue from the partial-AUC definition that for a screening application that allows FPR at most 0.1, classifier $A$ is strictly preferred regardless of the equal AUCs. State the implicit assumption your argument requires about the operational threshold.
References
Canonical:
- Fawcett, "An Introduction to ROC Analysis," Pattern Recognition Letters 27(8):861-874 (2006).
- Hanley and McNeil, "The Meaning and Use of the Area under a Receiver Operating Characteristic Curve," Radiology 143(1):29-36 (1982).
- Bamber, "The Area above the Ordinal Dominance Graph and the Area below the Receiver Operating Characteristic Graph," Journal of Mathematical Psychology 12(4):387-415 (1975).
- DeLong, DeLong, and Clarke-Pearson, "Comparing the Areas under Two or More Correlated Receiver Operating Characteristic Curves," Biometrics 44(3):837-845 (1988).
Current:
- Davis and Goadrich, "The Relationship between Precision-Recall and ROC Curves," ICML (2006).
- Saito and Rehmsmeier, "The Precision-Recall Plot Is More Informative than the ROC Plot When Evaluating Binary Classifiers on Imbalanced Datasets," PLOS ONE 10(3):e0118432 (2015).
- McClish, "Analyzing a Portion of the ROC Curve," Medical Decision Making 9(3):190-195 (1989). Partial-AUC reference.
- Murphy, Probabilistic Machine Learning: An Introduction (2022), Chapter 5.1 on Bayesian decision theory and Chapter 5.4 on classification metrics.
Frontier:
- Lobo, Jiménez-Valverde, and Real, "AUC: A Misleading Measure of the Performance of Predictive Distribution Models," Global Ecology and Biogeography 17(2):145-151 (2008). Critique of full-AUC under imbalance.
Next Topics
- Proper scoring rules: ranking is one axis of forecast quality; calibration is the other.
- Calibration and uncertainty: how to fix miscalibration without losing ranking quality.
- Confusion matrices and classification metrics: the rest of the binary-classification metric family, including precision-recall curves.
Last reviewed: May 3, 2026
Canonical graph
Required before and derived from this topic
These links come from prerequisite edges in the curriculum graph. Editorial suggestions are shown here only when the target page also cites this page as a prerequisite.
Required prerequisites
- Common Probability Distributions (layer 0A · tier 1)
- Confusion Matrices and Classification Metrics (layer 1 · tier 1)
Derived topics
No published topic currently declares this as a prerequisite.