Methodology
ROC Curve and AUC
Theory of the receiver operating characteristic curve. AUC as a Wilcoxon-Mann-Whitney probability, cost-weighted operating-point selection, the ROC convex hull, the Neyman-Pearson connection, and partial AUC for restricted FPR regions.
Why This Matters
The ROC curve is a threshold-free summary of how well a continuous score separates two classes. The basic definitions (TPR, FPR, AUC) are covered on the confusion matrices and classification metrics page. This page is the theory layer: what AUC actually equals, how to pick an operating point, when ROC analysis breaks down, and how the ROC frontier connects to the Neyman-Pearson lemma in detection theory.
ROC curves for likelihood-ratio detectors. The Neyman-Pearson optimum is the upper envelope at each false-alarm rate $\alpha$.
The diagram shows a Gaussian mean-shift detector. The smooth ROC interior is what you get when the score distributions on the two classes are continuous and overlap; the diagonal is what you get from random guessing.
Setup
Let $x$ be an input, $y \in \{0, 1\}$ a binary label, and $s(x) \in \mathbb{R}$ a continuous score (higher = more likely positive). For a threshold $t$, the predicted label is $\hat{y} = \mathbf{1}[s(x) \ge t]$. Define

$$\mathrm{TPR}(t) = P(s(X) \ge t \mid Y = 1), \qquad \mathrm{FPR}(t) = P(s(X) \ge t \mid Y = 0).$$

The ROC curve is the parametric path $t \mapsto (\mathrm{FPR}(t), \mathrm{TPR}(t))$ as $t$ sweeps from $+\infty$ (origin) to $-\infty$ (top-right corner). AUC is the area under this curve, $\mathrm{AUC} = \int_0^1 \mathrm{TPR}\, d\mathrm{FPR}$.
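A minimal numpy sketch makes the sweep concrete (the helper name `roc_points` and the descending threshold order are illustrative choices, not a standard API):

```python
import numpy as np

def roc_points(scores, labels):
    """Empirical ROC path: sweep every distinct score as a threshold."""
    thresholds = np.unique(scores)[::-1]   # high to low, so FPR increases
    pos, neg = labels == 1, labels == 0
    tpr = np.array([(scores[pos] >= t).mean() for t in thresholds])
    fpr = np.array([(scores[neg] >= t).mean() for t in thresholds])
    # Prepend the origin: a threshold above every score predicts no positives.
    return np.concatenate([[0.0], fpr]), np.concatenate([[0.0], tpr])
```

Tied scores share a threshold, so the path takes a diagonal step through a tie rather than an axis-aligned one.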
AUC as a Ranking Probability
AUC equals the Wilcoxon-Mann-Whitney probability
Statement
Let $S^+ \sim F_1$ and $S^- \sim F_0$ be independent draws from the two class-conditional score distributions. Then

$$\mathrm{AUC} = P(S^+ > S^-).$$

When ties occur with positive probability, the standard tie-aware version is

$$\mathrm{AUC} = P(S^+ > S^-) + \tfrac{1}{2}\, P(S^+ = S^-),$$
which is what trapezoidal-rule integration of the empirical ROC computes.
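The equivalence is easy to check numerically. A sketch, assuming the `roc_points` helper from the Setup section is in scope (the synthetic Gaussian scores are arbitrary test data):

```python
import numpy as np

def auc_pair_count(scores_pos, scores_neg):
    """Tie-aware WMW statistic: P(S+ > S-) + 0.5 * P(S+ = S-) over all pairs."""
    diff = scores_pos[:, None] - scores_neg[None, :]
    return (diff > 0).mean() + 0.5 * (diff == 0).mean()

rng = np.random.default_rng(0)
scores_pos = rng.normal(1.0, 1.0, 300).round(1)   # rounding forces some ties
scores_neg = rng.normal(0.0, 1.0, 300).round(1)

scores = np.concatenate([scores_pos, scores_neg])
labels = np.concatenate([np.ones(300), np.zeros(300)])
fpr, tpr = roc_points(scores, labels)

print(auc_pair_count(scores_pos, scores_neg))  # ~0.76 for a unit mean shift
print(np.trapz(tpr, fpr))                      # identical, per Bamber (1975)
```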
Intuition
AUC measures ranking quality: how often the classifier orders a random positive above a random negative. AUC = 0.5 is random guessing; AUC = 1 is perfect separation. Calibration is a separate property; AUC is invariant to any monotone transformation of the score.
Proof Sketch
Let $F_0$ be the CDF of negative scores with density $f_0$, so $\mathrm{FPR}(t) = 1 - F_0(t)$ and $d\mathrm{FPR} = -f_0(t)\, dt$. Substitute into the area integral:

$$\mathrm{AUC} = \int_0^1 \mathrm{TPR}\, d\mathrm{FPR} = \int_{-\infty}^{\infty} P(S^+ > t)\, f_0(t)\, dt.$$

The right side is $P(S^+ > S^-)$ by Fubini: condition on $S^- = t$ and integrate over the negative-score density. The tie-aware version follows from replacing the strict inequality with the symmetric form $\mathbf{1}[S^+ > S^-] + \tfrac{1}{2}\,\mathbf{1}[S^+ = S^-]$.
Why It Matters
The Wilcoxon-Mann-Whitney connection identifies AUC with a U-statistic, giving distribution-free standard errors (Hanley and McNeil 1982) and a clean way to test whether two AUCs differ (DeLong et al. 1988). It also shows why AUC is invariant to monotone score transformations: the ranking is what matters.
Failure Mode
The AUC is a marginal-rank summary. It can be high while the precision at every useful operating point is low (under heavy class imbalance), and it can hide localized failures because it averages over all thresholds. It is also silent on calibration: the WMW probability certifies the ranking, not the scores themselves, so a well-ranked score can still be badly miscalibrated.
Cost-Weighted Operating Points
A trained classifier provides scores; the operating point is the threshold that turns scores into predictions. Choosing it depends on costs. Let $c_{\mathrm{FN}}$ be the cost of a false negative, $c_{\mathrm{FP}}$ the cost of a false positive, and $\pi = P(Y = 1)$ the class prior.
The expected loss at threshold $t$ is

$$L(t) = c_{\mathrm{FN}}\, \pi\, \bigl(1 - \mathrm{TPR}(t)\bigr) + c_{\mathrm{FP}}\, (1 - \pi)\, \mathrm{FPR}(t).$$

Geometrically, level sets of the loss are parallel lines in ROC space, called iso-cost lines, with slope

$$s = \frac{c_{\mathrm{FP}}\,(1 - \pi)}{c_{\mathrm{FN}}\, \pi}.$$

The Bayes-optimal operating point is the ROC point at which an iso-cost line of this slope is tangent to the ROC curve (or to its convex hull). When costs are symmetric and classes are balanced, $s = 1$ and the tangent point sits where the ROC curve bends most sharply.
The 0.5 threshold is universally optimal under 0-1 loss
The 0.5 threshold is optimal only when costs are symmetric and the classifier outputs calibrated posterior probabilities $P(Y = 1 \mid X = x)$. Asymmetric costs shift the threshold to

$$t^* = \frac{c_{\mathrm{FP}}}{c_{\mathrm{FP}} + c_{\mathrm{FN}}}$$

on the calibrated posterior; uncalibrated scores require a separate threshold sweep on validation data.
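Both routes fit in a few lines; a sketch with illustrative cost values (the function names are our own):

```python
import numpy as np

def bayes_threshold(c_fn, c_fp):
    """Optimal cutoff on a *calibrated* posterior P(Y=1 | x)."""
    return c_fp / (c_fp + c_fn)

def best_threshold_by_sweep(scores, labels, c_fn, c_fp):
    """Empirical cost minimization for uncalibrated scores, on validation data."""
    def cost(t):
        pred = scores >= t
        fn = np.sum((labels == 1) & ~pred)   # missed positives
        fp = np.sum((labels == 0) & pred)    # false alarms
        return c_fn * fn + c_fp * fp
    return min(np.unique(scores), key=cost)

# A false negative 10x as costly as a false positive pushes the cutoff down:
print(bayes_threshold(c_fn=10.0, c_fp=1.0))   # 1/11, roughly 0.09
```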
Convex Hull and Randomization
A ROC curve produced by a single classifier with one continuous score is not always concave. Suppose the curve dips below the chord joining two of its operating points $(\mathrm{FPR}_1, \mathrm{TPR}_1)$ and $(\mathrm{FPR}_2, \mathrm{TPR}_2)$. Any ROC point on that chord is achievable by randomizing between the two corresponding classifiers in the appropriate proportion.
Formally, with probability $\lambda \in [0, 1]$ use the classifier at $(\mathrm{FPR}_1, \mathrm{TPR}_1)$ and with probability $1 - \lambda$ use the classifier at $(\mathrm{FPR}_2, \mathrm{TPR}_2)$. The combined classifier achieves

$$(\mathrm{FPR}, \mathrm{TPR}) = \bigl(\lambda\, \mathrm{FPR}_1 + (1 - \lambda)\, \mathrm{FPR}_2,\ \lambda\, \mathrm{TPR}_1 + (1 - \lambda)\, \mathrm{TPR}_2\bigr).$$
This means the ROC convex hull is the achievable frontier, not the raw ROC curve. Any iso-cost optimal operating point lies on the hull, possibly via randomization between two deterministic operating points.
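Computing the hull from empirical ROC points is a standard upper-convex-hull problem; a monotone-chain sketch (one possible implementation, assuming the input includes the (0, 0) and (1, 1) endpoints):

```python
def roc_convex_hull(fpr, tpr):
    """Upper convex hull of ROC points: the achievable frontier."""
    hull = []
    for p in sorted(zip(fpr, tpr)):
        # Pop the last hull point while it lies on or below the chord
        # from the point before it to the new point p.
        while len(hull) >= 2:
            (x0, y0), (x1, y1) = hull[-2], hull[-1]
            if (x1 - x0) * (p[1] - y0) >= (y1 - y0) * (p[0] - x0):
                hull.pop()
            else:
                break
        hull.append(p)
    return hull

# The dip at (0.4, 0.5) falls below the chord from (0.2, 0.6) to (1, 1):
print(roc_convex_hull([0.0, 0.2, 0.4, 1.0], [0.0, 0.6, 0.5, 1.0]))
# [(0.0, 0.0), (0.2, 0.6), (1.0, 1.0)]
```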
Connection to the Neyman-Pearson Lemma
Detection theory reaches the same frontier from a different direction. The Neyman-Pearson lemma says the most powerful test of $H_0$ versus $H_1$ at significance level $\alpha$ is the likelihood-ratio test:

$$\text{reject } H_0 \iff \Lambda(x) = \frac{p(x \mid H_1)}{p(x \mid H_0)} > \eta,$$

with $\eta$ chosen so that $P(\Lambda(X) > \eta \mid H_0) = \alpha$. As $\alpha$ varies, the family of NP-optimal tests traces out a curve in $(\alpha, \text{power})$ space, and that curve is the ROC convex hull of the underlying problem.
So the ROC convex hull characterizes the best achievable trade-off between Type I and Type II error for the given class-conditional distributions, regardless of which classifier you use. Any classifier strictly inside the hull is dominated; any classifier on the hull is admissible.
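For the Gaussian mean-shift detector from the diagram, the NP frontier has a closed form; a sketch with an assumed unit-variance shift $\mu$ (scipy's `norm` supplies the Gaussian CDF and quantile):

```python
from scipy.stats import norm

def np_power(alpha, mu=1.0):
    """NP-optimal power for H0: N(0,1) vs H1: N(mu,1) at false-alarm rate alpha.

    The likelihood ratio is monotone in x, so the NP test thresholds x itself.
    """
    eta = norm.ppf(1.0 - alpha)        # threshold with Type I error alpha
    return 1.0 - norm.cdf(eta - mu)    # detection probability under H1

for a in (0.01, 0.05, 0.10):
    print(a, round(np_power(a), 3))
```

Because the likelihood ratio is continuous here, this ROC is already concave and coincides with its hull.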
Partial AUC
Many real applications care only about the low-FPR region of the ROC curve: medical screening cannot tolerate FPR > 0.10, fraud detection cannot tolerate FPR > 0.01. The partial AUC restricts the integration:

$$\mathrm{pAUC}(\alpha_0) = \int_0^{\alpha_0} \mathrm{TPR}\, d\mathrm{FPR}.$$

The standardized partial AUC divides by $\alpha_0$, the maximum achievable area on $[0, \alpha_0]$, giving a number in $[0, 1]$ comparable to ordinary AUC. Two classifiers with identical full AUC can have very different partial-AUC scores: one might dominate at low FPR while the other catches up at high FPR. When deployment constraints fix the operating regime, partial AUC is the honest summary.
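A sketch of the restricted integral, assuming (fpr, tpr) arrays sorted by increasing FPR that start at the origin; the TPR at the cutoff is linearly interpolated so the region ends exactly at $\alpha_0$:

```python
import numpy as np

def partial_auc(fpr, tpr, alpha0, standardize=True):
    """Area under the ROC restricted to FPR in [0, alpha0]."""
    fpr, tpr = np.asarray(fpr), np.asarray(tpr)
    tpr_cut = np.interp(alpha0, fpr, tpr)          # TPR at the FPR cutoff
    keep = fpr < alpha0
    f = np.concatenate([fpr[keep], [alpha0]])
    t = np.concatenate([tpr[keep], [tpr_cut]])
    area = np.trapz(t, f)
    return area / alpha0 if standardize else area

# On the ROC from the numeric example below, restricted to FPR <= 0.1:
fpr = [0.0, 0.0, 0.2, 0.2, 0.6, 1.0]
tpr = [0.0, 0.4, 0.6, 0.8, 1.0, 1.0]
print(partial_auc(fpr, tpr, alpha0=0.1))   # 0.45, despite full AUC = 0.86
```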
ROC vs Precision-Recall Under Imbalance
Under heavy class imbalance, ROC analysis can mislead. The mechanism: FPR normalizes by the (huge) negative class, while precision normalizes by predicted-positive count, which is dominated by false positives even when FPR is small.
A worked example illustrates the gap. Suppose prevalence is $\pi = 0.001$ (one positive per thousand examples) and the classifier achieves TPR = 0.95, FPR = 0.05 at some threshold. On a test set of 10,000 examples (10 positives, 9,990 negatives):
| Quantity | Value |
|---|---|
| True positives | 0.95 × 10 = 9.5 (expected) |
| False positives | 0.05 × 9,990 = 499.5 |
| Predicted positives | 9.5 + 499.5 = 509 |
| Recall (TPR) | 0.95 |
| Precision | 9.5 / 509 ≈ 0.019 |
The classifier looks excellent on the ROC curve but precision is below 2%. The ROC AUC can sit near 0.95 across the same threshold sweep while every useful operating point has miserable precision. Under heavy imbalance, the precision-recall view (see confusion matrices and classification metrics) is the honest summary.
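The arithmetic is worth scripting once, because the prevalence term is exactly where intuition slips (counts are the expected values from the table):

```python
# Expected confusion counts at TPR = 0.95, FPR = 0.05, prevalence 0.001
n = 10_000
n_pos = int(n * 0.001)    # 10 positives
n_neg = n - n_pos         # 9,990 negatives

tp = 0.95 * n_pos         # 9.5   expected true positives
fp = 0.05 * n_neg         # 499.5 expected false positives

precision = tp / (tp + fp)
print(f"precision = {precision:.4f}")   # ~0.0187: under 2% despite TPR = 0.95
```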
Numeric Example: Computing AUC by Trapezoidal Rule
A binary classifier produces scores on a small test set with 5 positives and 5 negatives. Sweeping the threshold yields the following ROC table:
| Threshold | TPR | FPR |
|---|---|---|
| 1.0 | 0.0 | 0.0 |
| 0.9 | 0.4 | 0.0 |
| 0.7 | 0.6 | 0.2 |
| 0.5 | 0.8 | 0.2 |
| 0.3 | 1.0 | 0.6 |
| 0.0 | 1.0 | 1.0 |
The trapezoidal-rule estimate of the area is

$$\mathrm{AUC} \approx 0.2 \cdot \frac{0.4 + 0.6}{2} + 0.4 \cdot \frac{0.8 + 1.0}{2} + 0.4 \cdot \frac{1.0 + 1.0}{2} = 0.10 + 0.36 + 0.40 = 0.86.$$

(The vertical segments at FPR = 0 and FPR = 0.2 have zero width and contribute nothing.) The same number is recovered by counting positive-negative pairs in which the positive scores higher. With 5 positives and 5 negatives there are 25 such pairs; the trapezoidal area equals the fraction of pairs the positive clearly wins, plus half the fraction left ambiguous by the diagonal segments: $21.5 / 25 = 0.86$. A finer threshold sweep that resolves every pair yields the exact tie-aware U-statistic value.
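The check is a single `np.trapz` call over the table's columns, ordered by increasing FPR:

```python
import numpy as np

fpr = np.array([0.0, 0.0, 0.2, 0.2, 0.6, 1.0])
tpr = np.array([0.0, 0.4, 0.6, 0.8, 1.0, 1.0])
print(np.trapz(tpr, fpr))   # 0.86
```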
Common Confusions
ROC and PR curves are interchangeable summaries
They diverge sharply under class imbalance. A model with ROC AUC = 0.95 can have PR AUC near 0.10 when prevalence is 0.001. Use ROC when classes are balanced or the operating regime is unknown; use PR when positives are rare and you care about precision at deployable recall levels.
High AUC means well-calibrated probabilities
AUC measures ranking only; it is invariant to monotone score transformations. A perfectly ranking model whose scores are squashed by a sigmoid becomes badly miscalibrated without changing AUC at all. Calibration must be measured separately, e.g. with reliability diagrams or proper scoring rules.
The ROC curve is always concave
The empirical ROC of a single classifier on a finite sample need not be concave. The achievable frontier is the convex hull, and intermediate points on the hull are reached by randomizing between adjacent deterministic operating points.
Exercises
Problem
A spam detector outputs scores on a test set with 4 spam and 4 non-spam examples. Sorted by score (highest first), the labels are: spam, spam, non-spam, spam, non-spam, non-spam, spam, non-spam. Compute AUC by counting the fraction of (spam, non-spam) pairs in which the spam scores higher.
Problem
Let two classifiers $A$ and $B$ have the same ROC AUC on a test set, but $A$ dominates on the FPR interval $[0, 0.1]$ while $B$ dominates on $(0.1, 1]$. Argue from the partial-AUC definition that for a screening application that allows FPR at most 0.1, classifier $A$ is strictly preferred regardless of the equal AUCs. State the implicit assumption your argument requires about the operational threshold.
References
Canonical:
- Fawcett, "An Introduction to ROC Analysis," Pattern Recognition Letters 27(8):861-874 (2006).
- Hanley and McNeil, "The Meaning and Use of the Area under a Receiver Operating Characteristic Curve," Radiology 143(1):29-36 (1982).
- Bamber, "The Area above the Ordinal Dominance Graph and the Area below the Receiver Operating Characteristic Graph," Journal of Mathematical Psychology 12(4):387-415 (1975).
- DeLong, DeLong, and Clarke-Pearson, "Comparing the Areas under Two or More Correlated Receiver Operating Characteristic Curves," Biometrics 44(3):837-845 (1988).
Current:
- Davis and Goadrich, "The Relationship between Precision-Recall and ROC Curves," ICML (2006).
- Saito and Rehmsmeier, "The Precision-Recall Plot Is More Informative than the ROC Plot When Evaluating Binary Classifiers on Imbalanced Datasets," PLOS ONE 10(3):e0118432 (2015).
- McClish, "Analyzing a Portion of the ROC Curve," Medical Decision Making 9(3):190-195 (1989). Partial-AUC reference.
- Murphy, Probabilistic Machine Learning: An Introduction (2022), Chapter 5.1 on Bayesian decision theory and Chapter 5.4 on classification metrics.
Frontier:
- Lobo, Jiménez-Valverde, and Real, "AUC: A Misleading Measure of the Performance of Predictive Distribution Models," Global Ecology and Biogeography 17(2):145-151 (2008). Critique of full-AUC under imbalance.
Next Topics
- Proper scoring rules: ranking is one axis of forecast quality; calibration is the other.
- Calibration and uncertainty: how to fix miscalibration without losing ranking quality.
- Confusion matrices and classification metrics: the rest of the binary-classification metric family, including precision-recall curves.
Last reviewed: May 3, 2026
Canonical graph
Required before and derived from this topic
These links come from prerequisite edges in the curriculum graph. Editorial suggestions are shown here only when the target page also cites this page as a prerequisite.
Required prerequisites
- Common Probability Distributions (layer 0A · tier 1)
- Confusion Matrices and Classification Metrics (layer 1 · tier 1)
Derived topics
No published topic currently declares this as a prerequisite.