
Methodology

Confusion Matrices and Classification Metrics

The complete framework for evaluating binary classifiers: confusion matrix entries, precision, recall, F1, specificity, ROC curves, AUC, and precision-recall curves. When to use each metric.

Core · Tier 1 · Stable · ~45 min

Why This Matters

Every binary classifier you train must be evaluated, and the choice of evaluation metric determines what "good" means. Accuracy is the default metric, but it is misleading whenever classes are imbalanced (see class imbalance and resampling for mitigation strategies). Precision, recall, F1, ROC AUC, and precision-recall AUC each answer different questions. Choosing the wrong metric can lead you to deploy a model that fails on the task that actually matters.

[Figure: confusion matrix for 1000 examples with TP = 85, FN = 15, FP = 10, TN = 890, annotated with precision = TP/(TP+FP) = 0.895, recall = TP/(TP+FN) = 0.850, F1 = 2PR/(P+R) = 0.872, and accuracy = (TP+TN)/N = 0.975. Accuracy is 97.5%, but only 85% of actual positives are caught (recall). Precision and recall tell the real story.]

The Confusion Matrix

Definition

Confusion Matrix

For a binary classifier with classes "positive" (P) and "negative" (N), the confusion matrix has four entries:

|          | Predicted P         | Predicted N         |
|----------|---------------------|---------------------|
| Actual P | True Positive (TP)  | False Negative (FN) |
| Actual N | False Positive (FP) | True Negative (TN)  |

Every test example falls into exactly one cell. All classification metrics are functions of these four counts.
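Tallying the four cells directly makes the definitions concrete. A minimal sketch in plain Python (the function name and the label encoding, 1 for positive and 0 for negative, are illustrative choices, not from the text):

```python
# Tally the four confusion-matrix cells from true and predicted
# binary labels (1 = positive, 0 = negative).
def confusion_counts(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fn, fp, tn

# 3 actual positives, 5 actual negatives; every example lands in exactly one cell.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
print(confusion_counts(y_true, y_pred))  # (2, 1, 1, 4)
```

Note that the four counts sum to the number of test examples, as the definition requires.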

Core Metrics

Definition

Accuracy

\text{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN}

The fraction of all predictions that are correct.

Definition

Precision

\text{Precision} = \frac{TP}{TP + FP}

Of all examples predicted positive, what fraction is truly positive? Also called positive predictive value. Answers: "when the model says positive, how often is it right?"

Definition

Recall

\text{Recall} = \frac{TP}{TP + FN}

Of all truly positive examples, what fraction does the model identify? Also called sensitivity or true positive rate. Answers: "of all actual positives, how many does the model find?"

Definition

Specificity

\text{Specificity} = \frac{TN}{TN + FP}

Of all truly negative examples, what fraction does the model correctly classify as negative? Also called true negative rate. Answers: "of all actual negatives, how many does the model correctly reject?"

Definition

False Positive Rate

\text{FPR} = \frac{FP}{FP + TN} = 1 - \text{Specificity}

The fraction of truly negative examples that the model incorrectly classifies as positive.
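All five definitions above reduce to arithmetic on the four counts. A sketch, evaluated on the counts from the running example (TP = 85, FN = 15, FP = 10, TN = 890); the function name is an illustrative choice:

```python
# Core metrics as functions of the four confusion-matrix counts.
def core_metrics(tp, fn, fp, tn):
    return {
        "accuracy": (tp + tn) / (tp + fn + fp + tn),
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "fpr": fp / (fp + tn),  # equals 1 - specificity
    }

m = core_metrics(tp=85, fn=15, fp=10, tn=890)
print(round(m["accuracy"], 3))     # 0.975
print(round(m["precision"], 3))    # 0.895
print(round(m["recall"], 3))       # 0.85
print(round(m["specificity"], 3))  # 0.989
```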

F1 Score

Definition

F1 Score

F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} = \frac{2TP}{2TP + FP + FN}

The harmonic mean of precision and recall.

Proposition

F1 as Harmonic Mean

Statement

The F1 score equals the harmonic mean of precision $P$ and recall $R$:

F_1 = \frac{2PR}{P + R}

The harmonic mean satisfies $\min(P, R) \leq F_1 \leq \max(P, R)$, with equality throughout when $P = R$. If either $P$ or $R$ is zero, $F_1 = 0$.

Intuition

The harmonic mean punishes imbalance between precision and recall. If precision is 1.0 and recall is 0.01, the arithmetic mean is 0.505 (looks fine), but the harmonic mean is 0.0198 (correctly indicates a bad classifier). You cannot get a high F1 without both precision and recall being high.
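The arithmetic from the intuition above, worked in a few lines (the P and R values are the ones used in the text):

```python
# Harmonic vs arithmetic mean for a badly imbalanced precision/recall pair.
P, R = 1.0, 0.01
arithmetic = (P + R) / 2    # 0.505  -- hides the collapse in recall
f1 = 2 * P * R / (P + R)    # ~0.0198 -- exposes it
print(round(arithmetic, 4), round(f1, 4))  # 0.505 0.0198
```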

Proof Sketch

Assume without loss of generality that $R \leq P$ (both positive). Since $P + R \geq 2R$, we get $F_1 = 2PR/(P+R) \leq 2PR/(2R) = P = \max(P, R)$; since $P + R \leq 2P$, we get $F_1 = 2PR/(P+R) \geq 2PR/(2P) = R = \min(P, R)$. Both bounds are tight exactly when $P = R$, and $F_1 = 0$ when either quantity is zero.

Why It Matters

F1 is the standard metric for imbalanced classification tasks where both false positives and false negatives matter. It is used in NER, information extraction, medical diagnosis, and fraud detection. The $F_\beta$ generalization lets you weight precision vs recall: $F_\beta = (1 + \beta^2) PR / (\beta^2 P + R)$.
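The effect of $\beta$ is easy to see numerically. A sketch with illustrative values (low precision, perfect recall), where $\beta = 2$ rewards recall and $\beta = 0.5$ rewards precision:

```python
# F-beta: beta > 1 weights recall more heavily, beta < 1 weights precision.
def f_beta(p, r, beta):
    return (1 + beta ** 2) * p * r / (beta ** 2 * p + r)

p, r = 0.5, 1.0  # weak precision, perfect recall
print(round(f_beta(p, r, 0.5), 3))  # 0.556 -- penalizes the weak precision
print(round(f_beta(p, r, 1.0), 3))  # 0.667 -- plain F1
print(round(f_beta(p, r, 2.0), 3))  # 0.833 -- rewards the strong recall
```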

Failure Mode

F1 ignores true negatives entirely. For tasks where correctly identifying negatives matters (e.g., screening out safe emails), F1 is incomplete. Also, F1 treats precision and recall as equally important. If one matters more, use FβF_\beta or optimize precision/recall directly.

ROC Curve and AUC

Definition

ROC Curve

The Receiver Operating Characteristic curve plots true positive rate (recall) on the y-axis against false positive rate on the x-axis, as the classification threshold varies from $+\infty$ to $-\infty$. Each threshold produces a point $(FPR, TPR)$.

Theorem

AUC as Ranking Probability

Statement

The area under the ROC curve (AUC) equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative example:

\text{AUC} = P(s(x^+) > s(x^-))

where $x^+$ is drawn from the positive class and $x^-$ from the negative class.

Intuition

AUC measures ranking quality, not calibration. A model with AUC = 0.9 correctly ranks a random positive above a random negative 90% of the time. AUC = 0.5 is random guessing. AUC = 1.0 is perfect separation.

Proof Sketch

The ROC curve can be written as $TPR(t) = P(s(x) > t \mid x \in P)$ and $FPR(t) = P(s(x) > t \mid x \in N)$. Integrating $TPR$ with respect to $FPR$ over all thresholds yields $\int_0^1 TPR \, dFPR$. By a change-of-variables argument (see Fawcett 2006), this equals $P(s(x^+) > s(x^-))$, which is the Wilcoxon-Mann-Whitney statistic.
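The theorem can be checked directly by counting pairs, which is exactly the Wilcoxon-Mann-Whitney computation. A sketch with illustrative scores (ties counted as half, the standard convention):

```python
# AUC as the probability that a random positive outscores a random negative.
def auc_rank(pos_scores, neg_scores):
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos_scores for n in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))

pos = [0.9, 0.8, 0.4]  # scores given to positive examples
neg = [0.7, 0.3, 0.2]  # scores given to negative examples
print(auc_rank(pos, neg))  # 8 of 9 pairs ranked correctly: 0.888...
```

Perfect separation gives 1.0, and identical score distributions give 0.5, matching the intuition above.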

Why It Matters

AUC is threshold-independent: it evaluates the model's ability to rank, regardless of the specific threshold chosen for classification. This makes it useful for comparing models when the operating point has not been decided.

Failure Mode

AUC can be misleading for highly imbalanced datasets. A model that achieves very low precision at all recall levels can still have high AUC if it ranks most positives above most negatives. In such cases, the precision-recall curve is more informative because it focuses on the positive class.

Precision-Recall Curve

The precision-recall (PR) curve plots precision on the y-axis against recall on the x-axis as the threshold varies. Unlike ROC curves, PR curves are sensitive to class imbalance.

Key properties:

  • A random classifier on a dataset with fraction $\pi$ positives gives a horizontal line at precision $= \pi$
  • The PR AUC (average precision) is a better summary than ROC AUC when the positive class is rare
  • PR curves make poor classifiers visible: low precision at moderate recall stands out
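Tracing the PR curve amounts to sweeping the threshold down through the sorted scores. A minimal sketch, assuming distinct scores so that each example contributes one curve point (the function name and data are illustrative):

```python
# Precision-recall points, sweeping the threshold from high to low.
def pr_points(y_true, scores):
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    total_pos = sum(y_true)
    tp = fp = 0
    points = []
    for i in order:
        if y_true[i] == 1:
            tp += 1
        else:
            fp += 1
        points.append((tp / total_pos, tp / (tp + fp)))  # (recall, precision)
    return points

# Two positives among four examples; recall rises stepwise,
# while precision drops each time a false positive is admitted.
print(pr_points([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.1]))
```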

When to Use Each Metric

| Metric    | Use when...                                                          |
|-----------|----------------------------------------------------------------------|
| Accuracy  | Classes are balanced and errors are equally costly                   |
| Precision | False positives are expensive (spam filtering, recommender systems)  |
| Recall    | False negatives are expensive (disease screening, fraud detection)   |
| F1        | Both false positives and false negatives matter equally              |
| ROC AUC   | Comparing models across all thresholds with balanced classes         |
| PR AUC    | Evaluating on highly imbalanced datasets                             |

Common Confusions

Watch Out

High accuracy means a good model

A model predicting the majority class for every input achieves accuracy equal to the majority class proportion. On a dataset with 99% negatives, always predicting "negative" gives 99% accuracy with zero recall on the positive class.
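This trap is easy to reproduce. A sketch with 99% negatives (the synthetic labels are illustrative):

```python
# A majority-class "classifier" on a 99%-negative dataset.
y_true = [0] * 99 + [1]
y_pred = [0] * 100  # always predict negative

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
recall = tp / (tp + fn)

print(accuracy)  # 0.99 -- looks excellent
print(recall)    # 0.0  -- catches no positives at all
```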

Watch Out

ROC AUC is always the right summary metric

ROC AUC can look high even when precision is low at all useful recall levels. For imbalanced datasets, precision-recall AUC gives a more honest picture of model quality on the minority class.

Watch Out

Precision and recall are independent

They are not. For a fixed model, changing the classification threshold increases one at the expense of the other. A lower threshold catches more positives (higher recall) but also includes more false positives (lower precision). This tradeoff is the entire content of the PR curve.
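The tradeoff shows up immediately when sweeping a threshold over a fixed set of scores (the scores and labels below are illustrative):

```python
# Lowering the threshold raises recall but typically lowers precision.
scores = [0.95, 0.90, 0.60, 0.55, 0.30]
y_true = [1, 1, 0, 1, 0]

for thresh in [0.8, 0.5, 0.2]:
    y_pred = [1 if s >= thresh else 0 for s in scores]
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    print(thresh, tp / (tp + fp), tp / (tp + fn))
# threshold 0.8: precision 1.00, recall 0.67
# threshold 0.5: precision 0.75, recall 1.00
# threshold 0.2: precision 0.60, recall 1.00
```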

Canonical Examples

Example

Medical screening with 2% prevalence

1000 patients, 20 positive, 980 negative. A classifier has 90% recall and 95% specificity.

  • TP = 18, FN = 2, FP = 49, TN = 931
  • Precision = 18/67 = 26.9%
  • Recall = 18/20 = 90%
  • F1 = 2(0.269)(0.9)/(0.269 + 0.9) = 0.414
  • Accuracy = 949/1000 = 94.9%

Accuracy looks good. Precision reveals that 73% of positive predictions are wrong.
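The example's counts can be reconstructed from prevalence, recall, and specificity alone, which is a useful sanity check:

```python
# Reconstruct the screening example: 1000 patients, 2% prevalence,
# 90% recall, 95% specificity.
n_pos, n_neg = 20, 980
tp = round(0.90 * n_pos)   # 18
fn = n_pos - tp            # 2
tn = round(0.95 * n_neg)   # 931
fp = n_neg - tn            # 49

precision = tp / (tp + fp)
accuracy = (tp + tn) / (n_pos + n_neg)
print(round(precision, 3), round(accuracy, 3))  # 0.269 0.949
```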

Exercises

Exercise (Core)

Problem

A classifier produces the following confusion matrix on a test set of 500 examples: TP = 40, FP = 10, FN = 20, TN = 430. Compute accuracy, precision, recall, specificity, and F1.

Exercise (Advanced)

Problem

Prove that for any classifier, if you negate all predictions (swap positive and negative), the ROC curve is reflected through the point $(0.5, 0.5)$. What happens to the AUC?

References

Canonical:

  • Fawcett, "An Introduction to ROC Analysis", Pattern Recognition Letters (2006)
  • Manning, Raghavan, Schütze, Introduction to Information Retrieval (2008), Chapter 8

Current:

  • Saito & Rehmsmeier, "The Precision-Recall Plot Is More Informative than the ROC Plot", PLOS ONE (2015)

  • Lipton, Elkan, Naryanaswamy, "Optimal Thresholding of Classifiers to Maximize F1 Measure", ECML PKDD (2014)

  • Hastie, Tibshirani, Friedman, The Elements of Statistical Learning (2009), Chapters 7-8

  • Shalev-Shwartz & Ben-David, Understanding Machine Learning (2014), Chapters 11-14

Next Topics

  • Base-rate fallacy: why precision drops dramatically with rare positive classes
  • Cross-validation theory: how to reliably estimate these metrics from finite data

Last reviewed: April 2026
