
Methodology

Confusion Matrices and Classification Metrics

The complete framework for evaluating binary classifiers: confusion matrix entries, precision, recall, F1, specificity, ROC curves, AUC, and precision-recall curves. When to use each metric.

Core · Tier 1 · Stable · ~45 min

Why This Matters

Every binary classifier you train must be evaluated, and the choice of evaluation metric determines what "good" means. Accuracy is the default metric, but it is misleading whenever classes are imbalanced (see class imbalance and resampling for mitigation strategies). Precision, recall, F1, ROC AUC, and precision-recall AUC each answer different questions. Choosing the wrong metric can lead you to deploy a model that fails on the task that actually matters.

[Figure: confusion matrix for 1000 examples with TP = 85, FN = 15, FP = 10, TN = 890, annotated with precision = TP/(TP+FP) = 0.895, recall = TP/(TP+FN) = 0.850, F1 = 2PR/(P+R) = 0.872, and accuracy = (TP+TN)/N = 0.975. Accuracy is 97.5%, but only 85% of actual positives are caught (recall). Precision and recall tell the real story.]

The Confusion Matrix

Definition

Confusion Matrix

For a binary classifier with classes "positive" (P) and "negative" (N), the confusion matrix has four entries:

|          | Predicted P         | Predicted N         |
|----------|---------------------|---------------------|
| Actual P | True Positive (TP)  | False Negative (FN) |
| Actual N | False Positive (FP) | True Negative (TN)  |

Every test example falls into exactly one cell. All classification metrics are functions of these four counts.
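Tallying the four cells directly makes the definitions concrete. A minimal sketch in plain Python (the function name and the label encoding, 1 for positive and 0 for negative, are illustrative choices, not from the text):

```python
# Tally the four confusion-matrix cells from true and predicted
# binary labels (1 = positive, 0 = negative).
def confusion_counts(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fn, fp, tn

# 3 actual positives, 5 actual negatives; every example lands in exactly one cell.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
print(confusion_counts(y_true, y_pred))  # (2, 1, 1, 4)
```

Note that the four counts sum to the number of test examples, as the definition requires.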

Core Metrics

Definition

Accuracy

\text{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN}

The fraction of all predictions that are correct.

Definition

Precision

\text{Precision} = \frac{TP}{TP + FP}

Of all examples predicted positive, what fraction is truly positive? Also called positive predictive value. Answers: "when the model says positive, how often is it right?"

Definition

Recall

\text{Recall} = \frac{TP}{TP + FN}

Of all truly positive examples, what fraction does the model identify? Also called sensitivity or true positive rate. Answers: "of all actual positives, how many does the model find?"

Definition

Specificity

\text{Specificity} = \frac{TN}{TN + FP}

Of all truly negative examples, what fraction does the model correctly classify as negative? Also called true negative rate. Answers: "of all actual negatives, how many does the model correctly reject?"

Definition

False Positive Rate

\text{FPR} = \frac{FP}{FP + TN} = 1 - \text{Specificity}

The fraction of truly negative examples that the model incorrectly classifies as positive.
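All five definitions above reduce to arithmetic on the four counts. A sketch, evaluated on the counts from the running example (TP = 85, FN = 15, FP = 10, TN = 890); the function name is an illustrative choice:

```python
# Core metrics as functions of the four confusion-matrix counts.
def core_metrics(tp, fn, fp, tn):
    return {
        "accuracy": (tp + tn) / (tp + fn + fp + tn),
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "fpr": fp / (fp + tn),  # equals 1 - specificity
    }

m = core_metrics(tp=85, fn=15, fp=10, tn=890)
print(round(m["accuracy"], 3))     # 0.975
print(round(m["precision"], 3))    # 0.895
print(round(m["recall"], 3))       # 0.85
print(round(m["specificity"], 3))  # 0.989
```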

F1 Score

Definition

F1 Score

F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} = \frac{2TP}{2TP + FP + FN}

The harmonic mean of precision and recall.

Proposition

F1 as Harmonic Mean

Statement

The F1 score equals the harmonic mean of precision $P$ and recall $R$:

F_1 = \frac{2PR}{P + R}

The harmonic mean satisfies $\min(P, R) \leq F_1 \leq \max(P, R)$, with equality throughout when $P = R$. If either $P$ or $R$ is zero, $F_1 = 0$.

Intuition

The harmonic mean punishes imbalance between precision and recall. If precision is 1.0 and recall is 0.01, the arithmetic mean is 0.505 (looks fine), but the harmonic mean is 0.0198 (correctly indicates a bad classifier). You cannot get a high F1 without both precision and recall being high.
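The arithmetic from the intuition above, worked in a few lines (the P and R values are the ones used in the text):

```python
# Harmonic vs arithmetic mean for a badly imbalanced precision/recall pair.
P, R = 1.0, 0.01
arithmetic = (P + R) / 2    # 0.505  -- hides the collapse in recall
f1 = 2 * P * R / (P + R)    # ~0.0198 -- exposes it
print(round(arithmetic, 4), round(f1, 4))  # 0.505 0.0198
```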

Proof Sketch

Assume without loss of generality that $R \leq P$ (both positive). Since $P + R \geq 2R$, we get $F_1 = 2PR/(P+R) \leq 2PR/(2R) = P = \max(P, R)$; since $P + R \leq 2P$, we get $F_1 = 2PR/(P+R) \geq 2PR/(2P) = R = \min(P, R)$. Both bounds are tight exactly when $P = R$, and $F_1 = 0$ when either quantity is zero.

Why It Matters

F1 is the standard metric for imbalanced classification tasks where both false positives and false negatives matter. It is used in NER, information extraction, medical diagnosis, and fraud detection. The $F_\beta$ generalization lets you weight precision vs recall: $F_\beta = (1 + \beta^2) PR / (\beta^2 P + R)$.
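The effect of $\beta$ is easy to see numerically. A sketch with illustrative values (low precision, perfect recall), where $\beta = 2$ rewards recall and $\beta = 0.5$ rewards precision:

```python
# F-beta: beta > 1 weights recall more heavily, beta < 1 weights precision.
def f_beta(p, r, beta):
    return (1 + beta ** 2) * p * r / (beta ** 2 * p + r)

p, r = 0.5, 1.0  # weak precision, perfect recall
print(round(f_beta(p, r, 0.5), 3))  # 0.556 -- penalizes the weak precision
print(round(f_beta(p, r, 1.0), 3))  # 0.667 -- plain F1
print(round(f_beta(p, r, 2.0), 3))  # 0.833 -- rewards the strong recall
```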

Failure Mode

F1 ignores true negatives entirely. For tasks where correctly identifying negatives matters (e.g., screening out safe emails), F1 is incomplete. Also, F1 treats precision and recall as equally important. If one matters more, use FβF_\beta or optimize precision/recall directly.

ROC Curve and AUC

Definition

ROC Curve

The Receiver Operating Characteristic curve plots true positive rate (recall) on the y-axis against false positive rate on the x-axis, as the classification threshold varies from $+\infty$ to $-\infty$. Each threshold produces a point $(FPR, TPR)$.

Theorem

AUC as Ranking Probability

Statement

The area under the ROC curve (AUC) equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative example:

\text{AUC} = P(s(x^+) > s(x^-))

where $x^+$ is drawn from the positive class and $x^-$ from the negative class.

Intuition

AUC measures ranking quality, not calibration. A model with AUC = 0.9 correctly ranks a random positive above a random negative 90% of the time. AUC = 0.5 is random guessing. AUC = 1.0 is perfect separation.

Proof Sketch

The ROC curve can be written as $TPR(t) = P(s(x) > t \mid x \in P)$ and $FPR(t) = P(s(x) > t \mid x \in N)$. Integrating $TPR$ with respect to $FPR$ over all thresholds yields $\int_0^1 TPR \, dFPR$. By a change-of-variables argument (see Fawcett 2006), this equals $P(s(x^+) > s(x^-))$, which is the Wilcoxon-Mann-Whitney statistic.
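The theorem can be checked directly by counting pairs, which is exactly the Wilcoxon-Mann-Whitney computation. A sketch with illustrative scores (ties counted as half, the standard convention):

```python
# AUC as the probability that a random positive outscores a random negative.
def auc_rank(pos_scores, neg_scores):
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos_scores for n in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))

pos = [0.9, 0.8, 0.4]  # scores given to positive examples
neg = [0.7, 0.3, 0.2]  # scores given to negative examples
print(auc_rank(pos, neg))  # 8 of 9 pairs ranked correctly: 0.888...
```

Perfect separation gives 1.0, and identical score distributions give 0.5, matching the intuition above.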

Why It Matters

AUC is threshold-independent: it evaluates the model's ability to rank, regardless of the specific threshold chosen for classification. This makes it useful for comparing models when the operating point has not been decided.

Failure Mode

AUC can be misleading for highly imbalanced datasets. A model that achieves very low precision at all recall levels can still have high AUC if it ranks most positives above most negatives. In such cases, the precision-recall curve is more informative because it focuses on the positive class.

Precision-Recall Curve

The precision-recall (PR) curve plots precision on the y-axis against recall on the x-axis as the threshold varies. Unlike ROC curves, PR curves are sensitive to class imbalance.

Key properties:

  • A random classifier on a dataset with fraction $\pi$ positives gives a horizontal line at precision $= \pi$
  • The PR AUC (average precision) is a better summary than ROC AUC when the positive class is rare
  • PR curves make poor classifiers visible: low precision at moderate recall stands out
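Tracing the PR curve amounts to sweeping the threshold down through the sorted scores. A minimal sketch, assuming distinct scores so that each example contributes one curve point (the function name and data are illustrative):

```python
# Precision-recall points, sweeping the threshold from high to low.
def pr_points(y_true, scores):
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    total_pos = sum(y_true)
    tp = fp = 0
    points = []
    for i in order:
        if y_true[i] == 1:
            tp += 1
        else:
            fp += 1
        points.append((tp / total_pos, tp / (tp + fp)))  # (recall, precision)
    return points

# Two positives among four examples; recall rises stepwise,
# while precision drops each time a false positive is admitted.
print(pr_points([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.1]))
```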

When to Use Each Metric

| Metric    | Use when...                                                          |
|-----------|----------------------------------------------------------------------|
| Accuracy  | Classes are balanced and errors are equally costly                   |
| Precision | False positives are expensive (spam filtering, recommender systems)  |
| Recall    | False negatives are expensive (disease screening, fraud detection)   |
| F1        | Both false positives and false negatives matter equally              |
| ROC AUC   | Comparing models across all thresholds with balanced classes         |
| PR AUC    | Evaluating on highly imbalanced datasets                             |

Common Confusions

Watch Out

High accuracy means a good model

A model predicting the majority class for every input achieves accuracy equal to the majority class proportion. On a dataset with 99% negatives, always predicting "negative" gives 99% accuracy with zero recall on the positive class.
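This trap is easy to reproduce. A sketch with 99% negatives (the synthetic labels are illustrative):

```python
# A majority-class "classifier" on a 99%-negative dataset.
y_true = [0] * 99 + [1]
y_pred = [0] * 100  # always predict negative

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
recall = tp / (tp + fn)

print(accuracy)  # 0.99 -- looks excellent
print(recall)    # 0.0  -- catches no positives at all
```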

Watch Out

ROC AUC is always the right summary metric

ROC AUC can look high even when precision is low at all useful recall levels. For imbalanced datasets, precision-recall AUC gives a more honest picture of model quality on the minority class.

Watch Out

Precision and recall are independent

They are not. For a fixed model, changing the classification threshold increases one at the expense of the other. A lower threshold catches more positives (higher recall) but also includes more false positives (lower precision). This tradeoff is the entire content of the PR curve.
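The tradeoff shows up immediately when sweeping a threshold over a fixed set of scores (the scores and labels below are illustrative):

```python
# Lowering the threshold raises recall but typically lowers precision.
scores = [0.95, 0.90, 0.60, 0.55, 0.30]
y_true = [1, 1, 0, 1, 0]

for thresh in [0.8, 0.5, 0.2]:
    y_pred = [1 if s >= thresh else 0 for s in scores]
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    print(thresh, tp / (tp + fp), tp / (tp + fn))
# threshold 0.8: precision 1.00, recall 0.67
# threshold 0.5: precision 0.75, recall 1.00
# threshold 0.2: precision 0.60, recall 1.00
```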

Canonical Examples

Example

Medical screening with 2% prevalence

1000 patients, 20 positive, 980 negative. A classifier has 90% recall and 95% specificity.

  • TP = 18, FN = 2, FP = 49, TN = 931
  • Precision = 18/67 = 26.9%
  • Recall = 18/20 = 90%
  • F1 = 2(0.269)(0.9)/(0.269 + 0.9) = 0.414
  • Accuracy = 949/1000 = 94.9%

Accuracy looks good. Precision reveals that 73% of positive predictions are wrong.
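The example's counts can be reconstructed from prevalence, recall, and specificity alone, which is a useful sanity check:

```python
# Reconstruct the screening example: 1000 patients, 2% prevalence,
# 90% recall, 95% specificity.
n_pos, n_neg = 20, 980
tp = round(0.90 * n_pos)   # 18
fn = n_pos - tp            # 2
tn = round(0.95 * n_neg)   # 931
fp = n_neg - tn            # 49

precision = tp / (tp + fp)
accuracy = (tp + tn) / (n_pos + n_neg)
print(round(precision, 3), round(accuracy, 3))  # 0.269 0.949
```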

Exercises

Exercise (Core)

Problem

A classifier produces the following confusion matrix on a test set of 500 examples: TP = 40, FP = 10, FN = 20, TN = 430. Compute accuracy, precision, recall, specificity, and F1.

Exercise (Advanced)

Problem

Prove that for any classifier, if you negate all predictions (swap positive and negative), the ROC curve is reflected through the point $(0.5, 0.5)$. What happens to the AUC?

References

Canonical:

  • Fawcett, "An Introduction to ROC Analysis", Pattern Recognition Letters (2006)
  • Manning, Raghavan, Schütze, Introduction to Information Retrieval (2008), Chapter 8

Current:

  • Saito & Rehmsmeier, "The Precision-Recall Plot Is More Informative than the ROC Plot", PLOS ONE (2015)

  • Lipton, Elkan, Naryanaswamy, "Optimal Thresholding of Classifiers to Maximize F1 Measure", ECML PKDD (2014)

  • Hastie, Tibshirani, Friedman, The Elements of Statistical Learning (2009), Chapters 7-8

  • Shalev-Shwartz & Ben-David, Understanding Machine Learning (2014), Chapters 11-14

Next Topics

  • Base-rate fallacy: why precision drops dramatically with rare positive classes
  • Cross-validation theory: how to reliably estimate these metrics from finite data

Last reviewed: April 2026
