Methodology
Confusion Matrices and Classification Metrics
The complete framework for evaluating binary classifiers: confusion matrix entries, precision, recall, F1, specificity, ROC curves, AUC, and precision-recall curves. When to use each metric.
Why This Matters
Every binary classifier you train must be evaluated, and the choice of evaluation metric determines what "good" means. Accuracy is the default metric, but it is misleading whenever classes are imbalanced (see class imbalance and resampling for mitigation strategies). Precision, recall, F1, ROC AUC, and precision-recall AUC each answer different questions. Choosing the wrong metric can lead you to deploy a model that fails on the task that actually matters.
The Confusion Matrix
Confusion Matrix
For a binary classifier with classes "positive" (P) and "negative" (N), the confusion matrix has four entries:
| | Predicted P | Predicted N |
|---|---|---|
| Actual P | True Positive (TP) | False Negative (FN) |
| Actual N | False Positive (FP) | True Negative (TN) |
Every test example falls into exactly one cell. All classification metrics are functions of these four counts.
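As a minimal sketch of how the four counts are tallied from paired labels and predictions (the function name and 0/1 label encoding are illustrative, not tied to any library):

```python
# Tally the four confusion-matrix cells from paired labels and predictions.
# Labels are 1 (positive) and 0 (negative).
from collections import Counter

def confusion_matrix(y_true, y_pred):
    cells = Counter(zip(y_true, y_pred))
    return {
        "TP": cells[(1, 1)],  # actual P, predicted P
        "FN": cells[(1, 0)],  # actual P, predicted N
        "FP": cells[(0, 1)],  # actual N, predicted P
        "TN": cells[(0, 0)],  # actual N, predicted N
    }

y_true = [1, 1, 0, 0, 0, 1]
y_pred = [1, 0, 0, 1, 0, 1]
print(confusion_matrix(y_true, y_pred))  # {'TP': 2, 'FN': 1, 'FP': 1, 'TN': 2}
```

Since every example lands in exactly one cell, the four counts always sum to the size of the test set.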
Core Metrics
Accuracy
The fraction of all predictions that are correct: $\mathrm{Accuracy} = (TP + TN)\,/\,(TP + FP + FN + TN)$.
Precision
Of all examples predicted positive, what fraction is truly positive? $\mathrm{Precision} = TP\,/\,(TP + FP)$. Also called positive predictive value. Answers: "when the model says positive, how often is it right?"
Recall
Of all truly positive examples, what fraction does the model identify? $\mathrm{Recall} = TP\,/\,(TP + FN)$. Also called sensitivity or true positive rate. Answers: "of all actual positives, how many does the model find?"
Specificity
Of all truly negative examples, what fraction does the model correctly classify as negative? $\mathrm{Specificity} = TN\,/\,(TN + FP)$. Also called true negative rate. Answers: "of all actual negatives, how many does the model correctly reject?"
False Positive Rate
The fraction of truly negative examples that the model incorrectly classifies as positive: $\mathrm{FPR} = FP\,/\,(FP + TN) = 1 - \mathrm{Specificity}$.
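The five definitions above can be computed directly from the four counts. A minimal sketch (returning 0.0 on an empty denominator is a common convention, not a universal standard):

```python
# Core rate metrics from the four confusion-matrix counts.
def rate_metrics(tp, fp, fn, tn):
    def safe_div(num, den):
        # Guard against empty denominators (e.g., no predicted positives).
        return num / den if den else 0.0
    return {
        "accuracy":    safe_div(tp + tn, tp + fp + fn + tn),
        "precision":   safe_div(tp, tp + fp),   # positive predictive value
        "recall":      safe_div(tp, tp + fn),   # sensitivity / TPR
        "specificity": safe_div(tn, tn + fp),   # true negative rate
        "fpr":         safe_div(fp, fp + tn),   # 1 - specificity
    }

m = rate_metrics(tp=18, fp=49, fn=2, tn=931)  # counts from the screening example
print(round(m["precision"], 3), round(m["recall"], 3))  # 0.269 0.9
```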
F1 Score
F1 Score
The harmonic mean of precision and recall.
F1 as Harmonic Mean
Statement
The F1 score equals the harmonic mean of precision $P$ and recall $R$:

$$F_1 = \frac{2PR}{P + R}$$

The harmonic mean satisfies $F_1 \le \sqrt{PR} \le \frac{P + R}{2}$, with equality when $P = R$. If either $P$ or $R$ is zero, $F_1 = 0$.
Intuition
The harmonic mean punishes imbalance between precision and recall. If precision is 1.0 and recall is 0.01, the arithmetic mean is 0.505 (looks fine), but the harmonic mean is 0.0198 (correctly indicates a bad classifier). You cannot get a high F1 without both precision and recall being high.
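The numbers in this example can be checked directly (the helper name `f1` is illustrative):

```python
# The arithmetic mean hides a collapsed recall; the harmonic mean (F1) does not.
def f1(precision, recall):
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

p, r = 1.0, 0.01
print((p + r) / 2)  # arithmetic mean: 0.505, looks fine
print(f1(p, r))     # harmonic mean: ~0.0198, exposes the bad classifier
```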
Proof Sketch
The inequality follows from the harmonic mean being at most the geometric mean, which is at most the arithmetic mean: $\frac{2PR}{P+R} \le \sqrt{PR} \le \frac{P+R}{2}$. Alternatively, $F_1 = \frac{2}{1/P + 1/R}$. When $P = R$: $1/P + 1/R = 2/P$, so $F_1 = P = R$ and all three means coincide. If $P = 0$ then $F_1 = 0$ directly from the formula, and trivially $F_1 = 0$ when $R = 0$ by a symmetric argument.
Why It Matters
F1 is the standard metric for imbalanced classification tasks where both false positives and false negatives matter. It is used in NER, information extraction, medical diagnosis, and fraud detection. The $F_\beta$ generalization lets you weight precision vs recall: $F_\beta = \frac{(1 + \beta^2)PR}{\beta^2 P + R}$, where $\beta > 1$ weights recall more heavily and $\beta < 1$ weights precision.
Failure Mode
F1 ignores true negatives entirely. For tasks where correctly identifying negatives matters (e.g., screening out safe emails), F1 is incomplete. Also, F1 treats precision and recall as equally important. If one matters more, use $F_\beta$ or optimize precision/recall directly.
ROC Curve and AUC
ROC Curve
The Receiver Operating Characteristic curve plots true positive rate (recall) on the y-axis against false positive rate on the x-axis, as the classification threshold varies from $+\infty$ to $-\infty$. Each threshold $t$ produces a point $(\mathrm{FPR}(t), \mathrm{TPR}(t))$.
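A minimal threshold sweep makes the construction concrete (the function name and toy scores are illustrative; production code would use a library routine):

```python
# Trace the ROC curve: each distinct score, taken as a threshold from
# high to low, yields one (FPR, TPR) point.
def roc_points(scores, labels):
    pos = sum(labels)            # number of actual positives (labels are 0/1)
    neg = len(labels) - pos      # number of actual negatives
    points = [(0.0, 0.0)]        # threshold above every score
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        points.append((fp / neg, tp / pos))  # (FPR, TPR)
    return points

scores = [0.9, 0.8, 0.7, 0.4, 0.3]
labels = [1,   1,   0,   1,   0]
print(roc_points(scores, labels))  # ends at (1.0, 1.0): everything predicted positive
```

Both coordinates are non-decreasing as the threshold falls, which is why the curve always runs from $(0, 0)$ to $(1, 1)$.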
AUC as Ranking Probability
Statement
The area under the ROC curve (AUC) equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative example:

$$\mathrm{AUC} = \Pr\big(s(X^+) > s(X^-)\big)$$

where $X^+$ is drawn from the positive class, $X^-$ from the negative class, and $s(\cdot)$ is the classifier's score.
Intuition
AUC measures ranking quality, not calibration. A model with AUC = 0.9 correctly ranks a random positive above a random negative 90% of the time. AUC = 0.5 is random guessing. AUC = 1.0 is perfect separation.
Proof Sketch
The ROC curve can be parameterized by the threshold $t$: $\mathrm{TPR}(t) = \Pr(s(X^+) > t)$ and $\mathrm{FPR}(t) = \Pr(s(X^-) > t)$. Integrating $\mathrm{TPR}$ with respect to $\mathrm{FPR}$ over all thresholds yields $\mathrm{AUC} = \int_0^1 \mathrm{TPR}\, d(\mathrm{FPR})$. By a change of variable argument (see Fawcett 2006), this equals $\Pr(s(X^+) > s(X^-))$, which is the normalized Wilcoxon-Mann-Whitney statistic.
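The identity can be checked numerically on toy data: the trapezoidal area under the swept ROC curve matches the fraction of (positive, negative) pairs ranked correctly (a sketch; ties are counted as 1/2, the Wilcoxon-Mann-Whitney convention):

```python
# AUC two ways: trapezoidal area under the ROC curve vs. pairwise ranking.
def roc_points(scores, labels):
    pos = sum(labels); neg = len(labels) - pos
    pts = [(0.0, 0.0)]
    for t in sorted(set(scores), reverse=True):
        tp = sum(y for s, y in zip(scores, labels) if s >= t)
        fp = sum(1 - y for s, y in zip(scores, labels) if s >= t)
        pts.append((fp / neg, tp / pos))
    return pts

def auc_trapezoid(points):
    # points: (FPR, TPR) pairs sorted by FPR
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

def auc_ranking(scores, labels):
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

scores = [0.9, 0.8, 0.7, 0.4, 0.3]
labels = [1, 1, 0, 1, 0]
print(auc_trapezoid(roc_points(scores, labels)))  # 5/6: one of six pairs misranked
print(auc_ranking(scores, labels))                # same value
```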
Why It Matters
AUC is threshold-independent: it evaluates the model's ability to rank, regardless of the specific threshold chosen for classification. This makes it useful for comparing models when the operating point has not been decided.
Failure Mode
AUC can be misleading for highly imbalanced datasets. A model that achieves very low precision at all recall levels can still have high AUC if it ranks most positives above most negatives. In such cases, the precision-recall curve is more informative because it focuses on the positive class.
Precision-Recall Curve
The precision-recall (PR) curve plots precision on the y-axis against recall on the x-axis as the threshold varies. Unlike ROC curves, PR curves are sensitive to class imbalance.
Key properties:
- A random classifier on a dataset with fraction $\pi$ of positives gives a horizontal line at precision $\pi$ (in expectation)
- The PR AUC (average precision) is a better summary than ROC AUC when the positive class is rare
- PR curves make poor classifiers visible: low precision at moderate recall stands out
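These properties can be illustrated with an uninformative scorer on an imbalanced toy set (a sketch; the function name and the 5% positive rate are illustrative):

```python
# PR points from a threshold sweep; random scores on a 5%-positive dataset
# collapse to precision = pi at full recall.
import random

def pr_points(scores, labels):
    pos = sum(labels)
    pts = []
    for t in sorted(set(scores), reverse=True):
        tp = sum(y for s, y in zip(scores, labels) if s >= t)
        fp = sum(1 - y for s, y in zip(scores, labels) if s >= t)
        pts.append((tp / pos, tp / (tp + fp)))  # (recall, precision)
    return pts

random.seed(0)
labels = [1] * 50 + [0] * 950               # pi = 0.05
scores = [random.random() for _ in labels]  # uninformative scores
recall, precision = pr_points(scores, labels)[-1]  # lowest threshold
print(recall, precision)  # 1.0 0.05 -- precision equals the base rate
```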
When to Use Each Metric
| Metric | Use when... |
|---|---|
| Accuracy | Classes are balanced and errors are equally costly |
| Precision | False positives are expensive (spam filtering, recommender systems) |
| Recall | False negatives are expensive (disease screening, fraud detection) |
| F1 | Both false positives and false negatives matter equally |
| ROC AUC | Comparing models across all thresholds with balanced classes |
| PR AUC | Evaluating on highly imbalanced datasets |
Common Confusions
High accuracy means a good model
A model predicting the majority class for every input achieves accuracy equal to the majority class proportion. On a dataset with 99% negatives, always predicting "negative" gives 99% accuracy with zero recall on the positive class.
ROC AUC is always the right summary metric
ROC AUC can look high even when precision is low at all useful recall levels. For imbalanced datasets, precision-recall AUC gives a more honest picture of model quality on the minority class.
Precision and recall are independent
They are not. For a fixed model, changing the classification threshold increases one at the expense of the other. A lower threshold catches more positives (higher recall) but also includes more false positives (lower precision). This tradeoff is the entire content of the PR curve.
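The tradeoff is easy to see by evaluating the same fixed scores at two thresholds (a sketch with illustrative toy scores):

```python
# Lowering the threshold on fixed scores raises recall and lowers precision.
def precision_recall_at(scores, labels, threshold):
    tp = sum(y for s, y in zip(scores, labels) if s >= threshold)
    fp = sum(1 - y for s, y in zip(scores, labels) if s >= threshold)
    fn = sum(y for s, y in zip(scores, labels) if s < threshold)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

scores = [0.95, 0.9, 0.6, 0.55, 0.5, 0.3]
labels = [1,    1,   0,   1,    0,   1]
print(precision_recall_at(scores, labels, 0.85))  # strict: (1.0, 0.5)
print(precision_recall_at(scores, labels, 0.40))  # lax:    (0.6, 0.75)
```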
Canonical Examples
Medical screening with 2% prevalence
1000 patients, 20 positive, 980 negative. A classifier has 90% recall and 95% specificity.
- TP = 18, FN = 2, FP = 49, TN = 931
- Precision = 18/67 = 26.9%
- Recall = 18/20 = 90%
- F1 = 2(0.269)(0.9)/(0.269 + 0.9) = 0.414
- Accuracy = 949/1000 = 94.9%
Accuracy looks good. Precision reveals that 73% of positive predictions are wrong.
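The worked numbers above can be reproduced from prevalence, recall, and specificity alone:

```python
# Derive the screening example's confusion matrix and metrics.
patients, prevalence = 1000, 0.02
recall, specificity = 0.90, 0.95

pos = round(patients * prevalence)   # 20 actual positives
neg = patients - pos                 # 980 actual negatives
tp = round(pos * recall)             # 18
fn = pos - tp                        # 2
tn = round(neg * specificity)        # 931
fp = neg - tn                        # 49
precision = tp / (tp + fp)
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / patients
print(tp, fn, fp, tn)                               # 18 2 49 931
print(round(precision, 3), round(f1, 3), accuracy)  # 0.269 0.414 0.949
```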
Exercises
Problem
A classifier produces the following confusion matrix on a test set of 500 examples: TP = 40, FP = 10, FN = 20, TN = 430. Compute accuracy, precision, recall, specificity, and F1.
Problem
Prove that for any classifier, if you negate all predictions (swap positive and negative), the ROC curve is reflected through the point $(\tfrac{1}{2}, \tfrac{1}{2})$. What happens to the AUC?
References
Canonical:
- Fawcett, "An Introduction to ROC Analysis", Pattern Recognition Letters (2006)
- Manning, Raghavan, Schütze, Introduction to Information Retrieval (2008), Chapter 8
Current:
- Saito & Rehmsmeier, "The Precision-Recall Plot Is More Informative than the ROC Plot", PLOS ONE (2015)
- Lipton, Elkan, Narayanaswamy, "Optimal Thresholding of Classifiers to Maximize F1 Measure", ECML PKDD (2014)
- Hastie, Tibshirani, Friedman, The Elements of Statistical Learning (2009), Chapters 7-8
- Shalev-Shwartz & Ben-David, Understanding Machine Learning (2014), Chapters 11-14
Next Topics
- Base-rate fallacy: why precision drops dramatically with rare positive classes
- Cross-validation theory: how to reliably estimate these metrics from finite data
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Common Probability Distributions (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)