
Methodology

Evaluation Metrics and Properties

The metrics that determine whether a model is good: accuracy, precision, recall, F1, AUC-ROC, log loss, RMSE, calibration, and proper scoring rules. Why choosing the right metric matters more than improving the wrong one.


Why This Matters

A model is only as good as the metric you use to evaluate it. If you optimize the wrong metric, you get a model that looks good on paper but fails in practice. A spam filter with 99% accuracy sounds great until you learn that 99% of emails are not spam and the filter just labels everything "not spam."

Choosing the right metric is a modeling decision. It encodes what you care about. Getting this wrong wastes all downstream effort.

Mental Model

Every metric answers a specific question about your model. Accuracy asks "how often is the model right?" Precision asks "when the model says positive, how often is it correct?" Recall asks "of all the true positives, how many does the model catch?" AUC asks "can the model rank positives above negatives?" Log loss asks "how well-calibrated are the predicted probabilities?"

Different questions matter for different tasks. A cancer screening test needs high recall (do not miss cancer). A content recommendation system needs high precision (do not recommend bad content). A weather forecast needs good calibration (when you say 70% chance of rain, it should rain 70% of the time).

Classification Metrics

Definition

Confusion Matrix

For binary classification with classes positive ($P$) and negative ($N$), the confusion matrix counts four outcomes:

  • True Positives (TP): model says positive, truth is positive
  • False Positives (FP): model says positive, truth is negative (Type I error)
  • False Negatives (FN): model says negative, truth is positive (Type II error)
  • True Negatives (TN): model says negative, truth is negative

All classification metrics are functions of these four counts.
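The four counts can be tallied directly from paired labels. A minimal sketch (the function name and toy data are illustrative, not from any particular library):

```python
def confusion_counts(y_true, y_pred):
    """Tally (TP, FP, FN, TN) for binary labels, with 1 = positive, 0 = negative."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn

y_true = [1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0]
print(confusion_counts(y_true, y_pred))  # (2, 1, 1, 2)
```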

Definition

Accuracy

\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}

The fraction of predictions that are correct. Simple, intuitive, and often misleading. When classes are imbalanced, accuracy is dominated by the majority class.

Definition

Precision and Recall

\text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN}

Precision measures the purity of positive predictions. Recall (also called sensitivity or true positive rate) measures the completeness of positive detection. There is an inherent tension: raising the classification threshold typically increases precision but can only decrease recall.
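The tension is easy to see on toy scores: sweeping the threshold trades recall for precision. A small sketch (the helper and data are illustrative):

```python
def precision_recall(y_true, scores, threshold):
    """Precision and recall when predicting positive for score >= threshold."""
    tp = sum(1 for t, s in zip(y_true, scores) if t == 1 and s >= threshold)
    fp = sum(1 for t, s in zip(y_true, scores) if t == 0 and s >= threshold)
    fn = sum(1 for t, s in zip(y_true, scores) if t == 1 and s < threshold)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.7, 0.4, 0.6, 0.3, 0.2]
print(precision_recall(y_true, scores, 0.35))  # loose threshold: (0.75, 1.0)
print(precision_recall(y_true, scores, 0.65))  # strict threshold: (1.0, 0.666...)
```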

Definition

F1 Score

F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} = \frac{2 \, TP}{2 \, TP + FP + FN}

The F1 score is the harmonic mean of precision and recall. It is zero if either precision or recall is zero, and it equals one only when both are perfect. The harmonic mean penalizes extreme imbalance between precision and recall more than the arithmetic mean would.

More generally, the $F_\beta$ score weights recall $\beta$ times more than precision: $F_\beta = (1 + \beta^2) \frac{P \cdot R}{\beta^2 P + R}$.

Threshold-Free Metrics

Definition

ROC Curve and AUC

The ROC curve (Receiver Operating Characteristic) plots the True Positive Rate (recall) against the False Positive Rate ($FPR = FP / (FP + TN)$) as the classification threshold varies from $+\infty$ to $-\infty$.

The AUC (Area Under the ROC Curve) summarizes discrimination ability in a single number between 0 and 1. AUC equals the probability that a randomly chosen positive example is scored higher than a randomly chosen negative example:

\text{AUC} = P(s(x^+) > s(x^-))

where $s$ is the model's score function. AUC = 0.5 means random guessing; AUC = 1.0 means perfect ranking.

AUC is threshold-independent and base-rate independent. It measures how well the model ranks examples, not how well it calibrates probabilities. A model with AUC = 0.95 might still produce terrible probability estimates.
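The probabilistic interpretation gives a direct (if quadratic-time) way to compute AUC: count correctly ranked positive-negative pairs. A sketch with toy scores (ties counted as half, a common convention):

```python
def auc_pairwise(y_true, scores):
    """AUC = P(score of random positive > score of random negative); ties count 0.5."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

y_true = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.6, 0.2]
print(auc_pairwise(y_true, scores))  # 3 of 4 pairs ranked correctly -> 0.75
```

Note that multiplying every score by 100 leaves AUC unchanged: it is a pure ranking measure, which is exactly why it says nothing about calibration.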

Probabilistic Metrics

Definition

Log Loss (Cross-Entropy)

For binary classification with predicted probability $\hat{p}_i$ for the positive class and true label $y_i \in \{0, 1\}$:

\text{Log Loss} = -\frac{1}{n} \sum_{i=1}^{n} \left[ y_i \log \hat{p}_i + (1 - y_i) \log(1 - \hat{p}_i) \right]

Log loss heavily penalizes confident wrong predictions. If the true label is 1 and you predict $\hat{p} = 0.01$, you incur a loss of $-\log(0.01) \approx 4.6$. If you predict $\hat{p} = 0.49$, the loss is only $-\log(0.49) \approx 0.71$.
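The asymmetry is easy to reproduce. A minimal sketch, with probabilities clipped away from 0 and 1 so the logarithm stays finite (the clipping constant is our choice):

```python
import math

def log_loss(y_true, p_hat, eps=1e-15):
    """Mean negative log-likelihood for binary labels and predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, p_hat):
        p = min(max(p, eps), 1 - eps)  # clip so log() never sees 0 or 1
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

# A confident wrong prediction dominates the average loss.
print(log_loss([1], [0.01]))  # ~4.61
print(log_loss([1], [0.49]))  # ~0.71
```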

Definition

Brier Score

\text{Brier} = \frac{1}{n} \sum_{i=1}^{n} (\hat{p}_i - y_i)^2

The Brier score is the mean squared error of predicted probabilities. It ranges from 0 (perfect) to 1 (worst). Unlike log loss, the Brier score does not blow up for confident wrong predictions: it is bounded.
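The boundedness is worth seeing next to log loss. A minimal sketch:

```python
def brier_score(y_true, p_hat):
    """Mean squared error of predicted probabilities against 0/1 outcomes."""
    return sum((p - y) ** 2 for y, p in zip(y_true, p_hat)) / len(y_true)

# Bounded even for a maximally confident wrong prediction (contrast with log loss).
print(brier_score([1], [0.0]))        # 1.0, the worst possible value
print(brier_score([1, 0], [0.9, 0.2]))  # (0.01 + 0.04) / 2 = 0.025
```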

Regression Metrics

Definition

RMSE and MAE

\text{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}, \qquad \text{MAE} = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|

RMSE penalizes large errors more heavily (quadratic penalty). MAE treats all errors linearly and is more robust to outliers. Minimizing RMSE yields the conditional mean; minimizing MAE yields the conditional median.
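A single outlier makes the difference concrete. A sketch with toy data (one target is an outlier the model misses):

```python
import math

def rmse(y, yhat):
    """Root mean squared error: quadratic penalty on each residual."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y, yhat)) / len(y))

def mae(y, yhat):
    """Mean absolute error: linear penalty on each residual."""
    return sum(abs(a - b) for a, b in zip(y, yhat)) / len(y)

y    = [1.0, 2.0, 3.0, 100.0]  # one outlier target
yhat = [1.0, 2.0, 3.0, 4.0]    # model misses the outlier by 96
print(mae(y, yhat))   # 24.0 -- the error spread linearly over 4 points
print(rmse(y, yhat))  # 48.0 -- the quadratic penalty amplifies the outlier
```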

Definition

R-squared

R^2 = 1 - \frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2}

$R^2$ measures the fraction of variance explained by the model. $R^2 = 1$ means the model predicts perfectly. $R^2 = 0$ means the model is no better than predicting the mean. $R^2$ can be negative if the model is worse than the mean.
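All three regimes show up on a three-point toy example. A minimal sketch:

```python
def r_squared(y, yhat):
    """1 - SS_res / SS_tot; negative when the model is worse than the mean baseline."""
    mean_y = sum(y) / len(y)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, yhat))
    ss_tot = sum((a - mean_y) ** 2 for a in y)
    return 1 - ss_res / ss_tot

y = [1.0, 2.0, 3.0]
print(r_squared(y, [1.0, 2.0, 3.0]))  #  1.0: perfect predictions
print(r_squared(y, [2.0, 2.0, 2.0]))  #  0.0: exactly the mean baseline
print(r_squared(y, [3.0, 2.0, 1.0]))  # -3.0: anti-correlated, worse than the mean
```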

Calibration

Definition

Calibration

A model is calibrated if, when it predicts probability $p$, the event actually occurs with frequency $p$. Formally, for all $p \in [0, 1]$:

P(Y = 1 \mid \hat{p}(X) = p) = p

Calibration is checked via a reliability diagram: bin predictions by predicted probability, plot the mean predicted probability versus the actual frequency in each bin. A perfectly calibrated model lies on the diagonal.
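The binning step behind a reliability diagram is a few lines. A sketch using equal-width bins (bin count and toy data are our choices; a calibrated model yields pairs near the diagonal, where mean predicted probability equals observed frequency):

```python
def reliability_bins(y_true, p_hat, n_bins=5):
    """Bin predictions by predicted probability; return (mean_predicted, observed
    frequency) per non-empty bin -- the points of a reliability diagram."""
    bins = [[] for _ in range(n_bins)]
    for y, p in zip(y_true, p_hat):
        idx = min(int(p * n_bins), n_bins - 1)  # clamp p == 1.0 into the top bin
        bins[idx].append((y, p))
    points = []
    for b in bins:
        if b:
            mean_p = sum(p for _, p in b) / len(b)
            freq = sum(y for y, _ in b) / len(b)
            points.append((round(mean_p, 3), round(freq, 3)))
    return points

y_true = [0, 0, 1, 0, 1, 1, 1, 1]
p_hat  = [0.1, 0.15, 0.1, 0.55, 0.6, 0.65, 0.9, 0.95]
print(reliability_bins(y_true, p_hat))
```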

A model can have excellent AUC (good ranking) but poor calibration (the predicted probabilities do not match actual frequencies). Calibration matters when you use predicted probabilities for decision-making: resource allocation, insurance pricing, medical risk assessment.

Proper Scoring Rules

Theorem

Proper Scoring Rules Encourage Honest Probabilities

Statement

A scoring rule $S(p, y)$ is proper if the expected score is maximized when the forecaster reports the true distribution $p^*$:

\mathbb{E}_{Y \sim p^*}[S(p^*, Y)] \geq \mathbb{E}_{Y \sim p^*}[S(q, Y)] \quad \text{for all } q

It is strictly proper if equality holds only when $q = p^*$. Both log loss (negative cross-entropy) and the Brier score are strictly proper scoring rules. Accuracy is not a proper scoring rule.

Intuition

A proper scoring rule means you cannot game the metric by reporting probabilities different from your true beliefs. If you think the probability of rain is 70%, reporting 70% maximizes your expected score under a proper rule. Reporting 90% to seem more confident (or 50% to hedge) makes your expected score worse.

Proof Sketch

For the Brier score, the expected score when the truth is $p^*$ and you report $q$ is $\mathbb{E}[(q - Y)^2] = q^2 - 2qp^* + p^* = (q - p^*)^2 + p^*(1 - p^*)$. This is minimized when $q = p^*$.

For log loss, $\mathbb{E}[-p^* \log q - (1 - p^*)\log(1 - q)]$ is the cross-entropy $H(p^*, q)$, which by Gibbs' inequality is minimized when $q = p^*$.
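The Brier identity can be checked numerically: scanning the expected score over a grid of reported probabilities, the minimum lands exactly at the true probability. A minimal sketch (grid resolution is our choice):

```python
def expected_brier(q, p_star):
    """E[(q - Y)^2] for Y ~ Bernoulli(p_star): p*(1-q)^2 + (1-p*)*q^2."""
    return p_star * (1 - q) ** 2 + (1 - p_star) * q ** 2

p_star = 0.7
grid = [i / 100 for i in range(101)]
best_q = min(grid, key=lambda q: expected_brier(q, p_star))
print(best_q)  # 0.7 -- honest reporting minimizes the expected score
```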

Why It Matters

Using a proper scoring rule means that the model's optimal strategy is to output true probabilities. If you train with log loss, the model is incentivized to produce calibrated probabilities. If you evaluate with accuracy, the model is incentivized to be overconfident: every prediction is pushed toward 0 or 1, regardless of the true uncertainty.

Failure Mode

Even strictly proper scoring rules do not guarantee calibration in finite samples. A model trained with log loss may still be miscalibrated due to model misspecification, limited data, or optimization issues. Post-hoc calibration (Platt scaling, isotonic regression) is often needed in practice.

Why Accuracy Is Often the Wrong Metric

Consider a medical screening test for a disease affecting 1% of the population. A model that always predicts "no disease" achieves 99% accuracy. Yet it catches zero cases. This model has perfect specificity and zero recall.
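The screening example reduces to arithmetic on the confusion counts. A minimal sketch of the always-negative classifier at 1% prevalence:

```python
# 1% prevalence: the always-"no disease" classifier predicts negative for everyone.
n_pos, n_neg = 10, 990
tp, fp, fn, tn = 0, 0, n_pos, n_neg  # every positive case is missed

accuracy = (tp + tn) / (tp + tn + fp + fn)
recall = tp / (tp + fn)
print(accuracy)  # 0.99 -- looks excellent
print(recall)    # 0.0  -- catches zero cases
```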

The problem is that accuracy weights all errors equally. In imbalanced settings, the majority class dominates. Better alternatives:

  • Precision/recall/F1 when you care about the minority class
  • AUC when you need threshold-independent discrimination
  • Log loss or Brier score when you need calibrated probabilities
  • Cost-sensitive metrics when false positives and false negatives have different real-world costs

How to Choose the Right Metric

  1. What decision does the model support? If it is a hard yes/no decision, use F1 or a cost-sensitive metric. If probabilities matter, use log loss or Brier score.
  2. Is the dataset balanced? If not, avoid accuracy. Use F1, AUC, or precision-recall AUC.
  3. Do you need ranking or calibration? AUC measures ranking. Log loss and Brier score measure calibration. These are different properties.
  4. What are the costs of errors? If false negatives are catastrophic (cancer screening), optimize recall at a fixed precision threshold. If false positives are costly (fraud flagging), optimize precision.

Common Confusions

Watch Out

High AUC does not mean good calibration

AUC measures whether positives are ranked above negatives. A model can achieve AUC = 0.99 while predicting probabilities that are systematically too high or too low. If you need probabilities (not just rankings), check the reliability diagram and Brier score separately.

Watch Out

F1 depends on the threshold but is often reported at a single threshold

F1 is defined for a specific classification threshold. Changing the threshold changes precision and recall, and therefore F1. When papers report "F1 = 0.85" they typically mean the F1 at the default threshold of 0.5. The optimal threshold for F1 is generally not 0.5. Always report the threshold or use the precision-recall curve.
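A quick threshold sweep makes the point. In this toy example (helper and data are ours), the F1-optimal threshold is well below the default 0.5:

```python
def f1_at_threshold(y_true, scores, threshold):
    """F1 = 2*TP / (2*TP + FP + FN) when predicting positive for score >= threshold."""
    tp = sum(1 for t, s in zip(y_true, scores) if t == 1 and s >= threshold)
    fp = sum(1 for t, s in zip(y_true, scores) if t == 0 and s >= threshold)
    fn = sum(1 for t, s in zip(y_true, scores) if t == 1 and s < threshold)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

y_true = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.8, 0.45, 0.4, 0.6, 0.3, 0.2, 0.1, 0.05]
for thr in (0.3, 0.5, 0.7):
    # best F1 here is at threshold 0.3, not the default 0.5
    print(thr, round(f1_at_threshold(y_true, scores, thr), 3))
```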

Watch Out

R-squared can be negative

$R^2 < 0$ means the model predicts worse than the constant-mean baseline. This can happen with a badly overfitted model evaluated on the test set, or when applying a model to a distribution different from training. Negative $R^2$ does not mean the math is broken. It means the model is worse than useless.

Summary

  • Accuracy is misleading for imbalanced datasets; prefer F1, AUC, or log loss
  • Precision and recall trade off; F1 balances them via the harmonic mean
  • AUC measures ranking ability, not calibration
  • Log loss and Brier score are strictly proper scoring rules: they incentivize honest probability estimates
  • RMSE penalizes large errors more than MAE; RMSE optimizes the mean, MAE optimizes the median
  • Calibration and discrimination are different properties; a model can be good at one and bad at the other
  • Always choose your metric based on the real-world decision the model supports

Exercises

ExerciseCore

Problem

A binary classifier on a dataset with 950 negatives and 50 positives predicts all examples as negative. Compute: accuracy, precision (define as 0 if undefined), recall, and F1.

ExerciseAdvanced

Problem

Prove that the Brier score is a strictly proper scoring rule. That is, show that for a Bernoulli outcome with true probability $p^*$, the expected Brier score $\mathbb{E}_{Y \sim \text{Bernoulli}(p^*)}[(q - Y)^2]$ is uniquely minimized at $q = p^*$.

ExerciseResearch

Problem

A model has AUC = 0.95 but its reliability diagram shows it is badly miscalibrated: it overestimates probabilities for rare events. Describe a post-hoc calibration method that fixes this without changing the ranking (preserving AUC). What assumptions does it require?

References

Canonical:

  • Hastie, Tibshirani, Friedman, Elements of Statistical Learning, Chapter 7
  • Gneiting & Raftery, "Strictly Proper Scoring Rules, Prediction, and Estimation" (2007)

Current:

  • Ferri, Hernandez-Orallo, Modroiu, "An Experimental Comparison of Performance Measures for Classification" (2009)
  • Niculescu-Mizil & Caruana, "Predicting Good Probabilities with Supervised Learning" (ICML 2005)
  • Shalev-Shwartz & Ben-David, Understanding Machine Learning (2014), Chapters 11-14
  • Murphy, Machine Learning: A Probabilistic Perspective (2012), Chapters 5-7


Last reviewed: April 2026
