
Methodology

Evaluation Metrics and Properties

The metrics that determine whether a model is good: accuracy, precision, recall, F1, AUC-ROC, log loss, RMSE, calibration, and proper scoring rules. Why choosing the right metric matters more than improving the wrong one.


Why This Matters

A model is only as good as the metric you use to evaluate it. If you optimize the wrong metric, you get a model that looks good on paper but fails in practice. A spam filter with 99% accuracy sounds great until you learn that 99% of emails are not spam and the filter just labels everything "not spam."

Choosing the right metric is a modeling decision. It encodes what you care about. Getting this wrong wastes all downstream effort.

Mental Model

Every metric answers a specific question about your model. Accuracy asks "how often is the model right?" Precision asks "when the model says positive, how often is it correct?" Recall asks "of all the true positives, how many does the model catch?" AUC asks "can the model rank positives above negatives?" Log loss asks "how well-calibrated are the predicted probabilities?"

Different questions matter for different tasks. A cancer screening test needs high recall (do not miss cancer). A content recommendation system needs high precision (do not recommend bad content). A weather forecast needs good calibration (when you say 70% chance of rain, it should rain 70% of the time).

Classification Metrics

Definition

Confusion Matrix

For binary classification with classes positive ($P$) and negative ($N$), the confusion matrix counts four outcomes:

  • True Positives (TP): model says positive, truth is positive
  • False Positives (FP): model says positive, truth is negative (Type I error)
  • False Negatives (FN): model says negative, truth is positive (Type II error)
  • True Negatives (TN): model says negative, truth is negative

All classification metrics are functions of these four counts.
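The four counts can be tallied directly from paired labels. A minimal sketch (the function name and toy data are illustrative, not from any particular library):

```python
def confusion_counts(y_true, y_pred):
    """Tally (TP, FP, FN, TN) for binary labels, with 1 = positive, 0 = negative."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn

y_true = [1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0]
print(confusion_counts(y_true, y_pred))  # (2, 1, 1, 2)
```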

Definition

Accuracy

\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}

The fraction of predictions that are correct. Simple, intuitive, and often misleading. When classes are imbalanced, accuracy is dominated by the majority class.

Definition

Precision and Recall

\text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN}

Precision measures the purity of positive predictions. Recall (also called sensitivity or true positive rate) measures the completeness of positive detection. There is an inherent tension: raising the classification threshold typically increases precision but can only decrease recall.
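The tension is easy to see on toy scores: sweeping the threshold trades recall for precision. A small sketch (the helper and data are illustrative):

```python
def precision_recall(y_true, scores, threshold):
    """Precision and recall when predicting positive for score >= threshold."""
    tp = sum(1 for t, s in zip(y_true, scores) if t == 1 and s >= threshold)
    fp = sum(1 for t, s in zip(y_true, scores) if t == 0 and s >= threshold)
    fn = sum(1 for t, s in zip(y_true, scores) if t == 1 and s < threshold)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.7, 0.4, 0.6, 0.3, 0.2]
print(precision_recall(y_true, scores, 0.35))  # loose threshold: (0.75, 1.0)
print(precision_recall(y_true, scores, 0.65))  # strict threshold: (1.0, 0.666...)
```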

Definition

F1 Score

F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} = \frac{2 \, TP}{2 \, TP + FP + FN}

The F1 score is the harmonic mean of precision and recall. It is zero if either precision or recall is zero, and it equals one only when both are perfect. The harmonic mean penalizes extreme imbalance between precision and recall more than the arithmetic mean would.

More generally, the $F_\beta$ score weights recall $\beta$ times more than precision: $F_\beta = (1 + \beta^2) \frac{P \cdot R}{\beta^2 P + R}$.

Threshold-Free Metrics

Definition

ROC Curve and AUC

The ROC curve (Receiver Operating Characteristic) plots the True Positive Rate (recall) against the False Positive Rate ($FPR = FP / (FP + TN)$) as the classification threshold varies from $+\infty$ to $-\infty$.

The AUC (Area Under the ROC Curve) summarizes discrimination ability in a single number between 0 and 1. AUC equals the probability that a randomly chosen positive example is scored higher than a randomly chosen negative example:

\text{AUC} = P(s(x^+) > s(x^-))

where $s$ is the model's score function. AUC = 0.5 means random guessing; AUC = 1.0 means perfect ranking.

AUC is threshold-independent and base-rate independent. It measures how well the model ranks examples, not how well it calibrates probabilities. A model with AUC = 0.95 might still produce terrible probability estimates.
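The probabilistic interpretation gives a direct (if quadratic-time) way to compute AUC: count correctly ranked positive-negative pairs. A sketch with toy scores (ties counted as half, a common convention):

```python
def auc_pairwise(y_true, scores):
    """AUC = P(score of random positive > score of random negative); ties count 0.5."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

y_true = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.6, 0.2]
print(auc_pairwise(y_true, scores))  # 3 of 4 pairs ranked correctly -> 0.75
```

Note that multiplying every score by 100 leaves AUC unchanged: it is a pure ranking measure, which is exactly why it says nothing about calibration.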

Probabilistic Metrics

Definition

Log Loss (Cross-Entropy)

For binary classification with predicted probability $\hat{p}_i$ for the positive class and true label $y_i \in \{0, 1\}$:

\text{Log Loss} = -\frac{1}{n} \sum_{i=1}^{n} \left[ y_i \log \hat{p}_i + (1 - y_i) \log(1 - \hat{p}_i) \right]

Log loss heavily penalizes confident wrong predictions. If the true label is 1 and you predict $\hat{p} = 0.01$, you incur a loss of $-\log(0.01) \approx 4.6$. If you predict $\hat{p} = 0.49$, the loss is only $-\log(0.49) \approx 0.71$.
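The asymmetry is easy to reproduce. A minimal sketch, with probabilities clipped away from 0 and 1 so the logarithm stays finite (the clipping constant is our choice):

```python
import math

def log_loss(y_true, p_hat, eps=1e-15):
    """Mean negative log-likelihood for binary labels and predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, p_hat):
        p = min(max(p, eps), 1 - eps)  # clip so log() never sees 0 or 1
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

# A confident wrong prediction dominates the average loss.
print(log_loss([1], [0.01]))  # ~4.61
print(log_loss([1], [0.49]))  # ~0.71
```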

Definition

Brier Score

\text{Brier} = \frac{1}{n} \sum_{i=1}^{n} (\hat{p}_i - y_i)^2

The Brier score is the mean squared error of predicted probabilities. It ranges from 0 (perfect) to 1 (worst). Unlike log loss, the Brier score does not blow up for confident wrong predictions: it is bounded.
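The boundedness is worth seeing next to log loss. A minimal sketch:

```python
def brier_score(y_true, p_hat):
    """Mean squared error of predicted probabilities against 0/1 outcomes."""
    return sum((p - y) ** 2 for y, p in zip(y_true, p_hat)) / len(y_true)

# Bounded even for a maximally confident wrong prediction (contrast with log loss).
print(brier_score([1], [0.0]))        # 1.0, the worst possible value
print(brier_score([1, 0], [0.9, 0.2]))  # (0.01 + 0.04) / 2 = 0.025
```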

Regression Metrics

Definition

RMSE and MAE

\text{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}, \qquad \text{MAE} = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|

RMSE penalizes large errors more heavily (quadratic penalty). MAE treats all errors linearly and is more robust to outliers. Minimizing RMSE yields the conditional mean; minimizing MAE yields the conditional median.
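A single outlier makes the difference concrete. A sketch with toy data (one target is an outlier the model misses):

```python
import math

def rmse(y, yhat):
    """Root mean squared error: quadratic penalty on each residual."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y, yhat)) / len(y))

def mae(y, yhat):
    """Mean absolute error: linear penalty on each residual."""
    return sum(abs(a - b) for a, b in zip(y, yhat)) / len(y)

y    = [1.0, 2.0, 3.0, 100.0]  # one outlier target
yhat = [1.0, 2.0, 3.0, 4.0]    # model misses the outlier by 96
print(mae(y, yhat))   # 24.0 -- the error spread linearly over 4 points
print(rmse(y, yhat))  # 48.0 -- the quadratic penalty amplifies the outlier
```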

Definition

R-squared

R^2 = 1 - \frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2}

$R^2$ measures the fraction of variance explained by the model. $R^2 = 1$ means the model predicts perfectly. $R^2 = 0$ means the model is no better than predicting the mean. $R^2$ can be negative if the model is worse than the mean.
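All three regimes show up on a three-point toy example. A minimal sketch:

```python
def r_squared(y, yhat):
    """1 - SS_res / SS_tot; negative when the model is worse than the mean baseline."""
    mean_y = sum(y) / len(y)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, yhat))
    ss_tot = sum((a - mean_y) ** 2 for a in y)
    return 1 - ss_res / ss_tot

y = [1.0, 2.0, 3.0]
print(r_squared(y, [1.0, 2.0, 3.0]))  #  1.0: perfect predictions
print(r_squared(y, [2.0, 2.0, 2.0]))  #  0.0: exactly the mean baseline
print(r_squared(y, [3.0, 2.0, 1.0]))  # -3.0: anti-correlated, worse than the mean
```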

Calibration

Definition

Calibration

A model is calibrated if, when it predicts probability $p$, the event actually occurs with frequency $p$. Formally, for all $p \in [0, 1]$:

P(Y = 1 \mid \hat{p}(X) = p) = p

Calibration is checked via a reliability diagram: bin predictions by predicted probability, plot the mean predicted probability versus the actual frequency in each bin. A perfectly calibrated model lies on the diagonal.
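The binning step behind a reliability diagram is a few lines. A sketch using equal-width bins (bin count and toy data are our choices; a calibrated model yields pairs near the diagonal, where mean predicted probability equals observed frequency):

```python
def reliability_bins(y_true, p_hat, n_bins=5):
    """Bin predictions by predicted probability; return (mean_predicted, observed
    frequency) per non-empty bin -- the points of a reliability diagram."""
    bins = [[] for _ in range(n_bins)]
    for y, p in zip(y_true, p_hat):
        idx = min(int(p * n_bins), n_bins - 1)  # clamp p == 1.0 into the top bin
        bins[idx].append((y, p))
    points = []
    for b in bins:
        if b:
            mean_p = sum(p for _, p in b) / len(b)
            freq = sum(y for y, _ in b) / len(b)
            points.append((round(mean_p, 3), round(freq, 3)))
    return points

y_true = [0, 0, 1, 0, 1, 1, 1, 1]
p_hat  = [0.1, 0.15, 0.1, 0.55, 0.6, 0.65, 0.9, 0.95]
print(reliability_bins(y_true, p_hat))
```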

A model can have excellent AUC (good ranking) but poor calibration (the predicted probabilities do not match actual frequencies). Calibration matters when you use predicted probabilities for decision-making: resource allocation, insurance pricing, medical risk assessment.

Proper Scoring Rules

Theorem

Proper Scoring Rules Encourage Honest Probabilities

Statement

A scoring rule $S(p, y)$ is proper if the expected score is maximized when the forecaster reports the true distribution $p^*$:

\mathbb{E}_{Y \sim p^*}[S(p^*, Y)] \geq \mathbb{E}_{Y \sim p^*}[S(q, Y)] \quad \text{for all } q

It is strictly proper if equality holds only when $q = p^*$. Both log loss (negative cross-entropy) and the Brier score are strictly proper scoring rules. Accuracy is not a proper scoring rule.

Intuition

A proper scoring rule means you cannot game the metric by reporting probabilities different from your true beliefs. If you think the probability of rain is 70%, reporting 70% maximizes your expected score under a proper rule. Reporting 90% to seem more confident (or 50% to hedge) makes your expected score worse.

Proof Sketch

For the Brier score, the expected score when the truth is $p^*$ and you report $q$ is $\mathbb{E}[(q - Y)^2] = q^2 - 2qp^* + p^* = (q - p^*)^2 + p^*(1 - p^*)$. This is minimized when $q = p^*$.

For log loss, $\mathbb{E}[-p^* \log q - (1 - p^*)\log(1 - q)]$ is the cross-entropy $H(p^*, q)$, which by Gibbs' inequality is minimized when $q = p^*$.
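The Brier identity can be checked numerically: scanning the expected score over a grid of reported probabilities, the minimum lands exactly at the true probability. A minimal sketch (grid resolution is our choice):

```python
def expected_brier(q, p_star):
    """E[(q - Y)^2] for Y ~ Bernoulli(p_star): p*(1-q)^2 + (1-p*)*q^2."""
    return p_star * (1 - q) ** 2 + (1 - p_star) * q ** 2

p_star = 0.7
grid = [i / 100 for i in range(101)]
best_q = min(grid, key=lambda q: expected_brier(q, p_star))
print(best_q)  # 0.7 -- honest reporting minimizes the expected score
```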

Why It Matters

Using a proper scoring rule means that the model's optimal strategy is to output true probabilities. If you train with log loss, the model is incentivized to produce calibrated probabilities. If you evaluate with accuracy, the model is incentivized to be overconfident: every prediction is pushed toward 0 or 1, regardless of the true uncertainty.

Failure Mode

Even strictly proper scoring rules do not guarantee calibration in finite samples. A model trained with log loss may still be miscalibrated due to model misspecification, limited data, or optimization issues. Post-hoc calibration (Platt scaling, isotonic regression) is often needed in practice.

Why Accuracy Is Often the Wrong Metric

Consider a medical screening test for a disease affecting 1% of the population. A model that always predicts "no disease" achieves 99% accuracy. Yet it catches zero cases. This model has perfect specificity and zero recall.
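The screening example reduces to arithmetic on the confusion counts. A minimal sketch of the always-negative classifier at 1% prevalence:

```python
# 1% prevalence: the always-"no disease" classifier predicts negative for everyone.
n_pos, n_neg = 10, 990
tp, fp, fn, tn = 0, 0, n_pos, n_neg  # every positive case is missed

accuracy = (tp + tn) / (tp + tn + fp + fn)
recall = tp / (tp + fn)
print(accuracy)  # 0.99 -- looks excellent
print(recall)    # 0.0  -- catches zero cases
```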

The problem is that accuracy weights all errors equally. In imbalanced settings, the majority class dominates. Better alternatives:

  • Precision/recall/F1 when you care about the minority class
  • AUC when you need threshold-independent discrimination
  • Log loss or Brier score when you need calibrated probabilities
  • Cost-sensitive metrics when false positives and false negatives have different real-world costs

How to Choose the Right Metric

  1. What decision does the model support? If it is a hard yes/no decision, use F1 or a cost-sensitive metric. If probabilities matter, use log loss or Brier score.
  2. Is the dataset balanced? If not, avoid accuracy. Use F1, AUC, or precision-recall AUC.
  3. Do you need ranking or calibration? AUC measures ranking. Log loss and Brier score measure calibration. These are different properties.
  4. What are the costs of errors? If false negatives are catastrophic (cancer screening), optimize recall at a fixed precision threshold. If false positives are costly (fraud flagging), optimize precision.

Common Confusions

Watch Out

High AUC does not mean good calibration

AUC measures whether positives are ranked above negatives. A model can achieve AUC = 0.99 while predicting probabilities that are systematically too high or too low. If you need probabilities (not just rankings), check the reliability diagram and Brier score separately.

Watch Out

F1 depends on the threshold but is often reported at a single threshold

F1 is defined for a specific classification threshold. Changing the threshold changes precision and recall, and therefore F1. When papers report "F1 = 0.85" they typically mean the F1 at the default threshold of 0.5. The optimal threshold for F1 is generally not 0.5. Always report the threshold or use the precision-recall curve.
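A quick threshold sweep makes the point. In this toy example (helper and data are ours), the F1-optimal threshold is well below the default 0.5:

```python
def f1_at_threshold(y_true, scores, threshold):
    """F1 = 2*TP / (2*TP + FP + FN) when predicting positive for score >= threshold."""
    tp = sum(1 for t, s in zip(y_true, scores) if t == 1 and s >= threshold)
    fp = sum(1 for t, s in zip(y_true, scores) if t == 0 and s >= threshold)
    fn = sum(1 for t, s in zip(y_true, scores) if t == 1 and s < threshold)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

y_true = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.8, 0.45, 0.4, 0.6, 0.3, 0.2, 0.1, 0.05]
for thr in (0.3, 0.5, 0.7):
    # best F1 here is at threshold 0.3, not the default 0.5
    print(thr, round(f1_at_threshold(y_true, scores, thr), 3))
```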

Watch Out

R-squared can be negative

$R^2 < 0$ means the model predicts worse than the constant-mean baseline. This can happen with a badly overfitted model evaluated on the test set, or when applying a model to a distribution different from training. Negative $R^2$ does not mean the math is broken. It means the model is worse than useless.

Summary

  • Accuracy is misleading for imbalanced datasets; prefer F1, AUC, or log loss
  • Precision and recall trade off; F1 balances them via the harmonic mean
  • AUC measures ranking ability, not calibration
  • Log loss and Brier score are strictly proper scoring rules: they incentivize honest probability estimates
  • RMSE penalizes large errors more than MAE; RMSE optimizes the mean, MAE optimizes the median
  • Calibration and discrimination are different properties; a model can be good at one and bad at the other
  • Always choose your metric based on the real-world decision the model supports

Exercises

ExerciseCore

Problem

A binary classifier on a dataset with 950 negatives and 50 positives predicts all examples as negative. Compute: accuracy, precision (define as 0 if undefined), recall, and F1.

ExerciseAdvanced

Problem

Prove that the Brier score is a strictly proper scoring rule. That is, show that for a Bernoulli outcome with true probability $p^*$, the expected Brier score $\mathbb{E}_{Y \sim \text{Bernoulli}(p^*)}[(q - Y)^2]$ is uniquely minimized at $q = p^*$.

ExerciseResearch

Problem

A model has AUC = 0.95 but its reliability diagram shows it is badly miscalibrated: it overestimates probabilities for rare events. Describe a post-hoc calibration method that fixes this without changing the ranking (preserving AUC). What assumptions does it require?

References

Canonical:

  • Hastie, Tibshirani, Friedman, Elements of Statistical Learning, Chapter 7
  • Gneiting & Raftery, "Strictly Proper Scoring Rules, Prediction, and Estimation" (2007)

Current:

  • Ferri, Hernandez-Orallo, Modroiu, "An Experimental Comparison of Performance Measures for Classification" (2009)
  • Niculescu-Mizil & Caruana, "Predicting Good Probabilities with Supervised Learning" (ICML 2005)
  • Shalev-Shwartz & Ben-David, Understanding Machine Learning (2014), Chapters 11-14
  • Murphy, Machine Learning: A Probabilistic Perspective (2012), Chapters 5-7


Last reviewed: April 2026
