Methodology
Evaluation Metrics and Properties
The metrics that determine whether a model is good: accuracy, precision, recall, F1, AUC-ROC, log loss, RMSE, calibration, and proper scoring rules. Why choosing the right metric matters more than improving the wrong one.
Why This Matters
A model is only as good as the metric you use to evaluate it. If you optimize the wrong metric, you get a model that looks good on paper but fails in practice. A spam filter with 99% accuracy sounds great until you learn that 99% of emails are not spam and the filter just labels everything "not spam."
Choosing the right metric is a modeling decision. It encodes what you care about. Getting this wrong wastes all downstream effort.
Mental Model
Every metric answers a specific question about your model. Accuracy asks "how often is the model right?" Precision asks "when the model says positive, how often is it correct?" Recall asks "of all the true positives, how many does the model catch?" AUC asks "can the model rank positives above negatives?" Log loss asks "how well-calibrated are the predicted probabilities?"
Different questions matter for different tasks. A cancer screening test needs high recall (do not miss cancer). A content recommendation system needs high precision (do not recommend bad content). A weather forecast needs good calibration (when you say 70% chance of rain, it should rain 70% of the time).
Classification Metrics
Confusion Matrix
For binary classification with classes positive (y = 1) and negative (y = 0), the confusion matrix counts four outcomes:
- True Positives (TP): model says positive, truth is positive
- False Positives (FP): model says positive, truth is negative (Type I error)
- False Negatives (FN): model says negative, truth is positive (Type II error)
- True Negatives (TN): model says negative, truth is negative
All classification metrics are functions of these four counts.
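A minimal sketch of tallying the four counts, assuming labels are encoded as 0/1 (the function name is illustrative):

```python
# Tally the four confusion-matrix counts for binary 0/1 labels.
def confusion_counts(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn

y_true = [1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0]
print(confusion_counts(y_true, y_pred))  # (2, 1, 1, 2)
```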
Accuracy
The fraction of predictions that are correct: Accuracy = (TP + TN) / (TP + FP + FN + TN). Simple, intuitive, and often misleading. When classes are imbalanced, accuracy is dominated by the majority class.
Precision and Recall
Precision = TP / (TP + FP) measures the purity of positive predictions. Recall = TP / (TP + FN) (also called sensitivity or true positive rate) measures the completeness of positive detection. There is an inherent tension: increasing the threshold for predicting positive raises precision but lowers recall.
F1 Score
The F1 score is the harmonic mean of precision and recall. It is zero if either precision or recall is zero, and it equals one only when both are perfect. The harmonic mean penalizes extreme imbalance between precision and recall more than the arithmetic mean would.
More generally, the F_β score weights recall β times as much as precision: F_β = (1 + β²) · Precision · Recall / (β² · Precision + Recall).
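A small sketch computing precision, recall, and F_β directly from the confusion-matrix counts (the function name and example counts are illustrative):

```python
# Precision, recall, and F-beta from confusion-matrix counts.
# beta > 1 weights recall more; beta < 1 weights precision more.
def precision_recall_fbeta(tp, fp, fn, beta=1.0):
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision == 0.0 and recall == 0.0:
        return precision, recall, 0.0
    b2 = beta * beta
    fbeta = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, fbeta

# Example: TP=8, FP=2, FN=4 gives precision 0.8 and recall 2/3.
p, r, f1 = precision_recall_fbeta(8, 2, 4)
_, _, f2 = precision_recall_fbeta(8, 2, 4, beta=2.0)
print(p, r, f1, f2)
```

Because recall is lower than precision here, the recall-weighted F2 score comes out below F1.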
Threshold-Free Metrics
ROC Curve and AUC
The ROC curve (Receiver Operating Characteristic) plots the True Positive Rate (recall) against the False Positive Rate FPR = FP / (FP + TN) as the classification threshold varies from 0 to 1.
The AUC (Area Under the ROC Curve) summarizes discrimination ability in a single number between 0 and 1. AUC equals the probability that a randomly chosen positive example is scored higher than a randomly chosen negative example:
AUC = P(s(X⁺) > s(X⁻))
where s is the model's score function. AUC = 0.5 means random guessing; AUC = 1.0 means perfect ranking.
AUC is threshold-independent and base-rate independent. It measures how well the model ranks examples, not how well it calibrates probabilities. A model with AUC = 0.95 might still produce terrible probability estimates.
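The probabilistic definition can be computed directly by comparing every (positive, negative) pair; a sketch, counting ties as half a win (function name and data are illustrative):

```python
# AUC from its definition: the fraction of (positive, negative) pairs
# that the model ranks correctly, with ties counted as 0.5.
def auc_by_ranking(scores, labels):
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if sp > sn else 0.5 if sp == sn else 0.0
               for sp in pos for sn in neg)
    return wins / (len(pos) * len(neg))

scores = [0.9, 0.8, 0.4, 0.3, 0.2]
labels = [1, 1, 0, 1, 0]
print(auc_by_ranking(scores, labels))  # 5 of 6 pairs ranked correctly
```

This O(n²) pairwise form is fine for illustration; production implementations sort once and use ranks instead.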
Probabilistic Metrics
Log Loss (Cross-Entropy)
For binary classification with predicted probability p for the positive class and true label y ∈ {0, 1}:
LogLoss = −[y log p + (1 − y) log(1 − p)]
Log loss heavily penalizes confident wrong predictions. If the true label is 1 and you predict p = 0.01, you incur a loss of −log(0.01) ≈ 4.6. If you predict p = 0.9, the loss is only −log(0.9) ≈ 0.1.
Brier Score
The Brier score is the mean squared error of predicted probabilities: BS = (1/N) Σᵢ (pᵢ − yᵢ)². It ranges from 0 (perfect) to 1 (worst). Unlike log loss, the Brier score does not blow up for confident wrong predictions: it is bounded.
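A sketch of both probabilistic metrics side by side, showing how a single confidently wrong prediction dominates log loss but stays bounded under the Brier score (function names are illustrative):

```python
import math

# Average negative log-likelihood; probabilities are clipped away from
# 0 and 1 so log never receives zero.
def log_loss(y_true, p_pred, eps=1e-15):
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

# Mean squared error of predicted probabilities; bounded in [0, 1].
def brier_score(y_true, p_pred):
    return sum((p - y) ** 2 for y, p in zip(y_true, p_pred)) / len(y_true)

y = [1, 0, 1]
p = [0.9, 0.1, 0.01]      # the last prediction is confidently wrong
print(log_loss(y, p))      # dominated by -log(0.01) ≈ 4.6
print(brier_score(y, p))   # the worst single term is (0.01 - 1)^2 ≈ 0.98
```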
Regression Metrics
RMSE and MAE
RMSE penalizes large errors more heavily (quadratic penalty). MAE treats all errors linearly and is more robust to outliers. Minimizing RMSE yields the conditional mean; minimizing MAE yields the conditional median.
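A short sketch contrasting the two penalties on data with one large outlier error (values are illustrative):

```python
import math

# Root mean squared error: quadratic penalty on each residual.
def rmse(y_true, y_pred):
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

# Mean absolute error: linear penalty, more robust to outliers.
def mae(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# One large error (100 off) dominates RMSE far more than MAE.
y_true = [1.0, 2.0, 3.0, 100.0]
y_pred = [1.0, 2.0, 3.0, 0.0]
print(rmse(y_true, y_pred))  # 50.0
print(mae(y_true, y_pred))   # 25.0
```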
R-squared
R² measures the fraction of variance explained by the model: R² = 1 − SS_res / SS_tot. R² = 1 means the model predicts perfectly. R² = 0 means the model is no better than predicting the mean. R² can be negative if the model is worse than the mean.
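A minimal sketch of the definition, checking the three regimes on a toy series (function name and data are illustrative):

```python
# R^2 = 1 - SS_res / SS_tot, relative to the constant-mean baseline.
def r_squared(y_true, y_pred):
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

y_true = [1.0, 2.0, 3.0]
print(r_squared(y_true, [1.0, 2.0, 3.0]))  # 1.0  (perfect predictions)
print(r_squared(y_true, [2.0, 2.0, 2.0]))  # 0.0  (the mean baseline)
print(r_squared(y_true, [3.0, 2.0, 1.0]))  # -3.0 (worse than the mean)
```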
Calibration
A model is calibrated if, when it predicts probability p, the event actually occurs with frequency p. Formally, for all p ∈ [0, 1]:
P(Y = 1 | p̂(X) = p) = p
Calibration is checked via a reliability diagram: bin predictions by predicted probability, plot the mean predicted probability versus the actual frequency in each bin. A perfectly calibrated model lies on the diagonal.
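The binning step behind a reliability diagram can be sketched as follows, assuming equal-width probability bins (the function name and bin count are arbitrary choices):

```python
# Bin predictions by predicted probability, then compare the mean
# prediction to the observed positive frequency in each bin.
def reliability_bins(p_pred, y_true, n_bins=10):
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(p_pred, y_true):
        idx = min(int(p * n_bins), n_bins - 1)  # clamp p = 1.0 into last bin
        bins[idx].append((p, y))
    rows = []
    for b in bins:
        if b:
            mean_p = sum(p for p, _ in b) / len(b)
            freq = sum(y for _, y in b) / len(b)
            rows.append((mean_p, freq, len(b)))
    return rows  # a calibrated model has mean_p ≈ freq in every bin

p_pred = [0.2] * 5 + [0.8] * 5
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]
print(reliability_bins(p_pred, y_true))
```

Plotting mean_p against freq for each row gives the reliability diagram; points on the diagonal indicate calibration.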
A model can have excellent AUC (good ranking) but poor calibration (the predicted probabilities do not match actual frequencies). Calibration matters when you use predicted probabilities for decision-making: resource allocation, insurance pricing, medical risk assessment.
Proper Scoring Rules
Proper Scoring Rules Encourage Honest Probabilities
Statement
A scoring rule S is proper if the expected score is maximized when the forecaster reports the true distribution q:
E_{y∼q}[S(q, y)] ≥ E_{y∼q}[S(p, y)] for all distributions p.
It is strictly proper if equality holds only when p = q. Both log loss (negative cross-entropy) and the Brier score are strictly proper scoring rules. Accuracy is not a proper scoring rule.
Intuition
A proper scoring rule means you cannot game the metric by reporting probabilities different from your true beliefs. If you think the probability of rain is 70%, reporting 70% maximizes your expected score under a proper rule. Reporting 90% to seem more confident (or 50% to hedge) makes your expected score worse.
Proof Sketch
For the Brier score, the expected loss when the truth is Bernoulli(q) and you report p is q(1 − p)² + (1 − q)p² = (p − q)² + q(1 − q). This is minimized when p = q.
For log loss, the expected loss −q log p − (1 − q) log(1 − p) is the cross-entropy H(q, p), which by Gibbs' inequality is minimized when p = q.
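The proof sketch can be checked numerically: under a true probability of q = 0.7, the honest report minimizes both expected losses among a grid of candidate reports (the grid and names are illustrative):

```python
import math

q = 0.7  # assumed true probability of the positive outcome

# Expected Brier score when reporting p: q*(1-p)^2 + (1-q)*p^2
def exp_brier(p):
    return q * (1 - p) ** 2 + (1 - q) * p ** 2

# Expected log loss when reporting p: the cross-entropy H(q, p)
def exp_log_loss(p):
    return -(q * math.log(p) + (1 - q) * math.log(1 - p))

reports = [0.5, 0.6, 0.7, 0.8, 0.9]
best_brier = min(reports, key=exp_brier)
best_ll = min(reports, key=exp_log_loss)
print(best_brier, best_ll)  # 0.7 0.7 — honesty wins under both rules
```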
Why It Matters
Using a proper scoring rule means that the model's optimal strategy is to output true probabilities. If you train with log loss, the model is incentivized to produce calibrated probabilities. If you evaluate with accuracy, the model is incentivized to be overconfident: every prediction is pushed toward 0 or 1, regardless of the true uncertainty.
Failure Mode
Even strictly proper scoring rules do not guarantee calibration in finite samples. A model trained with log loss may still be miscalibrated due to model misspecification, limited data, or optimization issues. Post-hoc calibration (Platt scaling, isotonic regression) is often needed in practice.
Why Accuracy Is Often the Wrong Metric
Consider a medical screening test for a disease affecting 1% of the population. A model that always predicts "no disease" achieves 99% accuracy. Yet it catches zero cases. This model has perfect specificity and zero recall.
The problem is that accuracy weights all errors equally. In imbalanced settings, the majority class dominates. Better alternatives:
- Precision/recall/F1 when you care about the minority class
- AUC when you need threshold-independent discrimination
- Log loss or Brier score when you need calibrated probabilities
- Cost-sensitive metrics when false positives and false negatives have different real-world costs
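The screening example above can be sketched in a few lines (the counts are illustrative, scaled to 1% prevalence):

```python
# The accuracy paradox: an always-"no disease" classifier at 1% prevalence.
n, positives = 1000, 10
tp, fp = 0, 0                 # the model never predicts positive
fn, tn = positives, n - positives

accuracy = (tp + tn) / n      # 0.99 — looks excellent
recall = tp / (tp + fn)       # 0.0  — catches zero cases
print(accuracy, recall)
```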
How to Choose the Right Metric
- What decision does the model support? If it is a hard yes/no decision, use F1 or a cost-sensitive metric. If probabilities matter, use log loss or Brier score.
- Is the dataset balanced? If not, avoid accuracy. Use F1, AUC, or precision-recall AUC.
- Do you need ranking or calibration? AUC measures ranking. Log loss and Brier score measure calibration. These are different properties.
- What are the costs of errors? If false negatives are catastrophic (cancer screening), optimize recall at a fixed precision threshold. If false positives are costly (fraud flagging), optimize precision.
Common Confusions
High AUC does not mean good calibration
AUC measures whether positives are ranked above negatives. A model can achieve AUC = 0.99 while predicting probabilities that are systematically too high or too low. If you need probabilities (not just rankings), check the reliability diagram and Brier score separately.
F1 depends on the threshold but is often reported at a single threshold
F1 is defined for a specific classification threshold. Changing the threshold changes precision and recall, and therefore F1. When papers report "F1 = 0.85" they typically mean the F1 at the default threshold of 0.5. The optimal threshold for F1 is generally not 0.5. Always report the threshold or use the precision-recall curve.
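A threshold sweep makes the dependence concrete; in this illustrative example (function name and data are assumptions), F1 peaks at a threshold below 0.5:

```python
# F1 at a given threshold: binarize scores, then compute
# F1 = 2*TP / (2*TP + FP + FN) from the resulting counts.
def f1_at_threshold(scores, labels, threshold):
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 1)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

scores = [0.95, 0.7, 0.6, 0.45, 0.3, 0.1]
labels = [1, 1, 0, 1, 0, 0]
for t in (0.3, 0.4, 0.5, 0.6):
    print(t, f1_at_threshold(scores, labels, t))  # F1 is highest at t = 0.4
```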
R-squared can be negative
R² < 0 means the model predicts worse than the constant-mean baseline. This can happen with a badly overfitted model evaluated on the test set, or when applying a model to a distribution different from training. Negative R² does not mean the math is broken. It means the model is worse than useless.
Summary
- Accuracy is misleading for imbalanced datasets; prefer F1, AUC, or log loss
- Precision and recall trade off; F1 balances them via the harmonic mean
- AUC measures ranking ability, not calibration
- Log loss and Brier score are strictly proper scoring rules: they incentivize honest probability estimates
- RMSE penalizes large errors more than MAE; RMSE optimizes the mean, MAE optimizes the median
- Calibration and discrimination are different properties; a model can be good at one and bad at the other
- Always choose your metric based on the real-world decision the model supports
Exercises
Problem
A binary classifier on a dataset with 950 negatives and 50 positives predicts all examples as negative. Compute: accuracy, precision (define as 0 if undefined), recall, and F1.
Problem
Prove that the Brier score is a strictly proper scoring rule. That is, show that for a Bernoulli outcome with true probability , the expected Brier score is uniquely minimized at .
Problem
A model has AUC = 0.95 but its reliability diagram shows it is badly miscalibrated: it overestimates probabilities for rare events. Describe a post-hoc calibration method that fixes this without changing the ranking (preserving AUC). What assumptions does it require?
References
Canonical:
- Hastie, Tibshirani, Friedman, Elements of Statistical Learning, Chapter 7
- Gneiting & Raftery, "Strictly Proper Scoring Rules, Prediction, and Estimation" (2007)
Current:
- Ferri, Hernandez-Orallo, Modroiu, "An Experimental Comparison of Performance Measures for Classification" (2009)
- Niculescu-Mizil & Caruana, "Predicting Good Probabilities with Supervised Learning" (ICML 2005)
- Shalev-Shwartz & Ben-David, Understanding Machine Learning (2014), Chapters 11-14
- Murphy, Machine Learning: A Probabilistic Perspective (2012), Chapters 5-7
Next Topics
The natural next steps from evaluation metrics:
- Benchmarking methodology: how to design fair comparisons using these metrics
- Cross-validation theory: how to estimate metrics reliably from limited data
- Hypothesis testing for ML: how to determine if metric differences are statistically significant
Last reviewed: April 2026
Builds on This
- Benchmarking Methodology (Layer 3)
- Proper Scoring Rules (Layer 2)