Beta. Content is under active construction and has not been peer-reviewed. Report errors on GitHub.

Methodology

Proper Scoring Rules

A scoring rule is proper if the expected score is maximized when the forecaster reports their true belief. Log score and Brier score are strictly proper. Accuracy is not. Why this matters for calibrated probability estimates.

Advanced · Tier 2 · Stable · ~40 min

Why This Matters

When a model outputs a probability (e.g., "70% chance of rain"), you want that probability to mean something. A model that says 70% should be right about 70% of the time. The question is: does the loss function you use to train or evaluate the model incentivize this honest reporting?

A proper scoring rule says yes: the forecaster's expected score is maximized when they report their true belief. An improper scoring rule can be gamed: the forecaster can get a better expected score by reporting a probability different from their true belief.

Cross-entropy loss (log score) is strictly proper. Brier score (squared error on probabilities) is strictly proper. Classification accuracy is not proper. This distinction explains why models trained with cross-entropy produce calibrated probabilities, while models evaluated only on accuracy may not.

Mental Model

Imagine a weather forecaster who believes there is a 60% chance of rain. A proper scoring rule rewards her most if she reports 60%. If she reports 90% instead (to seem more decisive), her expected score is worse. If she reports 50% (hedging), her expected score is also worse. Only the truth maximizes expected score.

An improper rule might reward her more for reporting 100% or 0% (confident and sometimes right) than for reporting her honest 60%.
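This incentive can be checked numerically. The sketch below (plain Python; the helper names are mine, not from any library) scans candidate reports for a forecaster whose true belief is q = 0.6 and confirms that both expected scores peak at the honest report.

```python
import math

def expected_log_score(p, q):
    # Expected log score when the true probability is q and we report p:
    # q ln p + (1-q) ln(1-p)
    return q * math.log(p) + (1 - q) * math.log(1 - p)

def expected_brier_score(p, q):
    # Expected negative squared error: -(q (p-1)^2 + (1-q) p^2)
    return -(q * (p - 1) ** 2 + (1 - q) * p ** 2)

q = 0.6  # the forecaster's true belief
reports = [i / 100 for i in range(1, 100)]  # candidate reports 0.01 .. 0.99

best_log = max(reports, key=lambda p: expected_log_score(p, q))
best_brier = max(reports, key=lambda p: expected_brier_score(p, q))
print(best_log, best_brier)  # 0.6 0.6: honesty maximizes both
```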

Formal Setup

Definition

Scoring Rule

A scoring rule assigns a numerical score $S(p, y)$ to a probability forecast $p \in [0,1]$ (or more generally $p \in \Delta_K$ for $K$ classes) given the realized outcome $y$. Higher scores indicate better forecasts (we follow the convention that scoring rules are to be maximized).

Definition

Proper Scoring Rule

A scoring rule $S$ is proper if for all true probabilities $q$ and all reported probabilities $p$:

$$\mathbb{E}_{y \sim q}[S(q, y)] \geq \mathbb{E}_{y \sim q}[S(p, y)]$$

The forecaster maximizes expected score by reporting $p = q$. The rule is strictly proper if equality holds only when $p = q$.

Main Scoring Rules

Definition

Log Score

For binary outcomes with $y \in \{0, 1\}$:

$$S_{\log}(p, y) = y \ln p + (1 - y) \ln(1 - p)$$

The negative log score is the cross-entropy loss: $-S_{\log}(p, y) = -y \ln p - (1-y) \ln(1-p)$.

Definition

Brier Score

For binary outcomes:

$$S_{\text{Brier}}(p, y) = -(p - y)^2$$

This is the negative of the squared difference between the forecast probability and the outcome. Lower squared error means higher Brier score.
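Both definitions translate directly into code. A minimal sketch (the function names are my own):

```python
import math

def log_score(p, y):
    # y ln p + (1 - y) ln(1 - p); its negative is binary cross-entropy
    return y * math.log(p) + (1 - y) * math.log(1 - p)

def brier_score(p, y):
    # -(p - y)^2; higher is better under the maximization convention
    return -(p - y) ** 2

print(log_score(0.8, 1))    # ln 0.8 ≈ -0.223
print(brier_score(0.8, 1))  # ≈ -0.04
```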

Main Theorems

Theorem

Log Score and Brier Score Are Strictly Proper

Statement

The log score $S_{\log}$ and the Brier score $S_{\text{Brier}}$ are both strictly proper. For the log score:

$$\mathbb{E}_{y \sim q}[S_{\log}(q, y)] - \mathbb{E}_{y \sim q}[S_{\log}(p, y)] = \mathrm{KL}(q \,\|\, p) \geq 0$$

with equality iff $p = q$. For the Brier score:

$$\mathbb{E}_{y \sim q}[S_{\text{Brier}}(q, y)] - \mathbb{E}_{y \sim q}[S_{\text{Brier}}(p, y)] = (p - q)^2 \geq 0$$

with equality iff $p = q$.

Intuition

For the log score, misreporting $p \neq q$ incurs the KL divergence $\mathrm{KL}(q \,\|\, p)$ as a penalty. Since KL divergence is non-negative and zero only when the distributions match, the log score is strictly proper. For the Brier score, the penalty is the squared distance between the report and the truth, which is zero only when they agree.

Proof Sketch

Log score: $\mathbb{E}_q[S_{\log}(p, y)] = q \ln p + (1-q) \ln(1-p)$. Subtracting this from $\mathbb{E}_q[S_{\log}(q, y)] = q \ln q + (1-q) \ln(1-q)$ gives $q \ln(q/p) + (1-q) \ln((1-q)/(1-p)) = \mathrm{KL}(q \,\|\, p) \geq 0$ by Gibbs' inequality.

Brier score: $\mathbb{E}_q[S_{\text{Brier}}(p, y)] = -\mathbb{E}_q[(p-y)^2] = -(p^2 - 2pq + q)$, using $\mathbb{E}_q[y^2] = \mathbb{E}_q[y] = q$ for binary $y$. And $\mathbb{E}_q[S_{\text{Brier}}(q, y)] = -(q^2 - 2q^2 + q) = -(q - q^2)$. The difference is $-(q - q^2) + (p^2 - 2pq + q) = p^2 - 2pq + q^2 = (p-q)^2$.
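Both identities are easy to sanity-check numerically. The snippet below (helper names are my own) evaluates the two expected-score gaps at $q = 0.7$, $p = 0.4$ and compares them to $\mathrm{KL}(q \| p)$ and $(p-q)^2$.

```python
import math

def exp_log(p, q):
    # Expected log score under true probability q when reporting p
    return q * math.log(p) + (1 - q) * math.log(1 - p)

def exp_brier(p, q):
    # Expected Brier score under true probability q when reporting p
    return -(q * (p - 1) ** 2 + (1 - q) * p ** 2)

q, p = 0.7, 0.4
kl = q * math.log(q / p) + (1 - q) * math.log((1 - q) / (1 - p))

log_gap = exp_log(q, q) - exp_log(p, q)
brier_gap = exp_brier(q, q) - exp_brier(p, q)
print(math.isclose(log_gap, kl))              # True
print(math.isclose(brier_gap, (p - q) ** 2))  # True
```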

Why It Matters

This theorem is why cross-entropy loss is the standard for training probabilistic classifiers. It is not just a convenient choice: a proper scoring rule makes the forecaster's optimal strategy to report their true conditional probabilities, and among local scoring rules (those that depend only on the probability assigned to the realized outcome, for three or more outcomes) the log score is the unique strictly proper one up to affine transformation. Using a non-proper scoring rule (like accuracy) for training can lead to models that output overconfident or underconfident probabilities.

Failure Mode

The log score is unbounded: if the true label is $y = 1$ and you predict $p \approx 0$, the log score $\ln p \to -\infty$. This makes the log score sensitive to confident wrong predictions. The Brier score is bounded in $[-1, 0]$, making it more robust to outlier predictions but less sensitive to calibration differences near 0 and 1. In practice, label smoothing (e.g., replacing the target $y = 1$ with $y = 0.9$) is used to mitigate the unboundedness of the log score.
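To see the mitigation concretely: against a hard target the cross-entropy optimum sits at the boundary, where a confident wrong prediction costs arbitrarily much; against a smoothed target the optimum moves into the interior. A small sketch (my own helper, not a training recipe):

```python
import math

def xent(target, p):
    # Binary cross-entropy of predicting p against a (possibly soft) target
    return -(target * math.log(p) + (1 - target) * math.log(1 - p))

grid = [i / 1000 for i in range(1, 1000)]  # candidate predictions 0.001 .. 0.999
# Hard target y = 1: the loss keeps falling as p -> 1, pushing p to the boundary.
# Smoothed target 0.9: the optimum is interior, at p = 0.9, away from the
# extremes where ln p explodes on a wrong prediction.
best_hard = min(grid, key=lambda p: xent(1.0, p))
best_soft = min(grid, key=lambda p: xent(0.9, p))
print(best_hard, best_soft)  # 0.999 0.9
```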

Proposition

Brier Score Decomposition

Statement

The average Brier score decomposes into three terms:

$$\frac{1}{N} \sum_{i=1}^{N} (p_i - y_i)^2 = \underbrace{\frac{1}{N} \sum_{k=1}^{K} n_k (\bar{p}_k - \bar{y}_k)^2}_{\text{calibration (reliability)}} - \underbrace{\frac{1}{N} \sum_{k=1}^{K} n_k (\bar{y}_k - \bar{y})^2}_{\text{resolution}} + \underbrace{\bar{y}(1-\bar{y})}_{\text{uncertainty}}$$

where $n_k$ is the number of forecasts in bin $k$, $\bar{p}_k$ is the average prediction in bin $k$, $\bar{y}_k$ is the average outcome in bin $k$, and $\bar{y}$ is the overall base rate. Calibration is minimized (at 0) when $\bar{p}_k = \bar{y}_k$ for all bins; resolution enters with a minus sign, so bin outcomes that spread away from the base rate improve the score.

Intuition

The Brier score measures two things at once. Calibration: are your probabilities honest (when you say 70%, does it happen 70% of the time)? Resolution: do your probabilities vary meaningfully (always predicting the base rate is calibrated but useless). This decomposition separates the two. The uncertainty term depends only on the data, not the forecaster.

Proof Sketch

Group forecasts by bin and condition on the bin. The squared difference between bin-average predictions and bin-average outcomes gives the calibration term. Then decompose the variance of the outcomes by the law of total variance, $\bar{y}(1-\bar{y}) = \frac{1}{N}\sum_k n_k \bar{y}_k(1-\bar{y}_k) + \frac{1}{N}\sum_k n_k (\bar{y}_k - \bar{y})^2$: the within-bin variance combines with the uncertainty term to produce the resolution term.

Why It Matters

This decomposition is used to diagnose whether a model's poor Brier score comes from miscalibration (fixable by post-hoc calibration methods like Platt scaling or isotonic regression) or from poor discrimination (requires a better model). It is the formal basis for reliability diagrams.

Failure Mode

The decomposition depends on the binning scheme. Different bin boundaries give different calibration and refinement values. With too few bins, calibration looks good even for miscalibrated models. With too many bins, each bin has too few samples for stable estimates.
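The decomposition is straightforward to compute for a chosen binning. Below is a hedged sketch (the equal-width binning and all names are my own; the identity is exact when predictions within a bin coincide, and approximate otherwise):

```python
def brier_decomposition(probs, outcomes, n_bins=10):
    """Split the mean Brier score into calibration - resolution + uncertainty."""
    n = len(probs)
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        k = min(int(p * n_bins), n_bins - 1)  # equal-width bins on [0, 1]
        bins[k].append((p, y))
    base_rate = sum(outcomes) / n
    calib = resol = 0.0
    for b in bins:
        if not b:
            continue
        p_bar = sum(p for p, _ in b) / len(b)  # average prediction in bin
        y_bar = sum(y for _, y in b) / len(b)  # average outcome in bin
        calib += len(b) * (p_bar - y_bar) ** 2 / n
        resol += len(b) * (y_bar - base_rate) ** 2 / n
    uncert = base_rate * (1 - base_rate)
    return calib, resol, uncert

probs = [0.1, 0.1, 0.3, 0.7, 0.7, 0.9]
outcomes = [0, 0, 1, 1, 0, 1]
calib, resol, uncert = brier_decomposition(probs, outcomes)
brier = sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)
print(abs(brier - (calib - resol + uncert)) < 1e-9)  # True: identity holds
```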

Worked Scoring Examples

Example

Comparing log score and Brier score on three forecasters

Three weather forecasters predict rain probability for 5 days. The actual outcomes are: Rain, No Rain, Rain, Rain, No Rain.

Forecaster A (well-calibrated): reports 0.7, 0.3, 0.8, 0.6, 0.2. Forecaster B (overconfident): reports 0.95, 0.05, 0.95, 0.95, 0.05. Forecaster C (always 50/50): reports 0.5, 0.5, 0.5, 0.5, 0.5.

Log scores (higher is better):

  • A: $[\ln 0.7 + \ln 0.7 + \ln 0.8 + \ln 0.6 + \ln 0.8]/5 = [-0.357 - 0.357 - 0.223 - 0.511 - 0.223]/5 = -0.334$
  • B: $[\ln 0.95 + \ln 0.95 + \ln 0.95 + \ln 0.95 + \ln 0.95]/5 = -0.051$
  • C: $[\ln 0.5 + \ln 0.5 + \ln 0.5 + \ln 0.5 + \ln 0.5]/5 = -0.693$

Forecaster B looks best by log score on this sample because B assigned 95% probability to the realized outcome every day and happened to be right each time. But if the true rain probabilities are moderate (like A's reports), B is badly overconfident: over many forecasts, some of B's 95% calls would inevitably be wrong, and each confident miss would be penalized severely by the log score's unbounded negative penalty.

Brier scores (reported here as mean squared error, so lower is better):

  • A: $[(0.7-1)^2 + (0.3-0)^2 + (0.8-1)^2 + (0.6-1)^2 + (0.2-0)^2]/5 = [0.09 + 0.09 + 0.04 + 0.16 + 0.04]/5 = 0.084$
  • B: $[(0.95-1)^2 + (0.05-0)^2 + (0.95-1)^2 + (0.95-1)^2 + (0.05-0)^2]/5 = [0.0025 \times 5]/5 = 0.0025$
  • C: $[(0.5-1)^2 + (0.5-0)^2 + (0.5-1)^2 + (0.5-1)^2 + (0.5-0)^2]/5 = [0.25 \times 5]/5 = 0.25$

On this 5-day sample, B wins both scores. The key point: with 5 observations, you cannot detect miscalibration. Over 1000 forecasts, B's 95% predictions would include many wrong days, and the scores would correctly penalize B's overconfidence.
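The arithmetic above is easy to reproduce. A throwaway script (forecaster labels taken from the text):

```python
import math

outcomes = [1, 0, 1, 1, 0]  # Rain, No Rain, Rain, Rain, No Rain
forecasters = {
    "A": [0.7, 0.3, 0.8, 0.6, 0.2],       # well-calibrated
    "B": [0.95, 0.05, 0.95, 0.95, 0.05],  # overconfident (and lucky here)
    "C": [0.5, 0.5, 0.5, 0.5, 0.5],       # always 50/50
}

results = {}
for name, probs in forecasters.items():
    log = sum(y * math.log(p) + (1 - y) * math.log(1 - p)
              for p, y in zip(probs, outcomes)) / 5
    mse = sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / 5
    results[name] = (round(log, 3), round(mse, 4))
    print(name, results[name])  # (average log score, mean squared error)
```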

Connection to Calibration

A forecaster is calibrated if, among all instances where the forecaster reports probability $p$, the actual frequency of the event is approximately $p$. Proper scoring rules incentivize calibration, but they also reward sharpness (making predictions close to 0 or 1 when justified).

The Brier score decomposition (above) makes this precise: the Brier score equals calibration error minus a resolution term plus an irreducible uncertainty. A forecaster can improve their Brier score by becoming better calibrated (reducing the calibration term) or by making sharper predictions that separate events from non-events (increasing the resolution term).

In practice, post-hoc calibration methods like Platt scaling (fitting a logistic regression to the model's output probabilities on a held-out set) or isotonic regression can fix miscalibration without retraining the model. These methods optimize the calibration component of the Brier decomposition while preserving the model's ranking of examples.
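As a sketch of what post-hoc calibration looks like in code, the snippet below fits the Platt-style map $p' = \sigma(a \cdot \mathrm{logit}(p) + b)$ by plain gradient descent on cross-entropy (the synthetic data, helper names, and hand-rolled optimizer are my own stand-ins; real pipelines would use a library fit on a held-out set):

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def logit(p):
    return math.log(p / (1 - p))

def platt_fit(probs, outcomes, lr=0.1, steps=2000):
    """Fit p' = sigmoid(a * logit(p) + b) by gradient descent on cross-entropy."""
    a, b = 1.0, 0.0
    n = len(probs)
    for _ in range(steps):
        ga = gb = 0.0
        for p, y in zip(probs, outcomes):
            z = logit(p)
            err = sigmoid(a * z + b) - y  # d(cross-entropy)/d(logit)
            ga += err * z / n
            gb += err / n
        a -= lr * ga
        b -= lr * gb
    return a, b

# An overconfident model: says 90% for events that occur only ~2/3 of the time.
probs = [0.9, 0.9, 0.9, 0.1, 0.1, 0.1]
outcomes = [1, 1, 0, 0, 0, 1]
a, b = platt_fit(probs, outcomes)
recal = [sigmoid(a * logit(p) + b) for p in probs]
print(round(recal[0], 2))  # pulled down from 0.9 toward the observed 2/3
```

Because the recalibration map is monotone, the model's ranking of examples is preserved, exactly as the text notes.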

Watch Out

A calibrated model can still have poor accuracy

Calibration is about the reliability of probability estimates, not about the model's ability to discriminate between classes. A model that always predicts the base rate (e.g., 40% for a class that occurs 40% of the time) is perfectly calibrated but completely useless for decision-making. Good models are both calibrated and sharp: their predictions are spread across $[0,1]$ and the extreme predictions are reliable.

Why Accuracy Is Not Proper

Consider a binary prediction problem with true probability $q = 0.6$. Under accuracy (predict the more likely class, get 1 if correct, 0 if wrong):

  • Reporting $p = 0.6$: predict class 1, expected accuracy $= q = 0.6$
  • Reporting $p = 0.99$: predict class 1, expected accuracy $= q = 0.6$
  • Reporting $p = 0.01$: predict class 0, expected accuracy $= 1 - q = 0.4$

Any report $p > 0.5$ gives the same expected accuracy. The forecaster is not rewarded for reporting 0.6 vs 0.99. Accuracy only cares about which side of 0.5 you are on, not how calibrated your probability is. This makes accuracy an improper scoring rule: it does not incentivize honest probability reporting.
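A quick check of the table above (helper names are my own):

```python
def expected_accuracy(report, q):
    # Predict the class on report's side of 0.5; score 1 if correct, 0 if not.
    return q if report > 0.5 else 1 - q

def expected_brier(report, q):
    # Expected Brier score (maximization convention) under true probability q.
    return -(q * (report - 1) ** 2 + (1 - q) * report ** 2)

q = 0.6
print(expected_accuracy(0.6, q), expected_accuracy(0.99, q))  # 0.6 0.6: indistinguishable
print(expected_brier(0.6, q) > expected_brier(0.99, q))       # True: Brier prefers honesty
```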

Common Confusions

Watch Out

Proper scoring rules do not guarantee calibration

A proper scoring rule incentivizes the forecaster to report their true belief. But the forecaster's true belief might itself be wrong (the model might be badly trained). Properness means: if the model has access to the true conditional probabilities, reporting them is optimal. It does not mean the model actually has access to the true probabilities.

Watch Out

Log loss and cross-entropy are the same thing

The log score $S_{\log}(p, y) = y \ln p + (1-y) \ln(1-p)$ is the negative of the binary cross-entropy loss. Maximizing log score is equivalent to minimizing cross-entropy. The sign convention is the only difference: scoring rules are maximized, loss functions are minimized.

Exercises

ExerciseCore

Problem

A forecaster believes $P(Y = 1) = 0.3$. Compute the expected log score if they honestly report $p = 0.3$ and if they misreport $p = 0.5$. Verify that honest reporting gives a higher expected score.

ExerciseAdvanced

Problem

Prove that the quadratic scoring rule $S(p, y) = 2py + 2(1-p)(1-y) - p^2 - (1-p)^2$ is strictly proper. Show that it is an affine transformation of the Brier score.

References

Canonical:

  • Gneiting & Raftery, Strictly Proper Scoring Rules, Prediction, and Estimation (2007)
  • Brier, Verification of Forecasts Expressed in Terms of Probability (1950)

Current:

  • Murphy & Winkler, Reliability of Subjective Probability Forecasts of Precipitation and Temperature (1977)

  • Hastie, Tibshirani, Friedman, The Elements of Statistical Learning (2009), Chapters 7-8

  • Shalev-Shwartz & Ben-David, Understanding Machine Learning (2014), Chapters 11-14


Last reviewed: April 2026
