Beta. Content is under active construction and has not been peer-reviewed. Report errors on GitHub.

AI Safety

Calibration and Uncertainty Quantification

When a model says 70% confidence, is it right 70% of the time? Calibration measures the alignment between predicted probabilities and actual outcomes. Temperature scaling, Platt scaling, conformal prediction, and MC dropout provide practical tools for trustworthy uncertainty.

Advanced · Tier 2 · ~50 min

Why This Matters

[Reliability diagram: predicted confidence vs. actual accuracy. The diagonal marks perfect calibration; the gap below it is overconfidence. Two curves are shown: before calibration and after temperature scaling.]

A model that reports 90% confidence but is correct only 60% of the time is dangerous. In medical diagnosis, autonomous driving, and financial risk assessment, decisions depend not just on predictions but on how much to trust those predictions. Calibration is the property that makes predicted probabilities meaningful.

Modern neural networks are often poorly calibrated: they tend to be overconfident. A ResNet trained on ImageNet may assign 95% confidence to predictions that are correct only 80% of the time. Post-hoc calibration methods fix this cheaply, and conformal prediction provides distribution-free coverage guarantees without any assumptions about the model. See also proper scoring rules for the theory of what makes a calibration metric valid.

Mental Model

Think of calibration as a contract between the model and the user. If the model says "I am 70% sure this is a cat," then across all images where the model says 70%, roughly 70% should actually be cats. If the model systematically overestimates or underestimates its confidence, the contract is broken.

Uncertainty quantification goes further: it asks not just "how confident?" but "what is the set of plausible answers?" Conformal prediction answers this by constructing prediction sets that are guaranteed to contain the true answer with a user-specified probability.

Formal Setup and Notation

Let $f(x)$ be a classifier that outputs a probability vector $\hat{p} = f(x) \in \Delta^{K-1}$ over $K$ classes. Let $\hat{y} = \arg\max_k \hat{p}_k$ be the predicted class and $\hat{p}_{\hat{y}}$ the associated confidence.

Definition

Perfect Calibration

A classifier $f$ is perfectly calibrated if for all confidence levels $p \in [0, 1]$:

$$\mathbb{P}(\hat{y} = y \mid \hat{p}_{\hat{y}} = p) = p$$

That is, among all predictions made with confidence $p$, exactly a fraction $p$ are correct.

Definition

Expected Calibration Error (ECE)

Partition predictions into $M$ bins $B_1, \ldots, B_M$ by confidence level. The expected calibration error is:

$$\text{ECE} = \sum_{m=1}^{M} \frac{|B_m|}{n} \left|\text{acc}(B_m) - \text{conf}(B_m)\right|$$

where $\text{acc}(B_m)$ is the accuracy within bin $B_m$, $\text{conf}(B_m)$ is the average confidence within that bin, and $n$ is the total number of predictions.
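The binned ECE computation can be sketched in a few lines. This is a minimal illustration with equal-width bins; the confidences and correctness flags at the bottom are toy values, not data from the text.

```python
# Minimal ECE sketch with equal-width confidence bins (an assumption;
# equal-mass binning is a common alternative).

def expected_calibration_error(confidences, correct, num_bins=10):
    """ECE = sum_m (|B_m| / n) * |acc(B_m) - conf(B_m)|."""
    n = len(confidences)
    bins = [[] for _ in range(num_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * num_bins), num_bins - 1)  # conf = 1.0 -> last bin
        bins[idx].append((conf, ok))
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        acc = sum(ok for _, ok in b) / len(b)
        ece += (len(b) / n) * abs(acc - avg_conf)
    return ece

# Toy data: four predictions at 90% confidence, three correct,
# so the single occupied bin contributes |0.75 - 0.9| = 0.15.
print(expected_calibration_error([0.9, 0.9, 0.9, 0.9], [1, 1, 1, 0]))
```

Note that the choice of `num_bins` changes the estimate, which is exactly the binning sensitivity discussed under Common Confusions below.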

Calibration Methods

Temperature Scaling

The simplest post-hoc calibration method. Given logits $\mathbf{z} = f_{\text{logit}}(x)$, apply a single scalar temperature $T > 0$:

$$\hat{p}_k = \frac{\exp(z_k / T)}{\sum_j \exp(z_j / T)}$$

When $T > 1$, the softmax output becomes softer (less confident). When $T < 1$, it becomes sharper. The temperature $T$ is optimized on a held-out validation set by minimizing the negative log-likelihood. Temperature scaling does not change the predicted class (the argmax of the logits is unaffected by dividing them by a positive constant), only the confidence.
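The fitting step above can be sketched as a one-dimensional search over $T$. This is a minimal illustration using grid search rather than the gradient-based optimizer typically used in practice; the validation logits and labels are toy values chosen to mimic an overconfident model.

```python
import math

# Hedged sketch of temperature scaling fit by NLL minimization on a
# held-out set. Grid search is used for simplicity (an assumption;
# LBFGS on T is the usual choice in practice).

def softmax(logits, T):
    exps = [math.exp(z / T) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def nll(logit_rows, labels, T):
    """Average negative log-likelihood of the true labels at temperature T."""
    return -sum(math.log(softmax(row, T)[y])
                for row, y in zip(logit_rows, labels)) / len(labels)

def fit_temperature(logit_rows, labels):
    grid = [0.5 + 0.05 * i for i in range(190)]  # T in [0.5, 10.0)
    return min(grid, key=lambda T: nll(logit_rows, labels, T))

# Overconfident toy model: large logit gaps (~98% softmax confidence),
# but only 2 of 3 predictions are correct, so the fitted T exceeds 1.
val_logits = [[4.0, 0.0], [4.0, 0.0], [4.0, 0.0]]
val_labels = [0, 0, 1]
T = fit_temperature(val_logits, val_labels)
print(T > 1.0)  # → True: the softmax is softened
```

Because the predicted class never changes, accuracy before and after fitting is identical; only the reported confidence moves.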

Platt Scaling

For binary classification, fit a logistic regression on the logits:

$$\hat{p}(y = 1 \mid x) = \sigma(a \cdot z + b)$$

where $z$ is the model's raw logit output and $a, b$ are learned on a validation set. This generalizes temperature scaling by adding a bias term.

Proposition

Platt Scaling Calibration

Statement

If the true calibration function (mapping logits to correct-prediction probability) is a sigmoid, then Platt scaling with parameters $(a, b)$ fitted by maximum likelihood on the calibration set recovers the true calibration function as the calibration set size grows. The calibrated probabilities satisfy $\mathbb{P}(y = 1 \mid \hat{p} = p) \to p$ in probability.

Intuition

Platt scaling reparameterizes the model's confidence through a learned sigmoid. If the model's logits are monotonically related to true probability (which is common for well-trained models), a simple two-parameter fit corrects the miscalibration.

Why It Matters

Platt scaling is the standard calibration method for SVMs and is widely used as a baseline for neural network calibration. It requires only a small validation set and adds negligible computational cost.
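The two-parameter fit can be sketched directly as maximum likelihood via gradient descent on the binary cross-entropy. The logits and labels below are illustrative toy values for an overconfident model (logits of $\pm 4$ imply ~98% confidence, but accuracy is only 80%), so the fitted slope comes out below 1.

```python
import math

# Hedged sketch of Platt scaling: fit (a, b) by gradient descent on
# the binary cross-entropy. Data below are toy values, not from the text.

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def fit_platt(logits, labels, lr=0.1, steps=2000):
    a, b = 1.0, 0.0  # start from the identity mapping (no recalibration)
    n = len(logits)
    for _ in range(steps):
        ga = gb = 0.0
        for z, y in zip(logits, labels):
            err = sigmoid(a * z + b) - y  # gradient of BCE w.r.t. the logit
            ga += err * z / n
            gb += err / n
        a -= lr * ga
        b -= lr * gb
    return a, b

# 98% confidence but 80% accuracy: Platt learns a < 1 to shrink the logits.
logits = [4.0] * 5 + [-4.0] * 5
labels = [1, 1, 1, 1, 0] + [0, 0, 0, 0, 1]
a, b = fit_platt(logits, labels)
print(a < 1.0)  # → True: confidence is pulled toward the empirical accuracy
```

With this symmetric toy data the maximum-likelihood solution satisfies $\sigma(4a) = 0.8$, i.e. $a = \ln 4 / 4 \approx 0.347$, and $b = 0$.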

Conformal Prediction

Conformal prediction is a structurally different approach: instead of calibrating point predictions, it constructs prediction sets with guaranteed coverage.

Definition

Nonconformity Score

A nonconformity score $s(x, y)$ measures how unusual the pair $(x, y)$ is relative to the model. A common choice for classification:

$$s(x, y) = 1 - \hat{p}_y(x)$$

where $\hat{p}_y(x)$ is the model's predicted probability for class $y$. High scores mean the model finds this label surprising.

Theorem

Split Conformal Coverage Guarantee

Statement

Given calibration data $\{(x_1, y_1), \ldots, (x_n, y_n)\}$ and a new test point $(x_{n+1}, y_{n+1})$, all exchangeable, compute nonconformity scores $s_i = s(x_i, y_i)$ for $i = 1, \ldots, n$. Let $\hat{q}$ be the $\lceil (1-\alpha)(n+1) \rceil / n$ empirical quantile of $\{s_1, \ldots, s_n\}$. Then the prediction set:

$$C(x_{n+1}) = \{y : s(x_{n+1}, y) \leq \hat{q}\}$$

satisfies:

$$\mathbb{P}(y_{n+1} \in C(x_{n+1})) \geq 1 - \alpha$$

Intuition

If the test point is exchangeable with the calibration data, its nonconformity score is equally likely to fall anywhere in the ranking. By choosing the threshold $\hat{q}$ as the right quantile, we guarantee coverage. No assumptions about the model or data distribution are needed.

Proof Sketch

By exchangeability, the rank of $s_{n+1}$ among $\{s_1, \ldots, s_{n+1}\}$ is uniformly distributed over $\{1, \ldots, n+1\}$. The probability that $s_{n+1}$ exceeds the $\lceil(1-\alpha)(n+1)\rceil$-th smallest score is at most $\alpha$. Therefore $s_{n+1} \leq \hat{q}$ with probability at least $1 - \alpha$.

Why It Matters

Conformal prediction is the only widely-used method that provides distribution-free, finite-sample coverage guarantees. It works with any model (neural networks, random forests, LLMs) and any data type. The guarantee holds without assuming the model is correct or well-calibrated.

Failure Mode

The marginal coverage guarantee does not ensure conditional coverage: the set may be too large for easy inputs and too small for hard inputs. Achieving approximate conditional coverage is an active research area. Also, the prediction sets can be large if the underlying model is poor.
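The split conformal recipe above fits in a dozen lines. This is a minimal sketch using the score $s(x, y) = 1 - \hat{p}_y(x)$ from the definition; the calibration scores and the softmax row for the test input are illustrative toy numbers.

```python
import math

# Hedged sketch of split conformal prediction. Calibration scores and
# test probabilities below are toy values, not from the text.

def conformal_threshold(scores, alpha):
    """The ceil((1 - alpha)(n + 1))-th smallest calibration score."""
    n = len(scores)
    k = math.ceil((1 - alpha) * (n + 1))
    if k > n:  # too few calibration points for this alpha: cover everything
        return float("inf")
    return sorted(scores)[k - 1]

def prediction_set(probs, qhat):
    """All labels whose nonconformity score 1 - p_y is within the threshold."""
    return [y for y, p in enumerate(probs) if 1 - p <= qhat]

# Scores s_i = 1 - p_{y_i}(x_i) on n = 9 held-out examples:
cal_scores = [0.05, 0.10, 0.15, 0.20, 0.30, 0.40, 0.55, 0.70, 0.90]
qhat = conformal_threshold(cal_scores, alpha=0.2)  # k = ceil(0.8 * 10) = 8
print(qhat)                                   # → 0.7
print(prediction_set([0.6, 0.3, 0.1], qhat))  # → [0, 1]
```

Note how the set's size tracks the model's quality: if the calibration scores were all close to 1 (a poor model), $\hat{q}$ would be near 1 and nearly every class would be included, exactly as the failure mode above warns.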

MC Dropout for Uncertainty

Monte Carlo (MC) dropout provides a practical approximation to Bayesian uncertainty. At inference time, run the model $T$ times with dropout enabled. The predictions $\{\hat{y}_1, \ldots, \hat{y}_T\}$ approximate samples from the posterior predictive distribution.

The mean gives a point prediction. The variance gives an uncertainty estimate:

$$\text{Var}_{\text{MC}}(x) = \frac{1}{T} \sum_{t=1}^{T} \hat{y}_t^2 - \left(\frac{1}{T} \sum_{t=1}^{T} \hat{y}_t\right)^2$$

High variance indicates the model is uncertain about the input. MC dropout is cheap (just run inference multiple times) but the quality of the uncertainty estimate depends on the dropout rate and network architecture.
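The procedure can be sketched with a deliberately tiny stand-in network. The one-hidden-layer ReLU model, its weights, the dropout rate, and the input below are all illustrative assumptions; the point is only that the dropout mask stays active across $T$ forward passes, and the sample variance is the uncertainty estimate.

```python
import random

# Toy MC-dropout sketch: a one-hidden-layer ReLU "network" whose dropout
# mask remains active at inference. All numbers here are assumptions.

def forward(x, weights, p_drop, rng):
    # Drop each hidden unit with prob p_drop; scale survivors by 1/(1 - p).
    hidden = [max(0.0, w * x) * ((rng.random() >= p_drop) / (1 - p_drop))
              for w in weights]
    return sum(hidden) / len(hidden)

def mc_dropout(x, weights, p_drop=0.5, T=200, seed=0):
    rng = random.Random(seed)
    samples = [forward(x, weights, p_drop, rng) for _ in range(T)]
    mean = sum(samples) / T
    var = sum(s * s for s in samples) / T - mean * mean  # Var_MC(x)
    return mean, var

mean, var = mc_dropout(x=1.0, weights=[0.2, 0.9, -0.4, 0.7])
print(var > 0.0)  # → True: spread across stochastic passes = uncertainty
```

In a real framework the same effect is achieved by keeping the dropout layers in training mode at inference while the rest of the network stays in evaluation mode.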

Common Confusions

Watch Out

Calibration is not accuracy

A model can be perfectly calibrated but have low accuracy (it just knows what it does not know). Conversely, a model can be highly accurate but poorly calibrated (it is always overconfident). Calibration and accuracy are independent properties. You want both.

Watch Out

Conformal prediction does not fix bad models

Conformal prediction guarantees coverage regardless of model quality, but the prediction sets will be large if the model is poor. A random classifier with conformal prediction will produce prediction sets that include almost all classes. The guarantee is real, but the sets are only useful if the model has reasonable discriminative power.

Watch Out

ECE depends on binning

The expected calibration error depends on the number of bins $M$ and the binning scheme. With too few bins, miscalibration is hidden. With too many bins, bins have few samples and the estimate is noisy. Kernel-based calibration error and classwise ECE are more robust alternatives.

Summary

  • Calibration means predicted probabilities match empirical frequencies
  • ECE measures calibration quality by comparing accuracy and confidence in bins
  • Temperature scaling is the simplest fix: divide logits by a learned scalar $T$ (typically $T > 1$ for overconfident networks)
  • Platt scaling fits a two-parameter sigmoid to the logits on a validation set
  • Conformal prediction gives distribution-free coverage: $\mathbb{P}(y \in C(x)) \geq 1 - \alpha$
  • MC dropout approximates Bayesian uncertainty by running inference with dropout multiple times
  • Calibration matters most in high-stakes settings: medicine, autonomy, finance

Exercises

ExerciseCore

Problem

A model makes 1000 predictions, each with confidence 0.9. Of these, 820 are correct. Is the model well-calibrated at the 90% confidence level? Compute the calibration gap for this bin.

ExerciseAdvanced

Problem

You have a calibration set of $n = 500$ examples and want conformal prediction sets with coverage $1 - \alpha = 0.95$. What quantile of the nonconformity scores do you use as the threshold? If you increase $n$ to 5000, how does the threshold change?

ExerciseResearch

Problem

Conformal prediction guarantees marginal coverage $\mathbb{P}(y \in C(x)) \geq 1 - \alpha$ but not conditional coverage $\mathbb{P}(y \in C(x) \mid x) \geq 1 - \alpha$. Construct an example where marginal coverage holds at 95% but conditional coverage fails badly for a specific subgroup.

References

Canonical:

  • Guo et al., "On Calibration of Modern Neural Networks" (ICML 2017)
  • Vovk, Gammerman, Shafer, Algorithmic Learning in a Random World (2005)

Current:

  • Angelopoulos & Bates, "Conformal Prediction: A Gentle Introduction" (2023)
  • Gal & Ghahramani, "Dropout as a Bayesian Approximation" (ICML 2016)

Last reviewed: April 2026
