Calibration and Uncertainty Quantification
When a model says 70% confidence, is it right 70% of the time? Calibration measures the alignment between predicted probabilities and actual outcomes. Temperature scaling, Platt scaling, conformal prediction, and MC dropout provide practical tools for trustworthy uncertainty.
Why This Matters
A model that reports 90% confidence but is correct only 60% of the time is dangerous. In medical diagnosis, autonomous driving, and financial risk assessment, decisions depend not just on predictions but on how much to trust those predictions. Calibration is the property that makes predicted probabilities meaningful.
Modern neural networks are often poorly calibrated: they tend to be overconfident. A ResNet trained on ImageNet may assign 95% confidence to predictions that are correct only 80% of the time. Post-hoc calibration methods fix this cheaply, and conformal prediction provides distribution-free coverage guarantees without any assumptions about the model. See also proper scoring rules for the theory of what makes a calibration metric valid.
Mental Model
Think of calibration as a contract between the model and the user. If the model says "I am 70% sure this is a cat," then across all images where the model says 70%, roughly 70% should actually be cats. If the model systematically overestimates or underestimates its confidence, the contract is broken.
Uncertainty quantification goes further: it asks not just "how confident?" but "what is the set of plausible answers?" Conformal prediction answers this by constructing prediction sets that are guaranteed to contain the true answer with a user-specified probability.
Formal Setup and Notation
Let $f : \mathcal{X} \to \Delta^{K-1}$ be a classifier that outputs a probability vector over $K$ classes. Let $\hat{y} = \arg\max_k f(x)_k$ be the predicted class and $\hat{p} = \max_k f(x)_k$ the associated confidence.
Perfect Calibration
A classifier is perfectly calibrated if for all confidence levels $p \in [0, 1]$:
$$\mathbb{P}(\hat{y} = y \mid \hat{p} = p) = p$$
That is, among all predictions made with confidence $p$, exactly a fraction $p$ are correct.
Expected Calibration Error (ECE)
Partition the $n$ predictions into $M$ bins $B_1, \dots, B_M$ by confidence level. The expected calibration error is:
$$\mathrm{ECE} = \sum_{m=1}^{M} \frac{|B_m|}{n} \left| \mathrm{acc}(B_m) - \mathrm{conf}(B_m) \right|$$
where $\mathrm{acc}(B_m)$ is the accuracy within bin $B_m$ and $\mathrm{conf}(B_m)$ is the average confidence within that bin.
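The binned estimator can be sketched in a few lines of NumPy (equal-width bins; the function and argument names here are my own, not from any particular library):

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=15):
    """Binned ECE: weighted mean |accuracy - confidence| over equal-width bins."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    n = len(confidences)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            acc = correct[in_bin].mean()        # accuracy within the bin
            conf = confidences[in_bin].mean()   # average confidence within the bin
            ece += (in_bin.sum() / n) * abs(acc - conf)
    return ece
```

For example, 1000 predictions all at confidence 0.9 with 820 correct land in a single bin and give an ECE of $|0.82 - 0.9| = 0.08$.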
Calibration Methods
Temperature Scaling
The simplest post-hoc calibration method. Given logits $z \in \mathbb{R}^K$, apply a single scalar temperature $T > 0$:
$$\hat{p}_k = \frac{\exp(z_k / T)}{\sum_{j=1}^{K} \exp(z_j / T)}$$
When $T > 1$, the softmax output becomes softer (less confident). When $T < 1$, it becomes sharper. The temperature $T$ is optimized on a held-out validation set by minimizing the negative log-likelihood. Temperature scaling does not change the predicted class (dividing by $T > 0$ preserves the ordering of the logits), only the confidence.
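A minimal sketch of fitting $T$. Since only one scalar is optimized, a grid search over the validation NLL stands in for a gradient-based minimizer here; names are illustrative:

```python
import numpy as np

def softmax(z, T=1.0):
    """Temperature-scaled softmax: T > 1 softens, T < 1 sharpens."""
    z = z / T
    z = z - z.max(axis=1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def fit_temperature(logits, labels):
    """Pick T minimizing held-out NLL over a grid (fine for a single scalar)."""
    def nll(T):
        p = softmax(logits, T)
        return -np.log(p[np.arange(len(labels)), labels] + 1e-12).mean()
    grid = np.linspace(0.05, 5.0, 200)
    return grid[np.argmin([nll(T) for T in grid])]
```

Note that `softmax(z, T).argmax(1)` equals `z.argmax(1)` for any $T > 0$: the fit changes confidences, never predictions.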
Platt Scaling
For binary classification, fit a logistic regression on the logits:
$$\hat{p} = \sigma(a z + b) = \frac{1}{1 + e^{-(a z + b)}}$$
where $z$ is the model's raw logit output and $a, b$ are learned on a validation set. This generalizes temperature scaling by adding a bias term.
Platt Scaling Calibration
Statement
If the true calibration function (mapping logits to correct-prediction probability) is a sigmoid, then Platt scaling with parameters $(a, b)$ fitted by maximum likelihood on the calibration set recovers the true calibration function as the calibration set size grows. The calibrated probabilities satisfy $\sigma(\hat{a} z + \hat{b}) \to \mathbb{P}(y = 1 \mid z)$ in probability.
Intuition
Platt scaling reparameterizes the model's confidence through a learned sigmoid. If the model's logits are monotonically related to true probability (which is common for well-trained models), a simple two-parameter fit corrects the miscalibration.
Why It Matters
Platt scaling is the standard calibration method for SVMs and is widely used as a baseline for neural network calibration. It requires only a small validation set and adds negligible computational cost.
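An illustrative NumPy fit of the two-parameter sigmoid by gradient descent on the log loss. (Platt's original recipe also smooths the 0/1 targets to avoid overfitting small calibration sets; that refinement is omitted here.)

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def fit_platt(logits, labels, lr=0.1, steps=3000):
    """Fit p = sigmoid(a*z + b) by gradient descent on the logistic NLL.

    logits: (n,) raw binary logits; labels: (n,) in {0, 1}.
    """
    a, b = 1.0, 0.0
    for _ in range(steps):
        p = sigmoid(a * logits + b)
        g = p - labels                     # d(NLL)/d(a*z + b), per example
        a -= lr * (g * logits).mean()      # gradient step on the slope
        b -= lr * g.mean()                 # gradient step on the bias
    return a, b
```

The loss is convex in $(a, b)$, so plain gradient descent converges; in practice any logistic-regression solver on the single feature $z$ does the same job.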
Conformal Prediction
Conformal prediction is a structurally different approach: instead of calibrating point predictions, it constructs prediction sets with guaranteed coverage.
Nonconformity Score
A nonconformity score $s(x, y)$ measures how unusual the pair $(x, y)$ is relative to the model. A common choice for classification:
$$s(x, y) = 1 - f(x)_y$$
where $f(x)_y$ is the model's predicted probability for class $y$. High scores mean the model finds this label surprising.
Split Conformal Coverage Guarantee
Statement
Given calibration data $(x_1, y_1), \dots, (x_n, y_n)$ and a new test point $(x_{n+1}, y_{n+1})$, all exchangeable. Compute nonconformity scores $s_i = s(x_i, y_i)$ for $i = 1, \dots, n$. Let $\hat{q}$ be the $\lceil (n+1)(1-\alpha) \rceil / n$ empirical quantile of $s_1, \dots, s_n$ (the $\lceil (n+1)(1-\alpha) \rceil$-th smallest score). Then the prediction set:
$$C(x_{n+1}) = \{ y : s(x_{n+1}, y) \le \hat{q} \}$$
satisfies:
$$\mathbb{P}\big(y_{n+1} \in C(x_{n+1})\big) \ge 1 - \alpha$$
Intuition
If the test point is exchangeable with the calibration data, its nonconformity score is equally likely to fall anywhere in the ranking. By choosing the threshold as the right quantile, we guarantee coverage. No assumptions about the model or data distribution are needed.
Proof Sketch
By exchangeability, the rank of $s_{n+1}$ among $s_1, \dots, s_{n+1}$ is uniformly distributed over $\{1, \dots, n+1\}$. The probability that $s_{n+1}$ exceeds the $\lceil (n+1)(1-\alpha) \rceil$-th smallest score is at most $\alpha$. Therefore $s_{n+1} \le \hat{q}$, and hence $y_{n+1} \in C(x_{n+1})$, with probability at least $1 - \alpha$.
Why It Matters
Conformal prediction is the only widely-used method that provides distribution-free, finite-sample coverage guarantees. It works with any model (neural networks, random forests, LLMs) and any data type. The guarantee holds without assuming the model is correct or well-calibrated.
Failure Mode
The marginal coverage guarantee does not ensure conditional coverage: the set may be too large for easy inputs and too small for hard inputs. Achieving approximate conditional coverage is an active research area. Also, the prediction sets can be large if the underlying model is poor.
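The split conformal recipe can be sketched in a few lines, using the score $s(x, y) = 1 - f(x)_y$ from the nonconformity section (function names are mine; any model producing class probabilities plugs in):

```python
import numpy as np

def conformal_sets(cal_probs, cal_labels, test_probs, alpha=0.1):
    """Split conformal prediction with score s(x, y) = 1 - f(x)_y.

    cal_probs:  (n, K) model probabilities on the calibration set
    cal_labels: (n,)   true calibration labels
    test_probs: (m, K) model probabilities on test points
    Returns a boolean (m, K) matrix: entry [i, y] is True iff class y
    is in the prediction set for test point i.
    """
    n = len(cal_labels)
    scores = 1.0 - cal_probs[np.arange(n), cal_labels]
    # threshold = ceil((n+1)(1-alpha))-th smallest calibration score
    k = int(np.ceil((n + 1) * (1 - alpha)))
    qhat = np.sort(scores)[min(k, n) - 1]
    return (1.0 - test_probs) <= qhat
```

Nothing about the model enters the guarantee; a worse model simply pushes $\hat{q}$ up and the sets grow.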
MC Dropout for Uncertainty
Monte Carlo (MC) dropout provides a practical approximation to Bayesian uncertainty. At inference time, run the model $T$ times with dropout enabled. The predictions $\hat{y}_1, \dots, \hat{y}_T$ approximate samples from the posterior predictive distribution.
The mean gives a point prediction. The variance gives an uncertainty estimate:
$$\bar{y} = \frac{1}{T} \sum_{t=1}^{T} \hat{y}_t, \qquad \hat{\sigma}^2 = \frac{1}{T} \sum_{t=1}^{T} (\hat{y}_t - \bar{y})^2$$
High variance indicates the model is uncertain about the input. MC dropout is cheap (just run inference multiple times) but the quality of the uncertainty estimate depends on the dropout rate and network architecture.
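A toy illustration with a one-hidden-layer regressor in NumPy, keeping dropout active at inference. The weights and shapes here are made up for the sketch; in practice you would run a trained network in its stochastic (dropout-on) mode:

```python
import numpy as np

def mc_dropout_predict(x, W1, W2, rate=0.5, T=200, seed=0):
    """Toy one-hidden-layer regressor with dropout kept ON at test time.

    Runs T stochastic forward passes and returns (mean, std) of the outputs.
    """
    rng = np.random.default_rng(seed)
    outs = []
    for _ in range(T):
        h = np.maximum(0.0, x @ W1)             # ReLU hidden layer
        mask = rng.random(h.shape) >= rate      # drop each unit w.p. `rate`
        h = h * mask / (1.0 - rate)             # inverted-dropout rescaling
        outs.append(h @ W2)
    outs = np.stack(outs)
    return outs.mean(axis=0), outs.std(axis=0)
```

With `rate=0` the mask never drops anything and the std collapses to zero, which makes the source of the uncertainty estimate explicit: it is the spread induced by the dropout noise.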
Common Confusions
Calibration is not accuracy
A model can be perfectly calibrated but have low accuracy (it just knows what it does not know). Conversely, a model can be highly accurate but poorly calibrated (it is always overconfident). Calibration and accuracy are independent properties. You want both.
Conformal prediction does not fix bad models
Conformal prediction guarantees coverage regardless of model quality, but the prediction sets will be large if the model is poor. A random classifier with conformal prediction will produce prediction sets that include almost all classes. The guarantee is real, but the sets are only useful if the model has reasonable discriminative power.
ECE depends on binning
The expected calibration error depends on the number of bins and the binning scheme. With too few bins, miscalibration is hidden. With too many bins, bins have few samples and the estimate is noisy. Kernel-based calibration error and classwise ECE are more robust alternatives.
Summary
- Calibration means predicted probabilities match empirical frequencies
- ECE measures calibration quality by comparing accuracy and confidence in bins
- Temperature scaling is the simplest fix: divide logits by a learned $T$
- Platt scaling fits a two-parameter sigmoid to the logits on a validation set
- Conformal prediction gives distribution-free coverage: $\mathbb{P}(y \in C(x)) \ge 1 - \alpha$
- MC dropout approximates Bayesian uncertainty by running inference with dropout multiple times
- Calibration matters most in high-stakes settings: medicine, autonomy, finance
Exercises
Problem
A model makes 1000 predictions, each with confidence 0.9. Of these, 820 are correct. Is the model well-calibrated at the 90% confidence level? Compute the calibration gap for this bin.
Problem
You have a calibration set of $n$ examples and want conformal prediction sets with coverage $1 - \alpha$. What quantile of the nonconformity scores do you use as the threshold? If you increase $n$ to 5000, how does the threshold change?
Problem
Conformal prediction guarantees marginal coverage $\mathbb{P}(y \in C(x)) \ge 1 - \alpha$ but not conditional coverage $\mathbb{P}(y \in C(x) \mid x) \ge 1 - \alpha$. Construct an example where marginal coverage holds at 95% but conditional coverage fails badly for a specific subgroup.
References
Canonical:
- Guo et al., "On Calibration of Modern Neural Networks" (ICML 2017)
- Vovk, Gammerman, Shafer, Algorithmic Learning in a Random World (2005)
Current:
- Angelopoulos & Bates, "Conformal Prediction: A Gentle Introduction" (2023)
- Gal & Ghahramani, "Dropout as a Bayesian Approximation" (ICML 2016)
Next Topics
The natural next steps from calibration and uncertainty:
- Red-teaming and adversarial evaluation: testing whether models fail gracefully when calibration and uncertainty estimates are pushed to their limits
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Logistic Regression (Layer 1)
- Maximum Likelihood Estimation (Layer 0B)
- Common Probability Distributions (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
- Differentiation in ℝⁿ (Layer 0A)
- Convex Optimization Basics (Layer 1)
- Matrix Operations and Properties (Layer 0A)