Predictive Uncertainty
Split Conformal Prediction
A distribution-free, model-agnostic procedure that converts any point predictor into a prediction set with finite-sample marginal coverage. The only assumption is exchangeability. The proof is five lines.
Why This Matters
A supervised learning algorithm trained on $(X_1, Y_1), \dots, (X_n, Y_n)$ produces a function $\hat f$ that maps inputs to predictions. For any new input $x$, the algorithm returns $\hat f(x)$ as its best guess at $y$. This is enough for a benchmark leaderboard. It is not enough for a decision.
A medical triage model that predicts "low risk" with 60% confidence should behave differently than one predicting "low risk" with 99% confidence. A credit model that returns a point estimate of default probability tells a loan officer nothing about the range of plausible outcomes. A weather forecast of "15 degrees tomorrow" is less useful than "between 12 and 18 degrees tomorrow, with 90% probability."
The traditional response is to build a parametric model, derive a predictive distribution, and report a prediction interval. This works when the model is correctly specified. Outside that regime, which includes most applications of modern ML, parametric prediction intervals have no meaningful coverage guarantee: a Bayesian neural network can report a 95% credible interval that contains the truth 60% of the time.
Split conformal prediction takes any predictor, treats it as a black box, and wraps it in a procedure that delivers valid prediction sets with a finite-sample, distribution-free guarantee. The only assumption is that the data points are exchangeable. The entire argument fits on half a page.
Formal Setup
Let $(X_i, Y_i)$ for $i = 1, \dots, n+1$ be random variables taking values in $\mathcal{X} \times \mathcal{Y}$. We observe the first $n$ pairs and the feature $X_{n+1}$ of a test point; the label $Y_{n+1}$ is to be predicted. Fix a user-specified miscoverage level $\alpha \in (0, 1)$, typically $\alpha = 0.1$ for 90% coverage or $\alpha = 0.05$ for 95%.
The goal is a set-valued function $\hat C : \mathcal{X} \to 2^{\mathcal{Y}}$ such that
$$\mathbb{P}\big(Y_{n+1} \in \hat C(X_{n+1})\big) \ge 1 - \alpha,$$
where the probability is over the joint distribution of all $n + 1$ points. We want this to hold with no assumptions on that joint distribution beyond exchangeability, and to hold in finite samples rather than asymptotically.
Exchangeability: The Load-Bearing Assumption
Exchangeability
A sequence of random variables $Z_1, \dots, Z_n$ is exchangeable if its joint distribution is invariant under permutation: for every permutation $\sigma$ of $\{1, \dots, n\}$,
$$(Z_{\sigma(1)}, \dots, Z_{\sigma(n)}) \overset{d}{=} (Z_1, \dots, Z_n).$$
Exchangeability is strictly weaker than i.i.d. Every i.i.d. sequence is exchangeable. Not every exchangeable sequence is i.i.d.: de Finetti's theorem shows that infinite exchangeable sequences are mixtures of i.i.d. sequences, which means exchangeability permits a latent common parameter.
For conformal prediction, exchangeability is the load-bearing column. Everything else can be relaxed. The features can be arbitrarily high-dimensional. The response can be continuous, categorical, or structured. The underlying predictor can be a linear regression, a random forest, a deep neural network, or a large language model. No distributional form is assumed.
When exchangeability fails, conformal prediction fails in a precise and quantifiable way. That is the subject of the weighted conformal prediction page. Here we assume it holds.
The Split Construction
Partition the available data into two disjoint pieces. Given $n$ labelled points, choose a split size $m < n$ and use
$$\mathcal{D}_{\text{train}} = \{(X_i, Y_i) : i = 1, \dots, m\}$$
to fit the predictor $\hat f$. The remaining $n_{\text{cal}} = n - m$ points form the calibration set
$$\mathcal{D}_{\text{cal}} = \{(X_i, Y_i) : i = m + 1, \dots, n\}.$$
The predictor is treated as fixed after training. The calibration set is never used for fitting. This separation is the reason the coverage proof goes through in a few lines.
Nonconformity Score
A nonconformity score is a function $s : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$ that measures how unusual a candidate pair $(x, y)$ looks relative to the fitted predictor. Larger values mean more unusual. In regression the canonical choice is the absolute residual
$$s(x, y) = |y - \hat f(x)|.$$
In classification with softmax outputs $\hat\pi(x)$, a common score is $s(x, y) = 1 - \hat\pi_y(x)$. The coverage guarantee holds for any measurable $s$. Better scores produce tighter sets; they do not change the validity proof.
The score must be defined before the calibration set is inspected. It can depend on $\hat f$, which was fit on training data, but it cannot be tuned using calibration data. Violating this rule breaks the argument.
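As a concrete sketch, the two scores above can be written as plain functions. Here `f_hat` and `pi_hat` are assumed already-fitted predictors (both names are illustrative), returning a point prediction and a class-probability vector respectively:

```python
import numpy as np

def residual_score(f_hat, x, y):
    """Absolute-residual nonconformity score for regression."""
    return np.abs(y - f_hat(x))

def softmax_score(pi_hat, x, y):
    """One minus the softmax mass on the candidate class y."""
    return 1.0 - pi_hat(x)[y]
```

Both are fixed functions of the frozen predictor, so neither touches calibration data.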
The Quantile Construction
Compute the score on every calibration point: $s_i = s(X_i, Y_i)$ for $i = m + 1, \dots, n$. Let
$$\hat q = \text{the } \big\lceil (n_{\text{cal}} + 1)(1 - \alpha) \big\rceil\text{-th smallest of } s_{m+1}, \dots, s_n.$$
The prediction set is
$$\hat C(x) = \{\, y : s(x, y) \le \hat q \,\}.$$
For the absolute-residual score this set is the interval $[\hat f(x) - \hat q,\; \hat f(x) + \hat q]$ of constant width $2\hat q$.
The slightly odd quantile level $\lceil (n_{\text{cal}} + 1)(1 - \alpha) \rceil / n_{\text{cal}}$ rather than the natural $1 - \alpha$ is the single technical detail that turns the finite-sample guarantee from approximate to exact. We are computing a quantile over $n_{\text{cal}} + 1$ objects (the calibration scores plus the unknown test score), and the ceiling adjustment accounts for the discreteness of empirical quantiles.
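A minimal sketch of the calibration step, assuming the scores are already computed as an array. The `np.inf` branch handles the degenerate case where the ceiling rank exceeds $n_{\text{cal}}$ and the set must be all of $\mathcal{Y}$:

```python
import numpy as np

def conformal_quantile(cal_scores, alpha):
    """q_hat = the ceil((n+1)(1-alpha))-th smallest calibration score."""
    n = len(cal_scores)
    k = int(np.ceil((n + 1) * (1 - alpha)))  # rank of the order statistic
    if k > n:
        return np.inf  # too few calibration points: cover everything
    return np.sort(cal_scores)[k - 1]

def prediction_interval(f_hat_x, q_hat):
    """Constant-width interval for the absolute-residual score."""
    return (f_hat_x - q_hat, f_hat_x + q_hat)
```

Note the index `k - 1`: this is an order statistic of the raw scores, not an interpolated quantile.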
Main Theorem
Split Conformal Coverage
Statement
Let $\hat C$ be the split conformal prediction set at miscoverage level $\alpha$. Then
$$\mathbb{P}\big(Y_{n+1} \in \hat C(X_{n+1})\big) \ge 1 - \alpha.$$
If the scores have no ties almost surely, the coverage is also upper-bounded:
$$\mathbb{P}\big(Y_{n+1} \in \hat C(X_{n+1})\big) \le 1 - \alpha + \frac{1}{n_{\text{cal}} + 1}.$$
Intuition
By exchangeability, the test score is just as likely to fall at any rank among the $n_{\text{cal}} + 1$ scores as any other. Its rank is uniform on $\{1, \dots, n_{\text{cal}} + 1\}$. Choosing $\hat q$ at the $\lceil (n_{\text{cal}} + 1)(1 - \alpha) \rceil$-th order statistic catches the test score with probability at least $1 - \alpha$, mechanically.
Proof Sketch
Let $s_i = s(X_i, Y_i)$ for $i = m + 1, \dots, n + 1$. Because $\hat f$ was trained on $\mathcal{D}_{\text{train}}$ only, the map $(x, y) \mapsto s(x, y)$ is a fixed measurable function of the calibration set and the test point taken together. Exchangeability of the calibration and test points transfers to exchangeability of the scores $s_{m+1}, \dots, s_{n+1}$.
The rank of $s_{n+1}$ among these $n_{\text{cal}} + 1$ exchangeable values is uniform on $\{1, \dots, n_{\text{cal}} + 1\}$ (up to ties). For any integer $k \le n_{\text{cal}}$,
$$\mathbb{P}\big(s_{n+1} \le s_{(k)}\big) \ge \frac{k}{n_{\text{cal}} + 1},$$
where $s_{(k)}$ is the $k$-th order statistic of the calibration scores. Setting $k = \lceil (n_{\text{cal}} + 1)(1 - \alpha) \rceil$ gives $k / (n_{\text{cal}} + 1) \ge 1 - \alpha$. The event $\{ s_{n+1} \le \hat q \}$ is exactly $\{ Y_{n+1} \in \hat C(X_{n+1}) \}$.
The upper bound in the no-ties case uses the same rank uniformity: the probability is at most $k / (n_{\text{cal}} + 1) \le 1 - \alpha + 1/(n_{\text{cal}} + 1)$.
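The theorem is easy to check empirically. The sketch below uses synthetic Gaussian data and, for simplicity, a fixed identity "predictor" (so no training split is needed); all numbers are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, n_cal, trials = 0.1, 99, 2000
covered = 0
for _ in range(trials):
    # Data: y = x + standard normal noise; the predictor is the identity map.
    x_cal = rng.normal(size=n_cal)
    y_cal = x_cal + rng.normal(size=n_cal)
    scores = np.abs(y_cal - x_cal)               # absolute residuals
    k = int(np.ceil((n_cal + 1) * (1 - alpha)))  # ceil(100 * 0.9) = 90
    q_hat = np.sort(scores)[k - 1]
    x_test = rng.normal()
    y_test = x_test + rng.normal()
    covered += abs(y_test - x_test) <= q_hat
print(covered / trials)  # close to the nominal 0.90
```

With $n_{\text{cal}} = 99$ and continuous scores, the exact marginal coverage is $k/(n_{\text{cal}}+1) = 90/100$, so the Monte Carlo estimate should hover at 0.90 up to simulation noise.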
Why It Matters
The proof uses nothing about the distribution of or , the function class of , or the dimension of the feature space. It uses exchangeability and the definition of the empirical quantile. That is the entire content of the theorem. Any predictor you can call, conformal prediction can wrap.
Failure Mode
The guarantee is marginal coverage, averaged over $X_{n+1}$. It does not promise coverage at a specific $x$, and distribution-free conditional coverage is impossible in a strong sense (see the next theorem). The guarantee also fails the moment exchangeability fails, which happens automatically under any nontrivial covariate shift or temporal drift.
Conformalized Quantile Regression
The absolute-residual score gives prediction intervals of constant width. That is wasteful when the true conditional variance of $Y$ given $X$ varies across the feature space. Conformalized quantile regression (CQR) fixes this.
Train two quantile regressors $\hat q_{\alpha/2}$ and $\hat q_{1-\alpha/2}$ on the training set, targeting the lower and upper conditional quantiles. Define the score as
$$s(x, y) = \max\big\{ \hat q_{\alpha/2}(x) - y,\; y - \hat q_{1-\alpha/2}(x) \big\},$$
which is positive exactly when $y$ falls outside the predicted quantile interval and measures how far. Apply the split conformal procedure to this score. The resulting prediction set is
$$\hat C(x) = \big[\, \hat q_{\alpha/2}(x) - \hat q,\; \hat q_{1-\alpha/2}(x) + \hat q \,\big],$$
an interval of adaptive width. Coverage is still guaranteed by the same exchangeability argument, because the score is just another measurable function. In practice CQR is now the default choice in applied regression settings.
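A sketch of the CQR score and interval, assuming `q_lo` and `q_hi` are already-fitted lower and upper quantile regressors (hypothetical callables standing in for any quantile-regression model):

```python
def cqr_score(q_lo, q_hi, x, y):
    """CQR nonconformity: positive iff y falls outside [q_lo(x), q_hi(x)],
    and equal to the distance by which it misses."""
    return max(q_lo(x) - y, y - q_hi(x))

def cqr_interval(q_lo, q_hi, x, q_hat):
    """Adaptive-width interval: the quantile band widened by q_hat
    (or shrunk, if q_hat is negative)."""
    return (q_lo(x) - q_hat, q_hi(x) + q_hat)
```

The calibration of `q_hat` is unchanged: it is the same ceiling order statistic, just computed over these scores.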
Classification: Adaptive Prediction Sets
The naive classification score $1 - \hat\pi_y(x)$ produces prediction sets that are too small on easy examples and too large on hard ones. Adaptive Prediction Sets (APS) use the cumulative mass of classes ranked at least as likely as $y$:
$$s(x, y) = \sum_{y' :\, \hat\pi_{y'}(x) \ge \hat\pi_y(x)} \hat\pi_{y'}(x).$$
Sets built from this score expand or contract with example difficulty. Regularized APS adds a penalty that discourages runaway set sizes on the hardest inputs; it is often the right default for classification with many classes.
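The APS score is a two-line computation given a probability vector; a minimal sketch:

```python
import numpy as np

def aps_score(probs, y):
    """Cumulative mass of all classes at least as likely as class y."""
    return probs[probs >= probs[y]].sum()
```

On a confident example the true class sits near the top of the ranking and the score is small; on an ambiguous example the cumulative mass grows, so the calibrated set expands.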
A Three-Point Example
Let the three calibration residuals be $s_{(1)} \le s_{(2)} \le s_{(3)}$, and take $\alpha = 0.25$. The ceiling rank is $\lceil (3 + 1)(1 - 0.25) \rceil = 3$, so $\hat q = s_{(3)}$.
By rank uniformity, the test residual has rank uniformly distributed on $\{1, 2, 3, 4\}$, so the probability it falls at or below $s_{(3)}$ is $3/4$ exactly (absent ties). The prediction interval $\hat f(x) \pm \hat q$ covers with probability $3/4 = 1 - \alpha$.
Notice the quantile is the largest of the three calibration residuals, not an interpolation. For small $n_{\text{cal}}$ the ceiling rule is binding, and coverage can sit noticeably above $1 - \alpha$. For $n_{\text{cal}}$ in the hundreds the slack $1/(n_{\text{cal}} + 1)$ is under one percent.
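The arithmetic of the example, with three illustrative residual values plugged in:

```python
import numpy as np

scores = np.array([1.0, 2.0, 3.0])  # illustrative calibration residuals
alpha = 0.25
k = int(np.ceil((len(scores) + 1) * (1 - alpha)))  # ceil(4 * 0.75) = 3
q_hat = np.sort(scores)[k - 1]
print(k, q_hat)  # rank 3, q_hat = 3.0: the largest residual, no interpolation
```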
What Split Conformal Does Not Give You
Marginal coverage is weaker than conditional coverage. The guarantee
$$\mathbb{P}\big(Y_{n+1} \in \hat C(X_{n+1})\big) \ge 1 - \alpha$$
averages over the randomness in $X_{n+1}$. It does not guarantee
$$\mathbb{P}\big(Y_{n+1} \in \hat C(X_{n+1}) \,\big|\, X_{n+1} = x\big) \ge 1 - \alpha$$
for every $x$. A split conformal predictor can have 90% marginal coverage while delivering 50% coverage on one subpopulation and 100% on another. This is not a bug of the construction. It is the price of making no assumptions.
Conditional Coverage Is Distribution-Free Impossible
Statement
Let the marginal distribution of $X$ be absolutely continuous on $\mathbb{R}^d$. Any prediction set procedure $\hat C$ that achieves distribution-free conditional coverage
$$\mathbb{P}\big(Y_{n+1} \in \hat C(x) \,\big|\, X_{n+1} = x\big) \ge 1 - \alpha$$
for every $x$ (uniformly in the distribution) must have infinite expected Lebesgue measure: $\mathbb{E}\big[\lambda\big(\hat C(X_{n+1})\big)\big] = \infty$.
Intuition
A procedure that must cover at every $x$, without any smoothness or structural assumption tying nearby values together, cannot borrow strength across the feature space. At a point never seen in training the procedure has no information, so to cover under every possible distribution it must return essentially the whole response space. Averaging such sets gives infinite expected size.
Why It Matters
Useful conditional coverage is purchasable only by imposing structural assumptions (smoothness, parametric form, localizability) or by relaxing the target (coverage over bands or groups rather than every point). Conformal prediction is honest about this. Parametric prediction intervals that claim conditional coverage implicitly rely on the model being correct; conformal does not make that claim in the first place.
Split conformal also makes no claim about set-size optimality. The intervals it produces have valid coverage but may be wider than necessary. The width depends entirely on the quality of the underlying predictor. Conformal prediction is a coverage-calibration tool, not a prediction-improvement tool.
Common Confusions
Conformal does not improve your predictor
Conformal prediction wraps a fitted model. If the predictor is poor, the conformal interval is wide. If the predictor is good, the interval is narrow. Either way the coverage probability is correct. Do not expect conformal calibration to recover signal the underlying model missed.
Marginal coverage is not conditional coverage
A 90% marginal guarantee says that, averaged over $X_{n+1}$, the interval covers ninety percent of the time. It says nothing about any particular subgroup. Split conformal can give 99% coverage on the majority subgroup and 50% on a minority subgroup, and still satisfy the 90% marginal guarantee. For group-level coverage you need group-conditional conformal or Mondrian conformal, which apply the construction within each group.
The test point enters the quantile
Many first implementations compute the $(1 - \alpha)$-quantile of the calibration scores and stop there. The correct level is $\lceil (n_{\text{cal}} + 1)(1 - \alpha) \rceil / n_{\text{cal}}$, which accounts for the test point being one of the $n_{\text{cal}} + 1$ exchangeable scores. Using $1 - \alpha$ directly undercovers slightly for small $n_{\text{cal}}$.
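The gap is easy to see numerically. With nine illustrative residuals, the naive `np.quantile` call interpolates below the correct order statistic:

```python
import numpy as np

n_cal, alpha = 9, 0.1
scores = np.arange(1, n_cal + 1, dtype=float)  # residuals 1..9, illustrative

k_adj = int(np.ceil((n_cal + 1) * (1 - alpha)))  # ceil(10 * 0.9) = 9
q_adj = np.sort(scores)[k_adj - 1]               # the 9th smallest: 9.0

# The shortcut: empirical (1 - alpha)-quantile of the n_cal scores alone,
# which np.quantile computes by linear interpolation.
q_naive = np.quantile(scores, 1 - alpha)         # 8.2

print(q_adj, q_naive)  # the naive threshold is smaller: slight undercoverage
```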
Split conformal uses half the data
Splitting sacrifices statistical efficiency: the predictor sees only $m$ training points and the calibration set contains only $n - m$. Full conformal prediction uses all $n$ points for both roles by refitting the model once per candidate label, at a computational cost that is prohibitive for most models. Jackknife+ and CV+ recover most of the efficiency with $n$ or $K$ refits respectively, and are worth knowing as intermediate options.
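A sketch of jackknife+, under the assumption that `fit(X, y)` is any training routine returning a predict callable (the name and signature are illustrative):

```python
import numpy as np

def jackknife_plus_interval(X, y, x_test, fit, alpha):
    """Jackknife+ interval from n leave-one-out refits.
    Assumes the ceiling rank k stays <= n (alpha not too small)."""
    n = len(y)
    lo, hi = np.empty(n), np.empty(n)
    for i in range(n):
        mask = np.arange(n) != i
        f_i = fit(X[mask], y[mask])      # model fit without point i
        r_i = abs(y[i] - f_i(X[i]))      # leave-one-out residual
        mu = f_i(x_test)
        lo[i], hi[i] = mu - r_i, mu + r_i
    k = int(np.ceil((1 - alpha) * (n + 1)))
    return np.sort(lo)[n - k], np.sort(hi)[k - 1]
```

Every point plays both roles, training and calibration, at the cost of $n$ refits; the endpoints are order statistics of the leave-one-out interval edges rather than of a single residual pool.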
Forward Connection
The exchangeability assumption is where split conformal begins to strain in practice. Training data collected six months ago and test data collected today are rarely exchangeable. Models deployed across different populations face covariate shift by design. The natural extension is to weight the calibration points by the likelihood ratio between the test and training distributions, recovering coverage at the cost of estimating that ratio. That is the subject of the weighted conformal prediction page.
An orthogonal extension brings in sequential inference. Conformal prediction as stated holds at a fixed sample size. If an analyst peeks at coverage as more data arrives and stops when convenient, the marginal guarantee breaks. The fix uses e-values and test martingales, developed on the e-values and anytime-valid inference page.
Proofs to Rederive by Hand
Reproducing these proofs from scratch is how you stop recognizing and start owning. Target: reproduce each in a few minutes after a week.
Split conformal coverage (the five-line argument). The rank-uniformity step is the template for nearly every conformal result. Start from "$s_{n+1}$ has uniform rank among the $n_{\text{cal}} + 1$ exchangeable scores," convert that to $\mathbb{P}(s_{n+1} \le s_{(k)}) \ge k/(n_{\text{cal}} + 1)$, and pick $k = \lceil (n_{\text{cal}} + 1)(1 - \alpha) \rceil$. Know why the ceiling is there, not just that it is. After a week, target a clean reproduction in under three minutes.
Conditional coverage impossibility (the no-free-lunch construction). Build two distributions that agree on a training set but differ on a thin slice of $\mathcal{X}$-space. Any procedure with uniform coverage across distributions must cover on both, so at any $x$ in that slice the set must be essentially all of $\mathcal{Y}$. Average over $X$ to get infinite expected Lebesgue measure. Writing this out makes precise why distribution-free conditional coverage costs the entire real line, and why parametric prediction intervals are paying that cost with unexamined assumptions instead.
Exercises
Problem
Let the calibration residuals be the integers $1, 2, \dots, n_{\text{cal}}$, and take a miscoverage level $\alpha$ of your choice. Compute $\hat q$ and the resulting prediction interval width $2\hat q$.
Problem
A colleague implements split conformal using the unadjusted $(1 - \alpha)$-quantile of the calibration residuals. As a function of $n_{\text{cal}}$ and $\alpha$, quantify the worst-case undercoverage gap introduced by this shortcut.
Problem
Construct a joint distribution on $\mathcal{X} \times \mathcal{Y}$ with a majority subgroup and a minority subgroup, a perfectly accurate predictor on the majority subgroup, and a completely uninformative predictor on the minority subgroup, such that split conformal achieves $1 - \alpha$ marginal coverage but conditional coverage on the minority subgroup is far below $1 - \alpha$. Explain what this implies for auditing conformal deployments.
Problem
Jackknife+ (Barber, Candès, Ramdas, Tibshirani 2021) recovers most of the statistical efficiency lost by splitting, at the cost of $n$ leave-one-out refits rather than full conformal's refit per candidate label. State the jackknife+ prediction set construction and identify the weakened coverage guarantee relative to split conformal. Under what predictor-stability condition is the guarantee strengthened back to $1 - \alpha$?
Open Problems and Frontier
Distribution-free conditional coverage under minimal structural assumptions is open. Current partial results require smoothness or compactness conditions that rarely hold in high dimensions. The question of whether some intermediate notion between marginal and conditional coverage can be achieved distribution-free remains active.
Conformal prediction for dependent data (time series, spatial) requires either explicit modelling of the dependence or the nonexchangeable framework of Barber, Candès, Ramdas, Tibshirani (2023). Neither gives guarantees as clean as the i.i.d. case.
Computational shortcuts for full conformal prediction, which split conformal approximates by sacrificing data to the split, are an ongoing line of work. Jackknife+ and CV+ partially close the efficiency gap; influence-function-based approximations are the current frontier.
Anytime-valid conformal prediction using e-values is an active direction. The standard procedure has a guarantee at a single fixed sample size. Extending to online settings where the analyst peeks at coverage as more data arrives requires the e-value machinery covered in a separate page.
References
Canonical:
- Vovk, Gammerman, Shafer, Algorithmic Learning in a Random World (Springer, 2005). Chapters 2-3. The original book-length treatment.
- Shafer, Vovk, "A Tutorial on Conformal Prediction." Journal of Machine Learning Research 9 (2008), 371-421.
Modern pedagogical:
- Angelopoulos, Bates, "A Gentle Introduction to Conformal Prediction and Distribution-Free Uncertainty Quantification." Foundations and Trends in Machine Learning 16(4) (2023), 494-591. The best single reference.
- Lei, G'Sell, Rinaldo, Tibshirani, Wasserman, "Distribution-Free Predictive Inference for Regression." Journal of the American Statistical Association 113(523) (2018), 1094-1111.
Adaptive methods:
- Romano, Patterson, Candès, "Conformalized Quantile Regression." NeurIPS 2019.
- Romano, Sesia, Candès, "Classification with Valid and Adaptive Coverage." NeurIPS 2020.
Limits and extensions:
- Barber, Candès, Ramdas, Tibshirani, "The Limits of Distribution-Free Conditional Predictive Inference." Information and Inference 10(2) (2021), 455-482.
- Barber, Candès, Ramdas, Tibshirani, "Predictive Inference with the Jackknife+." Annals of Statistics 49(1) (2021), 486-507.
- Barber, Candès, Ramdas, Tibshirani, "Conformal Prediction Beyond Exchangeability." Annals of Statistics 51(2) (2023), 816-845.
Next Topics
- Weighted conformal prediction: recover coverage under known or estimated covariate shift.
- E-values and anytime-valid inference: coverage under sequential peeking.
- Double/debiased machine learning: the causal-inference companion to distribution-free prediction.
- Calibration and uncertainty: Platt scaling, isotonic regression, and why conformal is a different object.
Last reviewed: April 24, 2026
Prerequisites
Foundations this topic depends on.
- Order Statistics (Layer 1)
- Common Probability Distributions (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
- Hypothesis Testing for ML (Layer 2)
- Cross-Validation Theory (Layer 2)
- Empirical Risk Minimization (Layer 2)
- Concentration Inequalities (Layer 1)
- Expectation, Variance, Covariance, and Moments (Layer 0A)
- Random Variables (Layer 0A)
- Kolmogorov Probability Axioms (Layer 0A)
- Bias-Variance Tradeoff (Layer 2)