

Split Conformal Prediction

A distribution-free, model-agnostic procedure that converts any point predictor into a prediction set with finite-sample marginal coverage. The only assumption is exchangeability. The proof is five lines.


Why This Matters

A supervised learning algorithm trained on $(X_1, Y_1), \ldots, (X_n, Y_n)$ produces a function $\hat{f}$ that maps inputs to predictions. For any new input $X_{n+1}$, the algorithm returns $\hat{f}(X_{n+1})$ as its best guess at $Y_{n+1}$. This is enough for a benchmark leaderboard. It is not enough for a decision.

A medical triage model that predicts "low risk" with 60% confidence should behave differently than one predicting "low risk" with 99% confidence. A credit model that returns a point estimate of default probability tells a loan officer nothing about the range of plausible outcomes. A weather forecast of "15 degrees tomorrow" is less useful than "between 12 and 18 degrees tomorrow, with 90% probability."

The traditional response is to build a parametric model, derive a predictive distribution, and report a prediction interval. This works when the model is correctly specified. Outside that regime, which includes most applications of modern ML, parametric prediction intervals have no meaningful coverage guarantee: a Bayesian neural network can report a 95% credible interval that contains the truth 60% of the time.

Split conformal prediction takes any predictor, treats it as a black box, and wraps it in a procedure that delivers valid prediction sets with a finite-sample, distribution-free guarantee. The only assumption is that the data points are exchangeable. The entire argument fits on half a page.

Formal Setup

Let $(X_i, Y_i)$ for $i = 1, \ldots, n+1$ be random variables taking values in $\mathcal{X} \times \mathcal{Y}$. We observe the first $n$ pairs and the feature $X_{n+1}$ of a test point; the label $Y_{n+1}$ is to be predicted. Fix a user-specified miscoverage level $\alpha \in (0, 1)$, typically $\alpha = 0.1$ for 90% coverage or $\alpha = 0.05$ for 95%.

The goal is a set-valued function $\hat{C} : \mathcal{X} \to 2^{\mathcal{Y}}$ such that

$$\mathbb{P}\bigl(Y_{n+1} \in \hat{C}(X_{n+1})\bigr) \geq 1 - \alpha,$$

where the probability is over the joint distribution of all $n+1$ points. We want this to hold with no assumptions on that joint distribution beyond exchangeability, and to hold in finite samples rather than asymptotically.

Exchangeability: The Load-Bearing Assumption

Definition

Exchangeability

A sequence $Z_1, \ldots, Z_n$ of random variables is exchangeable if its joint distribution is invariant under permutation. For every permutation $\pi$ of $\{1, \ldots, n\}$,

$$(Z_1, \ldots, Z_n) \stackrel{d}{=} (Z_{\pi(1)}, \ldots, Z_{\pi(n)}).$$

Exchangeability is strictly weaker than i.i.d. Every i.i.d. sequence is exchangeable. Not every exchangeable sequence is i.i.d.: de Finetti's theorem shows that infinite exchangeable sequences are mixtures of i.i.d. sequences, which means exchangeability permits a latent common parameter.

For conformal prediction, exchangeability is the load-bearing column. Everything else can be relaxed. The features can be arbitrarily high-dimensional. The response can be continuous, categorical, or structured. The underlying predictor can be a linear regression, a random forest, a deep neural network, or a large language model. No distributional form is assumed.

When exchangeability fails, conformal prediction fails in a precise and quantifiable way. That is the subject of the weighted conformal prediction page. Here we assume it holds.

The Split Construction

Partition the available data into two disjoint pieces. Given $n$ labelled points, choose a split size $n_1$ and use

$$\mathcal{D}_{\mathrm{train}} = \{(X_i, Y_i) : i = 1, \ldots, n_1\}$$

to fit the predictor $\hat{f}$. The remaining $n_2 = n - n_1$ points form the calibration set

$$\mathcal{D}_{\mathrm{cal}} = \{(X_i, Y_i) : i = n_1 + 1, \ldots, n\}.$$

The predictor $\hat{f}$ is treated as fixed after training. The calibration set is never used for fitting. This separation is the reason the coverage proof goes through in a few lines.

Definition

Nonconformity Score

A nonconformity score is a function $s : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$ that measures how unusual a candidate $(x, y)$ pair looks relative to the fitted predictor. Larger values mean more unusual. In regression the canonical choice is the absolute residual

$$s(x, y) = |y - \hat{f}(x)|.$$

In classification with softmax outputs $\hat{p}_y(x)$, a common score is $s(x, y) = 1 - \hat{p}_y(x)$. The coverage guarantee holds for any measurable $s$. Better scores produce tighter sets; they do not change the validity proof.

The score must be defined before the calibration set is inspected. It can depend on $\hat{f}$, which was fit on training data, but it cannot be tuned using calibration data. Violating this rule breaks the argument.

The Quantile Construction

Compute the score on every calibration point: $s_i = s(X_i, Y_i)$ for $i = n_1 + 1, \ldots, n$. Let $\hat{q}$ be the $\lceil (n_2 + 1)(1 - \alpha) \rceil / n_2$ empirical quantile of $\{s_{n_1+1}, \ldots, s_n\}$. The prediction set is

$$\hat{C}(X_{n+1}) = \bigl\{y \in \mathcal{Y} : s(X_{n+1}, y) \leq \hat{q}\bigr\}.$$

For the absolute-residual score this set is the interval $[\hat{f}(X_{n+1}) - \hat{q}, \, \hat{f}(X_{n+1}) + \hat{q}]$ of constant width $2\hat{q}$.

The slightly odd quantile level $\lceil (n_2 + 1)(1 - \alpha) \rceil / n_2$, rather than the natural $(1 - \alpha)$, is the single technical detail that turns the finite-sample guarantee from approximate to exact. We are computing a quantile over $n_2 + 1$ objects (the calibration scores plus the unknown test score), and the ceiling adjustment accounts for the discreteness of empirical quantiles.
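The whole construction compresses to a few lines. Below is a minimal sketch, assuming NumPy and a scikit-learn-style regressor (the random forest is an arbitrary stand-in for any black-box predictor); the only conformal-specific step is the rank computation for $\hat{q}$.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def split_conformal(X_train, y_train, X_cal, y_cal, X_test, alpha=0.1):
    """Split conformal intervals with the absolute-residual score."""
    # Step 1: fit any predictor on the training split only.
    model = RandomForestRegressor(random_state=0).fit(X_train, y_train)

    # Step 2: nonconformity scores on the calibration split.
    scores = np.abs(y_cal - model.predict(X_cal))

    # Step 3: q_hat is the k-th smallest score, k = ceil((n2+1)(1-alpha)).
    n2 = len(scores)
    k = int(np.ceil((n2 + 1) * (1 - alpha)))
    # If k > n2 the calibration set is too small for this alpha; the only
    # valid prediction set is the whole real line.
    q_hat = np.sort(scores)[k - 1] if k <= n2 else np.inf

    # Step 4: constant-width interval around each point prediction.
    pred = model.predict(X_test)
    return pred - q_hat, pred + q_hat
```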

Main Theorem

Theorem

Split Conformal Coverage

Statement

Let $\hat{C}(X_{n+1})$ be the split conformal prediction set at miscoverage level $\alpha$. Then

$$\mathbb{P}\bigl(Y_{n+1} \in \hat{C}(X_{n+1})\bigr) \geq 1 - \alpha.$$

If the scores have no ties almost surely, the coverage is also upper-bounded:

$$\mathbb{P}\bigl(Y_{n+1} \in \hat{C}(X_{n+1})\bigr) \leq 1 - \alpha + \frac{1}{n_2 + 1}.$$

Intuition

By exchangeability, the test score $s_{n+1} = s(X_{n+1}, Y_{n+1})$ is just as likely to fall at any rank among the $n_2 + 1$ scores as at any other. Its rank is uniform on $\{1, \ldots, n_2 + 1\}$. Choosing $\hat{q}$ at the $\lceil (n_2 + 1)(1 - \alpha) \rceil$-th order statistic catches the test score with probability at least $1 - \alpha$ mechanically.

Proof Sketch

Let $s_{n+1} = s(X_{n+1}, Y_{n+1})$. Because $\hat{f}$ was trained on $\mathcal{D}_{\mathrm{train}}$ only, the score map $(x, y) \mapsto s(x, y)$ is fixed before the calibration and test points are touched. Applying a fixed measurable function to each of $n_2 + 1$ exchangeable pairs preserves exchangeability, so the scores $s_{n_1+1}, \ldots, s_n, s_{n+1}$ are themselves exchangeable.

The rank of $s_{n+1}$ among these $n_2 + 1$ exchangeable values is uniform on $\{1, \ldots, n_2 + 1\}$ (up to ties). For any integer $k$,

$$\mathbb{P}(s_{n+1} \leq s_{(k)}) = \frac{k}{n_2 + 1},$$

where $s_{(k)}$ is the $k$-th order statistic of the calibration scores. Setting $k = \lceil (n_2 + 1)(1 - \alpha) \rceil$ gives $\mathbb{P}(s_{n+1} \leq \hat{q}) \geq 1 - \alpha$. The event $\{s_{n+1} \leq \hat{q}\}$ is exactly $\{Y_{n+1} \in \hat{C}(X_{n+1})\}$.

The upper bound in the no-ties case uses the same rank uniformity: the probability is at most $k / (n_2 + 1) \leq 1 - \alpha + 1/(n_2 + 1)$.

Why It Matters

The proof uses nothing about the distribution of $X$ or $Y$, the function class of $\hat{f}$, or the dimension of the feature space. It uses exchangeability and the definition of the empirical quantile. That is the entire content of the theorem. Any predictor you can call, conformal prediction can wrap.
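Because the claim is distribution-free, it can be stress-tested on any synthetic process. A sketch (the heteroscedastic data-generating process and the deliberately misspecified linear fit below are choices made here, not part of the theorem): empirical coverage lands at or just above $1 - \alpha$ even though the model is badly wrong.

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, n1, n2, n_trials = 0.1, 200, 100, 2000
covered = 0

for _ in range(n_trials):
    # Nonlinear regression with heteroscedastic noise, deliberately
    # mismatched to the linear predictor fitted below.
    X = rng.uniform(-2, 2, size=n1 + n2 + 1)
    Y = np.sin(3 * X) + (0.2 + 0.3 * np.abs(X)) * rng.standard_normal(X.shape)

    # "Train" a poor predictor on the first n1 points: simple least squares.
    a, b = np.polyfit(X[:n1], Y[:n1], deg=1)
    f_hat = lambda x: a * x + b

    # Calibrate on the next n2 points with the absolute-residual score.
    scores = np.abs(Y[n1:n1 + n2] - f_hat(X[n1:n1 + n2]))
    k = int(np.ceil((n2 + 1) * (1 - alpha)))
    q_hat = np.sort(scores)[k - 1]

    # Check coverage on the held-out test point.
    covered += abs(Y[-1] - f_hat(X[-1])) <= q_hat

print(f"empirical coverage: {covered / n_trials:.3f}  (target >= {1 - alpha})")
```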

Failure Mode

The guarantee is marginal coverage, averaged over $X_{n+1}$. It does not promise coverage at a specific $X_{n+1} = x$, and distribution-free conditional coverage is impossible in a strong sense (see the next theorem). The guarantee also fails the moment exchangeability fails, which happens automatically under any nontrivial covariate shift or temporal drift.

Conformalized Quantile Regression

The absolute-residual score gives prediction intervals of constant width. That is wasteful when the true conditional variance of $Y \mid X$ varies across the feature space. Conformalized quantile regression (CQR) fixes this.

Train two quantile regressors $\hat{q}_{\alpha/2}(x)$ and $\hat{q}_{1-\alpha/2}(x)$ on the training set, targeting the lower and upper conditional quantiles. Define the score as

$$s(x, y) = \max\bigl(\hat{q}_{\alpha/2}(x) - y, \; y - \hat{q}_{1-\alpha/2}(x)\bigr),$$

which is positive exactly when $y$ falls outside the predicted quantile interval and measures how far outside. Apply the split conformal procedure to this score. The resulting prediction set is

$$\hat{C}(x) = \bigl[\hat{q}_{\alpha/2}(x) - \hat{q}, \; \hat{q}_{1-\alpha/2}(x) + \hat{q}\bigr],$$

an interval of adaptive width. Coverage is still guaranteed by the same exchangeability argument, because the score is just another measurable function. In practice CQR is now the default choice in applied regression settings.
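A sketch of the CQR recipe, assuming scikit-learn's gradient-boosted quantile regressors (pinball loss) as the base learners; any pair of quantile estimators slots in the same way.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def cqr_interval(X_train, y_train, X_cal, y_cal, X_test, alpha=0.1):
    """Conformalized quantile regression, a minimal sketch."""
    # Step 1: fit lower and upper conditional-quantile models on the
    # training split (pinball loss at levels alpha/2 and 1 - alpha/2).
    lo = GradientBoostingRegressor(loss="quantile", alpha=alpha / 2).fit(X_train, y_train)
    hi = GradientBoostingRegressor(loss="quantile", alpha=1 - alpha / 2).fit(X_train, y_train)

    # Step 2: CQR score = signed distance of y outside [lo(x), hi(x)].
    scores = np.maximum(lo.predict(X_cal) - y_cal, y_cal - hi.predict(X_cal))

    # Step 3: the usual conformal quantile of the scores.
    n2 = len(scores)
    k = int(np.ceil((n2 + 1) * (1 - alpha)))
    q_hat = np.sort(scores)[k - 1]

    # Step 4: shift both band edges outward by q_hat.
    return lo.predict(X_test) - q_hat, hi.predict(X_test) + q_hat
```

Note that $\hat{q}$ can come out negative, in which case the conformal step shrinks an overconservative quantile band rather than widening it.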

Classification: Adaptive Prediction Sets

The naive classification score $s(x, y) = 1 - \hat{p}_y(x)$ produces prediction sets that are too small on easy examples and too large on hard ones. Adaptive Prediction Sets (APS) use the cumulative mass of classes ranked at least as likely as $y$:

$$s(x, y) = \sum_{k : \hat{p}_k(x) \geq \hat{p}_y(x)} \hat{p}_k(x).$$

Sets built from this score expand or contract with example difficulty. Regularized APS adds a penalty that discourages runaway set sizes on the hardest inputs; it is often the right default for classification with many classes.
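A vectorized sketch of plain APS from softmax matrices; the randomized tie-breaking at the boundary and the regularization penalty mentioned above are omitted for clarity, so these sets are slightly conservative.

```python
import numpy as np

def aps_sets(probs, cal_probs, cal_labels, alpha=0.1):
    """Adaptive Prediction Sets from softmax outputs, a minimal sketch.

    probs:      (m, K) softmax matrix for the test points
    cal_probs:  (n2, K) softmax matrix for the calibration points
    cal_labels: (n2,) integer labels for the calibration points
    """
    # Calibration scores: total mass of classes at least as likely as
    # the true label, s(x, y) = sum over {k : p_k >= p_y} of p_k.
    p_true = cal_probs[np.arange(len(cal_labels)), cal_labels]
    scores = np.where(cal_probs >= p_true[:, None], cal_probs, 0.0).sum(axis=1)

    # Conformal quantile with the ceiling adjustment.
    n2 = len(scores)
    k = int(np.ceil((n2 + 1) * (1 - alpha)))
    q_hat = np.sort(scores)[k - 1]

    # Test-time sets: all labels whose APS score is <= q_hat.
    # Element [i, y, k] of the mask below is True when p_k >= p_y.
    test_scores = np.where(
        probs[:, None, :] >= probs[:, :, None], probs[:, None, :], 0.0
    ).sum(axis=2)
    return [np.flatnonzero(row <= q_hat) for row in test_scores]
```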

A Three-Point Example

Example

Let the $n_2 = 3$ calibration residuals be $s_{(1)} = 0.5$, $s_{(2)} = 1.2$, $s_{(3)} = 2.1$, and take $\alpha = 0.25$. The ceiling quantile level is $\lceil 4 \cdot 0.75 \rceil / 3 = 3/3$, so $\hat{q} = s_{(3)} = 2.1$.

By rank uniformity, the test residual $|Y_4 - \hat{f}(X_4)|$ has rank uniformly distributed on $\{1, 2, 3, 4\}$, so the probability it falls at or below $\hat{q}$ is $3/4 = 1 - \alpha$ exactly. The prediction interval $[\hat{f}(X_4) - 2.1, \hat{f}(X_4) + 2.1]$ covers $Y_4$ with probability $0.75$.

Notice the quantile is the largest of the three calibration residuals, not an interpolation. For small $n_2$ the ceiling rule is binding, and coverage can sit noticeably above $1 - \alpha$. For $n_2 \geq 100$ the slack is under $1\%$.
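The same arithmetic in code form, useful for checking an implementation against:

```python
import numpy as np

scores = np.array([0.5, 1.2, 2.1])                 # the three calibration residuals
alpha = 0.25
k = int(np.ceil((len(scores) + 1) * (1 - alpha)))  # ceil(4 * 0.75) = 3
q_hat = np.sort(scores)[k - 1]                     # 3rd smallest = 2.1
print(k, q_hat)                                    # 3 2.1
```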

What Split Conformal Does Not Give You

Marginal coverage is weaker than conditional coverage. The guarantee

$$\mathbb{P}\bigl(Y_{n+1} \in \hat{C}(X_{n+1})\bigr) \geq 1 - \alpha$$

averages over the randomness in $X_{n+1}$. It does not guarantee

$$\mathbb{P}\bigl(Y_{n+1} \in \hat{C}(X_{n+1}) \mid X_{n+1} = x\bigr) \geq 1 - \alpha$$

for every $x$. A split conformal predictor can have 90% marginal coverage while delivering 50% coverage on one subpopulation and 100% on another. This is not a bug of the construction. It is the price of making no assumptions.

Theorem

Distribution-Free Conditional Coverage Is Impossible

Statement

Let $P_X$ be absolutely continuous on $\mathbb{R}^d$. Any prediction set procedure $\hat{C}$ that achieves distribution-free conditional coverage

$$\mathbb{P}\bigl(Y_{n+1} \in \hat{C}(X_{n+1}) \mid X_{n+1} = x\bigr) \geq 1 - \alpha$$

for every $x$ (uniformly in the distribution) must have infinite expected Lebesgue measure: $\mathbb{E}[\mathrm{Leb}(\hat{C}(X_{n+1}))] = \infty$.

Intuition

A procedure that must cover at every $x$, without any smoothness or structural assumption tying nearby $x$ values together, cannot borrow strength across the feature space. At a point never seen in training the procedure has no information, so it must return the whole space (or a set of positive measure under arbitrary distributions). Averaging such sets gives infinite expected size.

Why It Matters

Useful conditional coverage is purchasable only by imposing structural assumptions (smoothness, parametric form, localizability) or by relaxing the target (coverage over bands or groups rather than every point). Conformal prediction is honest about this. Parametric prediction intervals that claim conditional coverage implicitly rely on the model being correct; conformal does not make that claim in the first place.

Split conformal also makes no claim about set-size optimality. The intervals it produces have valid coverage but may be wider than necessary. The width depends entirely on the quality of the underlying predictor. Conformal prediction is a coverage-calibration tool, not a prediction-improvement tool.

Common Confusions

Watch Out

Conformal does not improve your predictor

Conformal prediction wraps a fitted model. If the predictor is poor, the conformal interval is wide. If the predictor is good, the interval is narrow. Either way the coverage probability is correct. Do not expect conformal calibration to recover signal the underlying model missed.

Watch Out

Marginal coverage is not conditional coverage

A 90% marginal guarantee says that averaged over $X$, the interval covers $Y$ ninety percent of the time. It says nothing about any particular subgroup. Split conformal can give 99% coverage on the majority subgroup and 50% on a minority subgroup, and still satisfy the 90% marginal guarantee. For group-level coverage you need group-conditional conformal or Mondrian conformal, which apply the construction within each group, as in the sketch below.
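A sketch of the group-conditional (Mondrian) quantile step, assuming a discrete group label observed for both calibration and test points: calibrate one conformal quantile per group.

```python
import numpy as np

def mondrian_qhat(scores, groups, alpha=0.1):
    """One conformal quantile per group: the 1 - alpha guarantee then
    holds within each group separately, by the usual argument applied
    group by group."""
    q_hat = {}
    for g in np.unique(groups):
        s = np.sort(scores[groups == g])
        k = int(np.ceil((len(s) + 1) * (1 - alpha)))
        # A group with too few calibration points gets an infinite
        # quantile: its prediction set is the whole label space.
        q_hat[g] = s[k - 1] if k <= len(s) else np.inf
    return q_hat
```

At prediction time the set for a test point uses the quantile of that point's group. The price of the per-group guarantee is that each quantile is calibrated on fewer points, so sets widen on small groups.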

Watch Out

The test point enters the quantile

Many first implementations compute the $(1 - \alpha)$-quantile of the $n_2$ calibration scores and stop there. The correct quantile level is $\lceil (n_2 + 1)(1 - \alpha) \rceil / n_2$, which accounts for the test point being one of the $n_2 + 1$ exchangeable scores. Using $(1 - \alpha)$ directly undercovers slightly for small $n_2$.
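A quick numeric illustration with made-up scores (numbers chosen here, not from the text): for $n_2 = 9$ and $\alpha = 0.1$ the adjusted rule demands the largest calibration score, while the naive quantile interpolates strictly below it.

```python
import numpy as np

scores = np.sort(np.random.default_rng(1).exponential(size=9))  # n2 = 9
alpha = 0.1

# Adjusted rule: rank ceil((9 + 1) * 0.9) = 9, i.e. the largest score.
k = int(np.ceil((len(scores) + 1) * (1 - alpha)))
q_adjusted = scores[k - 1]

# Naive rule: the 0.9 empirical quantile of the 9 scores sits lower.
q_naive = np.quantile(scores, 1 - alpha)

print(q_naive, q_adjusted)  # q_naive < q_adjusted, so the naive rule undercovers
```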

Watch Out

Split conformal uses half the data

Splitting sacrifices statistical efficiency: the predictor sees only $n_1$ training points and the calibration quantile is estimated from only $n_2$ points. Full conformal prediction uses all $n$ points for both roles by refitting per candidate label, at exponential cost. Jackknife+ and CV+ recover most of the efficiency at polynomial cost and are worth knowing as intermediate options.

Forward Connection

The exchangeability assumption is where split conformal begins to strain in practice. Training data collected six months ago and test data collected today are rarely exchangeable. Models deployed across different populations face covariate shift by design. The natural extension is to weight the calibration points by the likelihood ratio between the test and training distributions, recovering coverage at the cost of estimating that ratio. That is the subject of the weighted conformal prediction page.

An orthogonal extension brings in sequential inference. Conformal prediction as stated holds at a fixed sample size. If an analyst peeks at coverage as more data arrives and stops when convenient, the marginal guarantee breaks. The fix uses e-values and test martingales, developed on the e-values and anytime-valid inference page.

Proofs to Rederive by Hand

Write it by hand

Reproducing these proofs from scratch is how you stop recognizing and start owning. Target: reproduce each in a few minutes after a week.

Split conformal coverage (the five-line argument). The rank-uniformity step is the template for nearly every conformal result. Start from "$s_{n+1}$ has uniform rank on $\{1, \ldots, n_2 + 1\}$ among the exchangeable scores," convert that to $\mathbb{P}(s_{n+1} \leq s_{(k)}) = k/(n_2+1)$, and pick $k = \lceil (n_2+1)(1-\alpha) \rceil$. Know why the ceiling is there, not just that it is. After a week, target a clean reproduction in under three minutes.

Conditional coverage impossibility (the no-free-lunch construction). Build two distributions that agree on a training set but differ on a thin slice of $x$-space. Any procedure with uniform coverage across distributions must cover on both, so at any $x$ in that slice the set must be the whole $y$-axis. Average over $P_X$ to get infinite expected Lebesgue measure. Writing this out makes precise why distribution-free conditional coverage costs the entire real line, and why parametric prediction intervals are paying that cost with unexamined assumptions instead.

Exercises

ExerciseCore

Problem

Let the $n_2 = 99$ calibration residuals be the integers $1, 2, \ldots, 99$, and take $\alpha = 0.1$. Compute $\hat{q}$ and the resulting prediction interval width.

ExerciseCore

Problem

A colleague implements split conformal using the unadjusted $(1-\alpha)$-quantile of the calibration residuals. For $\alpha = 0.1$ and $n_2 = 20$, quantify the worst-case undercoverage gap introduced by this shortcut.

ExerciseAdvanced

Problem

Construct a joint distribution on $(X, Y)$ with $X \in \{0, 1\}$, $\mathbb{P}(X = 0) = 0.9$, a perfectly accurate predictor on the $X = 0$ subgroup, and a completely uninformative predictor on the $X = 1$ subgroup, such that split conformal achieves marginal coverage $\geq 0.9$ but conditional coverage on $X = 1$ is $0.0$. Explain what this implies for auditing conformal deployments.

ExerciseResearch

Problem

Jackknife+ (Barber, Candès, Ramdas, Tibshirani 2021) recovers most of the statistical efficiency lost by splitting, at polynomial rather than exponential cost. State the jackknife+ prediction set construction and identify the weakened coverage guarantee relative to split conformal. Under what predictor-stability condition is the guarantee strengthened back to $1 - \alpha$?

Open Problems and Frontier

Distribution-free conditional coverage under minimal structural assumptions is open. Current partial results require smoothness or compactness conditions that rarely hold in high dimensions. The question of whether some intermediate notion between marginal and conditional coverage can be achieved distribution-free remains active.

Conformal prediction for dependent data (time series, spatial) requires either explicit modelling of the dependence or the nonexchangeable framework of Barber, Candès, Ramdas, Tibshirani (2023). Neither gives guarantees as clean as the i.i.d. case.

Computational shortcuts for full conformal prediction, which split conformal approximates by discarding half the data, are an ongoing line of research. Jackknife+ and CV+ partially close the efficiency gap; influence-function-based approximations are the current frontier.

Anytime-valid conformal prediction using e-values is an active direction. The standard procedure has a guarantee at a single fixed sample size. Extending to online settings where the analyst peeks at coverage as more data arrives requires the e-value machinery covered in a separate page.

References

Canonical:

  • Vovk, Gammerman, Shafer, Algorithmic Learning in a Random World (Springer, 2005). Chapters 2-3. The original book-length treatment.
  • Shafer, Vovk, "A Tutorial on Conformal Prediction." Journal of Machine Learning Research 9 (2008), 371-421.

Modern pedagogical:

  • Angelopoulos, Bates, "A Gentle Introduction to Conformal Prediction and Distribution-Free Uncertainty Quantification." Foundations and Trends in Machine Learning 16(4) (2023), 494-591. The best single reference.
  • Lei, G'Sell, Rinaldo, Tibshirani, Wasserman, "Distribution-Free Predictive Inference for Regression." Journal of the American Statistical Association 113(523) (2018), 1094-1111.

Adaptive methods:

  • Romano, Patterson, Candès, "Conformalized Quantile Regression." NeurIPS 2019.
  • Romano, Sesia, Candès, "Classification with Valid and Adaptive Coverage." NeurIPS 2020.

Limits and extensions:

  • Barber, Candès, Ramdas, Tibshirani, "The Limits of Distribution-Free Conditional Predictive Inference." Information and Inference 10(2) (2021), 455-482.
  • Barber, Candès, Ramdas, Tibshirani, "Predictive Inference with the Jackknife+." Annals of Statistics 49(1) (2021), 486-507.
  • Barber, Candès, Ramdas, Tibshirani, "Conformal Prediction Beyond Exchangeability." Annals of Statistics 51(2) (2023), 816-845.


Last reviewed: April 24, 2026
