

Split Conformal Prediction

A distribution-free, model-agnostic procedure that converts any point predictor into a prediction set with finite-sample marginal coverage. The only assumption is exchangeability. The proof is five lines.


Why This Matters

A supervised learning algorithm trained on $(X_1, Y_1), \ldots, (X_n, Y_n)$ produces a function $\hat{f}$ that maps inputs to predictions. For any new input $X_{n+1}$, the algorithm returns $\hat{f}(X_{n+1})$ as its best guess at $Y_{n+1}$. This is enough for a benchmark leaderboard. It is not enough for a decision.

A medical triage model that predicts "low risk" with 60% confidence should behave differently than one predicting "low risk" with 99% confidence. A credit model that returns a point estimate of default probability tells a loan officer nothing about the range of plausible outcomes. A weather forecast of "15 degrees tomorrow" is less useful than "between 12 and 18 degrees tomorrow, with 90% probability."

The traditional response is to build a parametric model, derive a predictive distribution, and report a prediction interval. This works when the model is correctly specified. Outside that regime, which includes most applications of modern ML, parametric prediction intervals have no meaningful coverage guarantee: a Bayesian neural network can report a 95% credible interval that contains the truth 60% of the time.

Split conformal prediction takes any predictor, treats it as a black box, and wraps it in a procedure that delivers valid prediction sets with a finite-sample, distribution-free guarantee. The only assumption is that the data points are exchangeable. The entire argument fits on half a page.

Formal Setup

Let $(X_i, Y_i)$ for $i = 1, \ldots, n+1$ be random variables taking values in $\mathcal{X} \times \mathcal{Y}$. We observe the first $n$ pairs and the feature $X_{n+1}$ of a test point; the label $Y_{n+1}$ is to be predicted. Fix a user-specified miscoverage level $\alpha \in (0, 1)$, typically $\alpha = 0.1$ for 90% coverage or $\alpha = 0.05$ for 95%.

The goal is a set-valued function $\hat{C} : \mathcal{X} \to 2^{\mathcal{Y}}$ such that

$$\mathbb{P}\bigl(Y_{n+1} \in \hat{C}(X_{n+1})\bigr) \geq 1 - \alpha,$$

where the probability is over the joint distribution of all $n+1$ points. We want this to hold with no assumptions on that joint distribution beyond exchangeability, and to hold in finite samples rather than asymptotically.

Exchangeability: The Load-Bearing Assumption

Definition

Exchangeability

A sequence $Z_1, \ldots, Z_n$ of random variables is exchangeable if its joint distribution is invariant under permutation. For every permutation $\pi$ of $\{1, \ldots, n\}$,

$$(Z_1, \ldots, Z_n) \stackrel{d}{=} (Z_{\pi(1)}, \ldots, Z_{\pi(n)}).$$

Exchangeability is strictly weaker than i.i.d. Every i.i.d. sequence is exchangeable. Not every exchangeable sequence is i.i.d.: de Finetti's theorem shows that infinite exchangeable sequences are mixtures of i.i.d. sequences, which means exchangeability permits a latent common parameter.

For conformal prediction, exchangeability is the load-bearing column. Everything else can be relaxed. The features can be arbitrarily high-dimensional. The response can be continuous, categorical, or structured. The underlying predictor can be a linear regression, a random forest, a deep neural network, or a large language model. No distributional form is assumed.

When exchangeability fails, conformal prediction fails in a precise and quantifiable way. That is the subject of the weighted conformal prediction page. Here we assume it holds.

The Split Construction

Partition the available data into two disjoint pieces. Given $n$ labelled points, choose a split size $n_1$ and use

$$\mathcal{D}_{\mathrm{train}} = \{(X_i, Y_i) : i = 1, \ldots, n_1\}$$

to fit the predictor $\hat{f}$. The remaining $n_2 = n - n_1$ points form the calibration set

$$\mathcal{D}_{\mathrm{cal}} = \{(X_i, Y_i) : i = n_1 + 1, \ldots, n\}.$$

The predictor $\hat{f}$ is treated as fixed after training. The calibration set is never used for fitting. This separation is the reason the coverage proof goes through in a few lines.

Definition

Nonconformity Score

A nonconformity score is a function $s : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$ that measures how unusual a candidate $(x, y)$ pair looks relative to the fitted predictor. Larger values mean more unusual. In regression the canonical choice is the absolute residual

$$s(x, y) = |y - \hat{f}(x)|.$$

In classification with softmax outputs $\hat{p}_y(x)$, a common score is $s(x, y) = 1 - \hat{p}_y(x)$. The coverage guarantee holds for any measurable $s$. Better scores produce tighter sets; they do not change the validity proof.

The score must be defined before the calibration set is inspected. It can depend on $\hat{f}$, which was fit on training data, but it cannot be tuned using calibration data. Violating this rule breaks the argument.

The Quantile Construction

Compute the score on every calibration point: $s_i = s(X_i, Y_i)$ for $i = n_1 + 1, \ldots, n$. Let $\hat{q}$ be the $\lceil (n_2 + 1)(1 - \alpha) \rceil / n_2$ empirical quantile of $\{s_{n_1+1}, \ldots, s_n\}$. The prediction set is

$$\hat{C}(X_{n+1}) = \bigl\{y \in \mathcal{Y} : s(X_{n+1}, y) \leq \hat{q}\bigr\}.$$

For the absolute-residual score this set is the interval $[\hat{f}(X_{n+1}) - \hat{q}, \, \hat{f}(X_{n+1}) + \hat{q}]$ of constant width $2\hat{q}$.

The slightly odd quantile level $\lceil (n_2 + 1)(1 - \alpha) \rceil / n_2$, rather than the natural $(1 - \alpha)$, is the single technical detail that turns the finite-sample guarantee from approximate to exact. We are computing a quantile over $n_2 + 1$ objects (the calibration scores plus the unknown test score), and the ceiling adjustment accounts for the discreteness of empirical quantiles.
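The whole construction compresses to a few lines. Below is a minimal sketch, assuming NumPy and a scikit-learn-style regressor (the random forest is an arbitrary stand-in for any black-box predictor); the only conformal-specific step is the rank computation for $\hat{q}$.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def split_conformal(X_train, y_train, X_cal, y_cal, X_test, alpha=0.1):
    """Split conformal intervals with the absolute-residual score."""
    # Step 1: fit any predictor on the training split only.
    model = RandomForestRegressor(random_state=0).fit(X_train, y_train)

    # Step 2: nonconformity scores on the calibration split.
    scores = np.abs(y_cal - model.predict(X_cal))

    # Step 3: q_hat is the k-th smallest score, k = ceil((n2+1)(1-alpha)).
    n2 = len(scores)
    k = int(np.ceil((n2 + 1) * (1 - alpha)))
    # If k > n2 the calibration set is too small for this alpha; the only
    # valid prediction set is the whole real line.
    q_hat = np.sort(scores)[k - 1] if k <= n2 else np.inf

    # Step 4: constant-width interval around each point prediction.
    pred = model.predict(X_test)
    return pred - q_hat, pred + q_hat
```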

Main Theorem

Theorem

Split Conformal Coverage

Statement

Let $\hat{C}(X_{n+1})$ be the split conformal prediction set at miscoverage level $\alpha$. Then

$$\mathbb{P}\bigl(Y_{n+1} \in \hat{C}(X_{n+1})\bigr) \geq 1 - \alpha.$$

If the scores have no ties almost surely, the coverage is also upper-bounded:

$$\mathbb{P}\bigl(Y_{n+1} \in \hat{C}(X_{n+1})\bigr) \leq 1 - \alpha + \frac{1}{n_2 + 1}.$$

Intuition

By exchangeability, the test score $s_{n+1} = s(X_{n+1}, Y_{n+1})$ is just as likely to fall at any rank among the $n_2 + 1$ scores as at any other. Its rank is uniform on $\{1, \ldots, n_2 + 1\}$. Choosing $\hat{q}$ at the $\lceil (n_2 + 1)(1 - \alpha) \rceil$-th order statistic catches the test score with probability at least $1 - \alpha$ mechanically.

Proof Sketch

Let $s_{n+1} = s(X_{n+1}, Y_{n+1})$. Because $\hat{f}$ was trained on $\mathcal{D}_{\mathrm{train}}$ only, the score map $(x, y) \mapsto s(x, y)$ is fixed before the calibration and test points are touched. Applying a fixed measurable function to each of $n_2 + 1$ exchangeable pairs preserves exchangeability, so the scores $s_{n_1+1}, \ldots, s_n, s_{n+1}$ are themselves exchangeable.

The rank of $s_{n+1}$ among these $n_2 + 1$ exchangeable values is uniform on $\{1, \ldots, n_2 + 1\}$ (up to ties). For any integer $k$,

$$\mathbb{P}(s_{n+1} \leq s_{(k)}) = \frac{k}{n_2 + 1},$$

where $s_{(k)}$ is the $k$-th order statistic of the calibration scores. Setting $k = \lceil (n_2 + 1)(1 - \alpha) \rceil$ gives $\mathbb{P}(s_{n+1} \leq \hat{q}) \geq 1 - \alpha$. The event $\{s_{n+1} \leq \hat{q}\}$ is exactly $\{Y_{n+1} \in \hat{C}(X_{n+1})\}$.

The upper bound in the no-ties case uses the same rank uniformity: the probability is at most $k / (n_2 + 1) \leq 1 - \alpha + 1/(n_2 + 1)$.

Why It Matters

The proof uses nothing about the distribution of $X$ or $Y$, the function class of $\hat{f}$, or the dimension of the feature space. It uses exchangeability and the definition of the empirical quantile. That is the entire content of the theorem. Any predictor you can call, conformal prediction can wrap.
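Because the claim is distribution-free, it can be stress-tested on any synthetic process. A sketch (the heteroscedastic data-generating process and the deliberately misspecified linear fit below are choices made here, not part of the theorem): empirical coverage lands at or just above $1 - \alpha$ even though the model is badly wrong.

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, n1, n2, n_trials = 0.1, 200, 100, 2000
covered = 0

for _ in range(n_trials):
    # Nonlinear regression with heteroscedastic noise, deliberately
    # mismatched to the linear predictor fitted below.
    X = rng.uniform(-2, 2, size=n1 + n2 + 1)
    Y = np.sin(3 * X) + (0.2 + 0.3 * np.abs(X)) * rng.standard_normal(X.shape)

    # "Train" a poor predictor on the first n1 points: simple least squares.
    a, b = np.polyfit(X[:n1], Y[:n1], deg=1)
    f_hat = lambda x: a * x + b

    # Calibrate on the next n2 points with the absolute-residual score.
    scores = np.abs(Y[n1:n1 + n2] - f_hat(X[n1:n1 + n2]))
    k = int(np.ceil((n2 + 1) * (1 - alpha)))
    q_hat = np.sort(scores)[k - 1]

    # Check coverage on the held-out test point.
    covered += abs(Y[-1] - f_hat(X[-1])) <= q_hat

print(f"empirical coverage: {covered / n_trials:.3f}  (target >= {1 - alpha})")
```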

Failure Mode

The guarantee is marginal coverage, averaged over $X_{n+1}$. It does not promise coverage at a specific $X_{n+1} = x$, and distribution-free conditional coverage is impossible in a strong sense (see the next theorem). The guarantee also fails the moment exchangeability fails, which happens automatically under any nontrivial covariate shift or temporal drift.

Conformalized Quantile Regression

The absolute-residual score gives prediction intervals of constant width. That is wasteful when the true conditional variance of $Y \mid X$ varies across the feature space. Conformalized quantile regression (CQR) fixes this.

Train two quantile regressors $\hat{q}_{\alpha/2}(x)$ and $\hat{q}_{1-\alpha/2}(x)$ on the training set, targeting the lower and upper conditional quantiles. Define the score as

$$s(x, y) = \max\bigl(\hat{q}_{\alpha/2}(x) - y, \; y - \hat{q}_{1-\alpha/2}(x)\bigr),$$

which is positive exactly when $y$ falls outside the predicted quantile interval and measures how far outside. Apply the split conformal procedure to this score. The resulting prediction set is

$$\hat{C}(x) = \bigl[\hat{q}_{\alpha/2}(x) - \hat{q}, \; \hat{q}_{1-\alpha/2}(x) + \hat{q}\bigr],$$

an interval of adaptive width. Coverage is still guaranteed by the same exchangeability argument, because the score is just another measurable function. In practice CQR is now the default choice in applied regression settings.
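A sketch of the CQR recipe, assuming scikit-learn's gradient-boosted quantile regressors (pinball loss) as the base learners; any pair of quantile estimators slots in the same way.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def cqr_interval(X_train, y_train, X_cal, y_cal, X_test, alpha=0.1):
    """Conformalized quantile regression, a minimal sketch."""
    # Step 1: fit lower and upper conditional-quantile models on the
    # training split (pinball loss at levels alpha/2 and 1 - alpha/2).
    lo = GradientBoostingRegressor(loss="quantile", alpha=alpha / 2).fit(X_train, y_train)
    hi = GradientBoostingRegressor(loss="quantile", alpha=1 - alpha / 2).fit(X_train, y_train)

    # Step 2: CQR score = signed distance of y outside [lo(x), hi(x)].
    scores = np.maximum(lo.predict(X_cal) - y_cal, y_cal - hi.predict(X_cal))

    # Step 3: the usual conformal quantile of the scores.
    n2 = len(scores)
    k = int(np.ceil((n2 + 1) * (1 - alpha)))
    q_hat = np.sort(scores)[k - 1]

    # Step 4: shift both band edges outward by q_hat.
    return lo.predict(X_test) - q_hat, hi.predict(X_test) + q_hat
```

Note that $\hat{q}$ can come out negative, in which case the conformal step shrinks an overconservative quantile band rather than widening it.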

Classification: Adaptive Prediction Sets

The naive classification score $s(x, y) = 1 - \hat{p}_y(x)$ produces prediction sets that are too small on easy examples and too large on hard ones. Adaptive Prediction Sets (APS) use the cumulative mass of classes ranked at least as likely as $y$:

$$s(x, y) = \sum_{k : \hat{p}_k(x) \geq \hat{p}_y(x)} \hat{p}_k(x).$$

Sets built from this score expand or contract with example difficulty. Regularized APS adds a penalty that discourages runaway set sizes on the hardest inputs; it is often the right default for classification with many classes.
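A vectorized sketch of plain APS from softmax matrices; the randomized tie-breaking at the boundary and the regularization penalty mentioned above are omitted for clarity, so these sets are slightly conservative.

```python
import numpy as np

def aps_sets(probs, cal_probs, cal_labels, alpha=0.1):
    """Adaptive Prediction Sets from softmax outputs, a minimal sketch.

    probs:      (m, K) softmax matrix for the test points
    cal_probs:  (n2, K) softmax matrix for the calibration points
    cal_labels: (n2,) integer labels for the calibration points
    """
    # Calibration scores: total mass of classes at least as likely as
    # the true label, s(x, y) = sum over {k : p_k >= p_y} of p_k.
    p_true = cal_probs[np.arange(len(cal_labels)), cal_labels]
    scores = np.where(cal_probs >= p_true[:, None], cal_probs, 0.0).sum(axis=1)

    # Conformal quantile with the ceiling adjustment.
    n2 = len(scores)
    k = int(np.ceil((n2 + 1) * (1 - alpha)))
    q_hat = np.sort(scores)[k - 1]

    # Test-time sets: all labels whose APS score is <= q_hat.
    # Element [i, y, k] of the mask below is True when p_k >= p_y.
    test_scores = np.where(
        probs[:, None, :] >= probs[:, :, None], probs[:, None, :], 0.0
    ).sum(axis=2)
    return [np.flatnonzero(row <= q_hat) for row in test_scores]
```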

A Three-Point Example

Example

Let the $n_2 = 3$ calibration residuals be $s_{(1)} = 0.5$, $s_{(2)} = 1.2$, $s_{(3)} = 2.1$, and take $\alpha = 0.25$. The ceiling quantile level is $\lceil 4 \cdot 0.75 \rceil / 3 = 3/3$, so $\hat{q} = s_{(3)} = 2.1$.

By rank uniformity, the test residual $|Y_4 - \hat{f}(X_4)|$ has rank uniformly distributed on $\{1, 2, 3, 4\}$, so the probability it falls at or below $\hat{q}$ is $3/4 = 1 - \alpha$ exactly. The prediction interval $[\hat{f}(X_4) - 2.1, \hat{f}(X_4) + 2.1]$ covers $Y_4$ with probability $0.75$.

Notice the quantile is the largest of the three calibration residuals, not an interpolation. For small $n_2$ the ceiling rule is binding, and coverage can sit noticeably above $1 - \alpha$. For $n_2 \geq 100$ the slack is under $1\%$.
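The same arithmetic in code form, useful for checking an implementation against:

```python
import numpy as np

scores = np.array([0.5, 1.2, 2.1])                 # the three calibration residuals
alpha = 0.25
k = int(np.ceil((len(scores) + 1) * (1 - alpha)))  # ceil(4 * 0.75) = 3
q_hat = np.sort(scores)[k - 1]                     # 3rd smallest = 2.1
print(k, q_hat)                                    # 3 2.1
```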

What Split Conformal Does Not Give You

Marginal coverage is weaker than conditional coverage. The guarantee

$$\mathbb{P}\bigl(Y_{n+1} \in \hat{C}(X_{n+1})\bigr) \geq 1 - \alpha$$

averages over the randomness in $X_{n+1}$. It does not guarantee

$$\mathbb{P}\bigl(Y_{n+1} \in \hat{C}(X_{n+1}) \mid X_{n+1} = x\bigr) \geq 1 - \alpha$$

for every $x$. A split conformal predictor can have 90% marginal coverage while delivering 50% coverage on one subpopulation and 100% on another. This is not a bug of the construction. It is the price of making no assumptions.

Theorem

Distribution-Free Conditional Coverage Is Impossible

Statement

Let $P_X$ be absolutely continuous on $\mathbb{R}^d$. Any prediction set procedure $\hat{C}$ that achieves distribution-free conditional coverage

$$\mathbb{P}\bigl(Y_{n+1} \in \hat{C}(X_{n+1}) \mid X_{n+1} = x\bigr) \geq 1 - \alpha$$

for every $x$ (uniformly in the distribution) must have infinite expected Lebesgue measure: $\mathbb{E}[\mathrm{Leb}(\hat{C}(X_{n+1}))] = \infty$.

Intuition

A procedure that must cover at every $x$, without any smoothness or structural assumption tying nearby $x$ values together, cannot borrow strength across the feature space. At a point never seen in training the procedure has no information, so it must return the whole space (or a set of positive measure under arbitrary distributions). Averaging such sets gives infinite expected size.

Why It Matters

Useful conditional coverage is purchasable only by imposing structural assumptions (smoothness, parametric form, localizability) or by relaxing the target (coverage over bands or groups rather than every point). Conformal prediction is honest about this. Parametric prediction intervals that claim conditional coverage implicitly rely on the model being correct; conformal does not make that claim in the first place.

Split conformal also makes no claim about set-size optimality. The intervals it produces have valid coverage but may be wider than necessary. The width depends entirely on the quality of the underlying predictor. Conformal prediction is a coverage-calibration tool, not a prediction-improvement tool.

Common Confusions

Watch Out

Conformal does not improve your predictor

Conformal prediction wraps a fitted model. If the predictor is poor, the conformal interval is wide. If the predictor is good, the interval is narrow. Either way the coverage probability is correct. Do not expect conformal calibration to recover signal the underlying model missed.

Watch Out

Marginal coverage is not conditional coverage

A 90% marginal guarantee says that averaged over $X$, the interval covers $Y$ ninety percent of the time. It says nothing about any particular subgroup. Split conformal can give 99% coverage on the majority subgroup and 50% on a minority subgroup, and still satisfy the 90% marginal guarantee. For group-level coverage you need group-conditional conformal or Mondrian conformal, which apply the construction within each group, as in the sketch below.
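A sketch of the group-conditional (Mondrian) quantile step, assuming a discrete group label observed for both calibration and test points: calibrate one conformal quantile per group.

```python
import numpy as np

def mondrian_qhat(scores, groups, alpha=0.1):
    """One conformal quantile per group: the 1 - alpha guarantee then
    holds within each group separately, by the usual argument applied
    group by group."""
    q_hat = {}
    for g in np.unique(groups):
        s = np.sort(scores[groups == g])
        k = int(np.ceil((len(s) + 1) * (1 - alpha)))
        # A group with too few calibration points gets an infinite
        # quantile: its prediction set is the whole label space.
        q_hat[g] = s[k - 1] if k <= len(s) else np.inf
    return q_hat
```

At prediction time the set for a test point uses the quantile of that point's group. The price of the per-group guarantee is that each quantile is calibrated on fewer points, so sets widen on small groups.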

Watch Out

The test point enters the quantile

Many first implementations compute the $(1 - \alpha)$-quantile of the $n_2$ calibration scores and stop there. The correct quantile level is $\lceil (n_2 + 1)(1 - \alpha) \rceil / n_2$, which accounts for the test point being one of the $n_2 + 1$ exchangeable scores. Using $(1 - \alpha)$ directly undercovers slightly for small $n_2$.
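A quick numeric illustration with made-up scores (numbers chosen here, not from the text): for $n_2 = 9$ and $\alpha = 0.1$ the adjusted rule demands the largest calibration score, while the naive quantile interpolates strictly below it.

```python
import numpy as np

scores = np.sort(np.random.default_rng(1).exponential(size=9))  # n2 = 9
alpha = 0.1

# Adjusted rule: rank ceil((9 + 1) * 0.9) = 9, i.e. the largest score.
k = int(np.ceil((len(scores) + 1) * (1 - alpha)))
q_adjusted = scores[k - 1]

# Naive rule: the 0.9 empirical quantile of the 9 scores sits lower.
q_naive = np.quantile(scores, 1 - alpha)

print(q_naive, q_adjusted)  # q_naive < q_adjusted, so the naive rule undercovers
```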

Watch Out

Split conformal uses half the data

Splitting sacrifices statistical efficiency: the predictor sees only $n_1$ training points and the calibration quantile is estimated from only $n_2$ points. Full conformal prediction uses all $n$ points for both roles by refitting per candidate label, at exponential cost. Jackknife+ and CV+ recover most of the efficiency at polynomial cost and are worth knowing as intermediate options.

Forward Connection

The exchangeability assumption is where split conformal begins to strain in practice. Training data collected six months ago and test data collected today are rarely exchangeable. Models deployed across different populations face covariate shift by design. The natural extension is to weight the calibration points by the likelihood ratio between the test and training distributions, recovering coverage at the cost of estimating that ratio. That is the subject of the weighted conformal prediction page.

An orthogonal extension brings in sequential inference. Conformal prediction as stated holds at a fixed sample size. If an analyst peeks at coverage as more data arrives and stops when convenient, the marginal guarantee breaks. The fix uses e-values and test martingales, developed on the e-values and anytime-valid inference page.

Proofs to Rederive by Hand

Write it by hand

Reproducing these proofs from scratch is how you stop recognizing and start owning. Target: reproduce each in a few minutes after a week.

Split conformal coverage (the five-line argument). The rank-uniformity step is the template for nearly every conformal result. Start from "$s_{n+1}$ has uniform rank on $\{1, \ldots, n_2 + 1\}$ among the exchangeable scores," convert that to $\mathbb{P}(s_{n+1} \leq s_{(k)}) = k/(n_2+1)$, and pick $k = \lceil (n_2+1)(1-\alpha) \rceil$. Know why the ceiling is there, not just that it is. After a week, target a clean reproduction in under three minutes.

Conditional coverage impossibility (the no-free-lunch construction). Build two distributions that agree on a training set but differ on a thin slice of $x$-space. Any procedure with uniform coverage across distributions must cover on both, so at any $x$ in that slice the set must be the whole $y$-axis. Average over $P_X$ to get infinite expected Lebesgue measure. Writing this out makes precise why distribution-free conditional coverage costs the entire real line, and why parametric prediction intervals are paying that cost with unexamined assumptions instead.

Exercises

ExerciseCore

Problem

Let the $n_2 = 99$ calibration residuals be the integers $1, 2, \ldots, 99$, and take $\alpha = 0.1$. Compute $\hat{q}$ and the resulting prediction interval width.

ExerciseCore

Problem

A colleague implements split conformal using the unadjusted $(1-\alpha)$-quantile of the calibration residuals. For $\alpha = 0.1$ and $n_2 = 20$, quantify the worst-case undercoverage gap introduced by this shortcut.

ExerciseAdvanced

Problem

Construct a joint distribution on $(X, Y)$ with $X \in \{0, 1\}$, $\mathbb{P}(X = 0) = 0.9$, a perfectly accurate predictor on the $X = 0$ subgroup, and a completely uninformative predictor on the $X = 1$ subgroup, such that split conformal achieves marginal coverage $\geq 0.9$ but conditional coverage on $X = 1$ is $0.0$. Explain what this implies for auditing conformal deployments.

ExerciseResearch

Problem

Jackknife+ (Barber, Candès, Ramdas, Tibshirani 2021) recovers most of the statistical efficiency lost by splitting, at polynomial rather than exponential cost. State the jackknife+ prediction set construction and identify the weakened coverage guarantee relative to split conformal. Under what predictor-stability condition is the guarantee strengthened back to $1 - \alpha$?

Open Problems and Frontier

Distribution-free conditional coverage under minimal structural assumptions is open. Current partial results require smoothness or compactness conditions that rarely hold in high dimensions. The question of whether some intermediate notion between marginal and conditional coverage can be achieved distribution-free remains active.

Conformal prediction for dependent data (time series, spatial) requires either explicit modelling of the dependence or the nonexchangeable framework of Barber, Candès, Ramdas, Tibshirani (2023). Neither gives guarantees as clean as the i.i.d. case.

Computational shortcuts for full conformal prediction, which split conformal approximates by discarding half the data, are an ongoing line of research. Jackknife+ and CV+ partially close the efficiency gap; influence-function-based approximations are the current frontier.

Anytime-valid conformal prediction using e-values is an active direction. The standard procedure has a guarantee at a single fixed sample size. Extending to online settings where the analyst peeks at coverage as more data arrives requires the e-value machinery covered in a separate page.

References

Canonical:

  • Vovk, Gammerman, Shafer, Algorithmic Learning in a Random World (Springer, 2005). Chapters 2-3. The original book-length treatment.
  • Shafer, Vovk, "A Tutorial on Conformal Prediction." Journal of Machine Learning Research 9 (2008), 371-421.

Modern pedagogical:

  • Angelopoulos, Bates, "A Gentle Introduction to Conformal Prediction and Distribution-Free Uncertainty Quantification." Foundations and Trends in Machine Learning 16(4) (2023), 494-591. The best single reference.
  • Lei, G'Sell, Rinaldo, Tibshirani, Wasserman, "Distribution-Free Predictive Inference for Regression." Journal of the American Statistical Association 113(523) (2018), 1094-1111.

Adaptive methods:

  • Romano, Patterson, Candès, "Conformalized Quantile Regression." NeurIPS 2019.
  • Romano, Sesia, Candès, "Classification with Valid and Adaptive Coverage." NeurIPS 2020.

Limits and extensions:

  • Barber, Candès, Ramdas, Tibshirani, "The Limits of Distribution-Free Conditional Predictive Inference." Information and Inference 10(2) (2021), 455-482.
  • Barber, Candès, Ramdas, Tibshirani, "Predictive Inference with the Jackknife+." Annals of Statistics 49(1) (2021), 486-507.
  • Barber, Candès, Ramdas, Tibshirani, "Conformal Prediction Beyond Exchangeability." Annals of Statistics 51(2) (2023), 816-845.


Last reviewed: April 24, 2026
