Confidence Sequences

Sneiderman, Robby

A sequence of intervals for a parameter that contains the true value at all times simultaneously, with high probability. Built by inverting an e-process: the interval at time $t$ is the set of parameter values whose e-process is still below the rejection threshold. The result is a live interval valid at every sample size, with no need to pre-specify the stopping rule. It narrows as data accumulate, and it can be made monotonically narrowing by taking the running intersection of the intervals seen so far.

Important · Advanced · Tier 1 · Current · Core spine · ~45 min

For: ML, Stats, Research

Prerequisites

E-Values, E-Processes, Anytime-Valid Inference, Concentration Inequalities

Why This Matters

A classical $(1 - \alpha)$ confidence interval at a fixed sample size $n$ guarantees coverage $\Pr(\theta \in \mathrm{CI}_n) \geq 1 - \alpha$ . The guarantee holds only when $n$ was pre-specified. Recompute the interval after each new observation and stop when it excludes a target value, and the empirical coverage drops below $1 - \alpha$ in repeated sampling.
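The coverage loss under peeking is easy to see in simulation. The sketch below is illustrative only: the function name `peeking_miscoverage` and every parameter value are assumptions chosen for the demonstration, not taken from the text. It recomputes a Wald interval after each observation of a Bernoulli stream and records how often the interval excludes the true mean at least once.

```python
import numpy as np

def peeking_miscoverage(p=0.5, t_max=1000, t_min=10, n_sims=2000, z=1.96, seed=0):
    """Fraction of simulated streams in which the running Wald interval
    excludes the true mean p at some time t in [t_min, t_max]."""
    rng = np.random.default_rng(seed)
    x = rng.binomial(1, p, size=(n_sims, t_max))
    t = np.arange(1, t_max + 1)
    p_hat = np.cumsum(x, axis=1) / t                      # running mean per stream
    half_width = z * np.sqrt(p_hat * (1 - p_hat) / t)     # Wald half-width at each t
    miss = np.abs(p_hat - p) > half_width                 # interval excludes p at time t
    return miss[:, t_min - 1:].any(axis=1).mean()         # "ever missed" frequency

rate = peeking_miscoverage()
print(f"ever-miss rate under continuous monitoring: {rate:.3f} (nominal 0.05)")
```

Each interval is individually a nominal 95% interval, yet the chance that at least one of them misses is far above 5%, which is the gap a confidence sequence closes.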

A confidence sequence (CS) is the anytime-valid replacement: a sequence of intervals $(\mathrm{CI}_t)_{t \geq 1}$ such that the true parameter lies in $\mathrm{CI}_t$ for every $t$ simultaneously with probability at least $1 - \alpha$ : $\Pr_\theta\!\left[\theta \in \mathrm{CI}_t \text{ for all } t \geq 1\right] \geq 1 - \alpha.$ The dual statement is that the event "the interval misses the truth at any time" has probability at most $\alpha$ .

The construction is the inversion of an anytime-valid test. For each candidate $\theta$, build an e-process $E_t(\theta)$ for the null $H_\theta$ that the parameter equals $\theta$; the CS at time $t$ is the set of $\theta$ for which $E_t(\theta) < 1/\alpha$. Ville's inequality guarantees the time-uniform coverage. The result is a live interval that narrows as data accumulate (monotonically, if the running intersection is used) and that the analyst can read off a dashboard at any time without invalidating the guarantee.

Applications arise wherever an estimate is monitored continuously: rolling A/B-test point estimates, online conversion-rate tracking, real-time RLHF reward calibration, sequential clinical-trial effect-size estimates, and live LLM-benchmark accuracy intervals.

Formal Setup

Definition

Confidence Sequence $(\mathrm{CI}_t)$

Let $\theta$ be a parameter of the distribution generating the data $X_1, X_2, \ldots$. A sequence of (data-dependent) sets $(\mathrm{CI}_t)_{t \geq 1}$ is a $(1 - \alpha)$-confidence sequence for $\theta$ if $\inf_\theta \Pr_\theta\!\left[\theta \in \mathrm{CI}_t \text{ for every } t \geq 1\right] \geq 1 - \alpha.$ Equivalently, $\Pr_\theta[\exists t : \theta \notin \mathrm{CI}_t] \leq \alpha$ for every $\theta$. By Ville's inequality, this holds when $\mathrm{CI}_t$ is obtained by inverting a family of anytime-valid level-$\alpha$ tests, one for each singleton null $H_{\theta'}$ that the parameter equals $\theta'$.

Definition

Time-uniform coverage

The defining property: a single random event "the interval covers the truth at every time $t$ " has probability at least $1 - \alpha$ . Stronger than per-time coverage ( $\Pr(\theta \in \mathrm{CI}_t) \geq 1 - \alpha$ for each $t$ ), which can hold for sequences that nonetheless fail to cover at some time with high probability.

Definition

Empirical Bernstein construction

A confidence sequence for the mean of a bounded random variable built from a running empirical-Bernstein concentration inequality (Howard, Ramdas, McAuliffe, Sekhon 2021). The radius shrinks like $\sqrt{(\log\log t)/t}$ , the law of iterated logarithm rate, which is optimal up to constants for distribution-free constructions.

Construction by Test Inversion

The standard construction inverts an anytime-valid test.

Theorem

Confidence Sequence via E-Process Inversion

Statement

Let $\{H_\theta : \theta \in \Theta\}$ be a family of point hypotheses indexed by parameter values. For each $\theta$ , let $(E_t(\theta))$ be an e-process for $H_\theta$ . Define $\mathrm{CI}_t = \{\theta \in \Theta : E_t(\theta) < 1/\alpha\}.$ Then $(\mathrm{CI}_t)$ is a $(1 - \alpha)$ -confidence sequence for the true parameter: $\Pr_{\theta^*}[\theta^* \in \mathrm{CI}_t \text{ for every } t \geq 1] \geq 1 - \alpha,$ where $\theta^*$ denotes the true parameter value.

Intuition

For each parameter value $\theta$, the e-process exceeds $1/\alpha$ at some time with probability at most $\alpha$ under $H_\theta$, by Ville's inequality. The set of $\theta$ for which the e-process is currently below $1/\alpha$ therefore contains the true parameter $\theta^*$ at every time, except on a failure event of probability at most $\alpha$.

Proof Sketch

Under $\theta = \theta^*$ (the true parameter), the process $(E_t(\theta^*))$ is an e-process for $H_{\theta^*}$ . By Ville's inequality, $\Pr_{\theta^*}(\sup_t E_t(\theta^*) \geq 1/\alpha) \leq \alpha$ . The complementary event is $\sup_t E_t(\theta^*) < 1/\alpha$ , which means $\theta^* \in \mathrm{CI}_t$ for every $t$ . Hence the probability of " $\theta^*$ in $\mathrm{CI}_t$ for every $t$ " is at least $1 - \alpha$ .

Why It Matters

Every modern confidence sequence is some form of e-process inversion. The choice of e-process determines the width: empirical-Bernstein gives the law-of-iterated-logarithm rate, mixture methods (Robbins 1970) give parameter-adaptive rates, betting constructions (Waudby-Smith-Ramdas 2024) give the tightest practical intervals for bounded outcomes.

Failure Mode

The construction requires an e-process for every candidate $\theta$ , not just for the true one. In practice this is achieved by parameterizing a family of betting strategies indexed by $\theta$ . If the e-process family is not jointly valid (for example, if the bets depend on the true $\theta$ rather than the test-candidate $\theta$ ), the inversion is incorrect.
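The inversion in the theorem can be sketched on a grid of candidate values. The example below is a minimal illustration under stated assumptions: outcomes lie in $[0, 1]$, the e-process is a two-sided Hoeffding-type supermartingale with a fixed bet size `lam`, and the names `hoeffding_eprocess_cs`, `lam`, and `grid_size` are hypothetical. It is meant to show the mechanics of "keep every $\theta$ whose e-process is below $1/\alpha$", not to give a tight interval.

```python
import numpy as np

def hoeffding_eprocess_cs(x, alpha=0.05, lam=0.2, grid_size=1001):
    """Grid inversion of a two-sided Hoeffding-type e-process for data in [0, 1].

    For each candidate theta,
        E_t(theta) = 0.5 * exp( lam * S_t - t * lam**2 / 8)
                   + 0.5 * exp(-lam * S_t - t * lam**2 / 8),
    with S_t = sum_{s <= t} (x_s - theta). Returns the endpoints of
    {theta : E_t(theta) < 1/alpha} for every t.
    """
    x = np.asarray(x, dtype=float)
    thetas = np.linspace(0.0, 1.0, grid_size)
    t = np.arange(1, len(x) + 1)
    # s[t-1, j] = sum_{s <= t} (x_s - thetas[j])
    s = np.cumsum(x)[:, None] - t[:, None] * thetas[None, :]
    penalty = (t * lam**2 / 8.0)[:, None]
    log_e = np.logaddexp(lam * s - penalty, -lam * s - penalty) + np.log(0.5)
    keep = log_e < np.log(1.0 / alpha)          # candidates not yet rejected
    any_kept = keep.any(axis=1)
    first = np.argmax(keep, axis=1)
    last = grid_size - 1 - np.argmax(keep[:, ::-1], axis=1)
    lower = np.where(any_kept, thetas[first], np.nan)
    upper = np.where(any_kept, thetas[last], np.nan)
    return lower, upper

rng = np.random.default_rng(0)
x = rng.binomial(1, 0.3, size=2000)
lower, upper = hoeffding_eprocess_cs(x)
print(f"t=2000: CS = [{lower[-1]:.3f}, {upper[-1]:.3f}]")
```

Every candidate $\theta$ gets its own e-process built from the same data and the same bet size, so none of the bets depends on the unknown true value. With a fixed `lam` the half-width behaves like $\log(2/\alpha)/(\lambda t) + \lambda/8$ and does not shrink to zero, which is why practical constructions mix over or adapt the bet size.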

Width and the Law of the Iterated Logarithm

The width of a confidence sequence at time $t$ scales differently from a classical CI:

Construction	Width at time $t$
Classical $z$ -test CI (fixed $n = t$ )	$\Theta(1/\sqrt{t})$
Hoeffding CS (anytime, bounded)	$\Theta(\sqrt{\log(t/\alpha)/t})$
Empirical-Bernstein CS	$\Theta(\sqrt{\log\log t/t} \cdot \sigma)$
Robbins mixture CS (Gaussian)	$\Theta(\sqrt{\log\log t/t})$

The $\log \log t$ rate matches the law of the iterated logarithm, the asymptotic rate at which the running sample mean fluctuates around the true mean. It is the best possible for a distribution-free anytime-valid construction.
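To get a feel for how slowly these rates separate, the sketch below evaluates the rate expressions from the table with all constants dropped (an assumption: real constructions carry constants and $\alpha$-dependent terms that the $\Theta(\cdot)$ notation hides) and prints each one relative to the classical $1/\sqrt{t}$ rate.

```python
import numpy as np

alpha = 0.05
t = np.array([1e2, 1e4, 1e6, 1e8])

classical = 1.0 / np.sqrt(t)                   # fixed-n rate
hoeffding = np.sqrt(np.log(t / alpha) / t)     # sqrt(log(t/alpha)/t) row of the table
lil = np.sqrt(np.log(np.log(t)) / t)           # sqrt(log log t / t) rows of the table

for ti, h, l in zip(t, hoeffding / classical, lil / classical):
    print(f"t={ti:>13,.0f}  log rate / classical = {h:4.2f}   loglog rate / classical = {l:4.2f}")
```

Across six orders of magnitude in $t$, the $\log\log$ factor changes by well under a factor of two, while the $\log$ factor keeps growing noticeably.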

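Quantitatively, for Bernoulli outcomes with $p = 0.5$, suppose an empirical-Bernstein CS at $\alpha = 0.05$ has half-width approximately $1.7\sqrt{p(1-p)/t}\sqrt{\log\log(et)}$ for large $t$. Relative to the classical $z$-CI half-width $1.96\sqrt{p(1-p)/t}$, the inflation factor is then $1.7\sqrt{\log\log(et)}/1.96$: about $1.14$ at $t = 100$ and about $1.32$ at $t = 10{,}000$. The factor grows without bound, but very slowly, because $\log\log$ grows slowly. This large-$t$ approximation drops lower-order terms in $\log(1/\alpha)$, so the inflation observed at small $t$ is larger than these figures suggest.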

Canonical Example: Live Conversion-Rate Interval

An A/B platform monitors conversion rate $p$ for a treatment variant. Outcomes arrive every minute as Bernoulli $(p)$ samples. The platform displays a running point estimate $\hat p_t$ and a $95\%$ confidence sequence around it.

The empirical-Bernstein construction (Howard-Ramdas-McAuliffe-Sekhon 2021, simplified): $\mathrm{CI}_t = \hat p_t \pm \sqrt{\frac{2 \hat\sigma_t^2 \log\log(et) + 3 \log(2/\alpha)}{t}}$ where $\hat\sigma_t^2$ is the running sample variance. The interval is wider than the classical Wald interval $\hat p_t \pm 1.96 \sqrt{\hat p_t (1 - \hat p_t)/t}$ by a factor that depends on $\log\log t$ and $\log(1/\alpha)$ .

Live example. At $t = 1000$ with $\hat p = 0.10$ , $\hat \sigma^2 \approx 0.09$ :

Classical Wald CI: $0.10 \pm 1.96 \sqrt{0.09/1000} = [0.0814, 0.1186]$ , width $0.037$ .
Empirical-Bernstein CS (simplified formula): $0.10 \pm \sqrt{(2 \cdot 0.09 \cdot \log\log(e \cdot 1000) + 3 \log 40)/1000} = 0.10 \pm \sqrt{(0.37 + 11.07)/1000} \approx 0.10 \pm 0.107$, more than five times the Wald half-width of $0.0186$. Almost all of that width comes from the $3\log(2/\alpha)$ term.

The simplified formula shows the structure of the bound but is not usable numerically; the construction in the paper has tighter constants and additional optimization. A correctly tuned empirical-Bernstein CS at $t = 1000$, $\hat p = 0.10$, $\alpha = 0.05$ gives a half-width around $0.022$ to $0.028$, a $20\%$ to $50\%$ inflation over the Wald interval. That inflation is the price of the anytime guarantee.
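The arithmetic of the live example can be checked directly. The snippet below evaluates the Wald half-width and the simplified radius exactly as written in this section, splitting the simplified radius into its variance term and its $\log(2/\alpha)$ term; the variable names are illustrative.

```python
import math

t, p_hat, var_hat, alpha = 1000, 0.10, 0.09, 0.05

wald_half = 1.96 * math.sqrt(var_hat / t)

# Simplified radius from the text:
#   sqrt((2 * var * log log(e t) + 3 * log(2 / alpha)) / t)
variance_term = 2 * var_hat * math.log(math.log(math.e * t))
alpha_term = 3 * math.log(2 / alpha)
simplified_half = math.sqrt((variance_term + alpha_term) / t)

print(f"Wald half-width:        {wald_half:.4f}")
print(f"variance term:          {variance_term:.3f}")
print(f"log(2/alpha) term:      {alpha_term:.3f}")
print(f"simplified half-width:  {simplified_half:.4f}")
```

The $\log(2/\alpha)$ term is roughly thirty times the variance term at this sample size, which is why the simplified expression is far wider than a tuned construction.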

Sequential Mean Estimation

For iid bounded outcomes $X_t \in [a, b]$, the canonical CS is the betting construction (Waudby-Smith and Ramdas 2024). At time $t$, the CS is the set of $\mu$ for which the running e-process $E_t(\mu)$ is below $1/\alpha$. The e-process is parameterized by a predictable betting fraction $\lambda_t(\mu)$, chosen from the data up to time $t - 1$ and restricted to $[-1/(b - \mu),\, 1/(\mu - a)]$ so that every factor stays nonnegative (for outcomes in $[0, 1]$, any $\lambda_t(\mu) \in [-1, 1]$ qualifies): $E_t(\mu) = \prod_{s \leq t} (1 + \lambda_s(\mu)(X_s - \mu)).$ Choosing $\lambda_s$ to maximize the expected log-payoff against alternatives in a small neighborhood of the true mean gives the tightest intervals. Modern implementations choose $\lambda_s$ adaptively from running variance estimates.
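A minimal sketch of a betting CS for outcomes in $[0, 1]$ follows. It is loosely modeled on the hedged-capital idea of Waudby-Smith and Ramdas, with several simplifying assumptions that are mine rather than the paper's or the `confseq` package's: two one-sided bets are averaged instead of using one signed $\lambda$, the bet size is a variance-based plug-in capped at `c / mu` or `c / (1 - mu)`, and candidates are scanned on a fixed grid. The function name `betting_cs_sketch` is hypothetical.

```python
import numpy as np

def betting_cs_sketch(x, alpha=0.05, grid_size=999, c=0.5):
    """Betting CS for outcomes in [0, 1], inverted on a grid of candidate means.

    Returns (lower, upper): the hull of {mu : E_t(mu) < 1/alpha} at each t.
    Illustrative only; not the confseq implementation.
    """
    x = np.asarray(x, dtype=float)
    mus = np.linspace(0.0, 1.0, grid_size + 2)[1:-1]  # interior grid avoids division by zero
    log_up = np.zeros_like(mus)   # log-capital of the bet "mean is above mu"
    log_dn = np.zeros_like(mus)   # log-capital of the bet "mean is below mu"
    lower = np.full(len(x), np.nan)
    upper = np.full(len(x), np.nan)
    sum_x, sum_sq_dev, var_prev = 0.0, 0.0, 0.25
    for i, xi in enumerate(x, start=1):
        # Predictable bet size: depends only on var_prev, i.e. on x[:i-1].
        lam = np.sqrt(2 * np.log(2 / alpha) / (var_prev * i * np.log(i + 1)))
        lam_up = np.minimum(lam, c / mus)           # keeps 1 + lam_up*(x - mu) >= 1 - c
        lam_dn = np.minimum(lam, c / (1.0 - mus))   # keeps 1 - lam_dn*(x - mu) >= 1 - c
        log_up += np.log1p(lam_up * (xi - mus))
        log_dn += np.log1p(-lam_dn * (xi - mus))
        log_e = np.logaddexp(log_up, log_dn) + np.log(0.5)   # average of the two bets
        kept = mus[log_e < np.log(1.0 / alpha)]
        if kept.size:
            lower[i - 1], upper[i - 1] = kept.min(), kept.max()
        # Update running statistics only after the bet has been placed.
        sum_x += xi
        mu_hat = (0.5 + sum_x) / (i + 1)
        sum_sq_dev += (xi - mu_hat) ** 2
        var_prev = (0.25 + sum_sq_dev) / (i + 1)
    return lower, upper

rng = np.random.default_rng(0)
x = rng.binomial(1, 0.3, size=1000)
lower, upper = betting_cs_sketch(x)
print(f"t=1000: CS = [{lower[-1]:.3f}, {upper[-1]:.3f}]")
```

Reporting the hull of the kept candidates can only enlarge the set, so coverage is not harmed if the kept set is not an interval. The order of operations inside the loop matters: the bet for observation $i$ is fixed before $x_i$ is used to update the variance estimate, which is what makes $\lambda$ predictable.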

For Gaussian outcomes with known $\sigma$, the mixture construction (Robbins 1970, Howard et al. 2021) gives closed-form CSs of the form $\mathrm{CI}_t = \bar X_t \pm \sqrt{\frac{2 \sigma^2 (t + 1)}{t^2} \log\left(\frac{\sqrt{t + 1}}{\alpha}\right)}.$ For unknown $\sigma$, the running sample SD can be plugged in, but the result is then only approximately (asymptotically) time-uniform; an exact guarantee requires a width correction or a different construction.
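The closed form above translates directly into a few lines of NumPy. The sketch assumes a known $\sigma$; the function name `gaussian_mixture_cs` and the simulated data are illustrative.

```python
import numpy as np

def gaussian_mixture_cs(x, sigma, alpha=0.05):
    """Running mean +/- the mixture radius
    sqrt(2 * sigma^2 * (t + 1) / t^2 * log(sqrt(t + 1) / alpha)) for t = 1..len(x)."""
    x = np.asarray(x, dtype=float)
    t = np.arange(1, len(x) + 1, dtype=float)
    mean = np.cumsum(x) / t
    radius = np.sqrt(2 * sigma**2 * (t + 1) / t**2 * np.log(np.sqrt(t + 1) / alpha))
    return mean - radius, mean + radius

rng = np.random.default_rng(1)
x = rng.normal(loc=2.0, scale=1.0, size=5000)
lower, upper = gaussian_mixture_cs(x, sigma=1.0)
half = (upper[-1] - lower[-1]) / 2
print(f"t=5000: CS = [{lower[-1]:.3f}, {upper[-1]:.3f}], half-width {half:.4f}")
print(f"fixed-n z half-width at t=5000: {1.96 / np.sqrt(5000):.4f}")
```

For known $\sigma$ the radius is a deterministic function of $t$, so it can be tabulated before any data arrive.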

Worked Exercise

ExerciseAdvanced

Problem

A live conversion-rate experiment has Bernoulli outcomes $X_1, X_2, \ldots$ iid with unknown $p$ . After $t = 200$ observations the running mean is $\hat p_t = 0.18$ with running sample variance $\hat \sigma_t^2 = 0.148$ . Compute a $95\%$ Wald confidence interval and contrast with a (simplified) empirical-Bernstein confidence sequence. At what time $t$ does the empirical-Bernstein half-width equal the fixed- $n$ Wald half-width at $n = 200$ ?
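One way to approach the exercise numerically is sketched below. It uses the simplified empirical-Bernstein radius from the Canonical Example, and it assumes the variance estimate stays fixed at $0.148$ as $t$ grows; both are simplifications, so the answer illustrates the order of magnitude rather than the behavior of a tuned CS.

```python
import math

alpha, var_hat, n = 0.05, 0.148, 200

wald_half = 1.96 * math.sqrt(var_hat / n)

def simplified_eb_half(t, var_hat=var_hat, alpha=alpha):
    """Simplified radius: sqrt((2*var*log log(e t) + 3*log(2/alpha)) / t)."""
    return math.sqrt(
        (2 * var_hat * math.log(math.log(math.e * t)) + 3 * math.log(2 / alpha)) / t
    )

# The simplified radius is decreasing in t on this range, so bisection
# finds the first t at which it is no larger than the Wald half-width.
lo, hi = n, 10**6
while hi - lo > 1:
    mid = (lo + hi) // 2
    if simplified_eb_half(mid) > wald_half:
        lo = mid
    else:
        hi = mid
t_match = hi

print(f"Wald half-width at n=200:         {wald_half:.4f}")
print(f"simplified EB half-width at 200:  {simplified_eb_half(n):.4f}")
print(f"simplified EB matches Wald at t ~ {t_match}")
```

Under these assumptions the simplified radius needs roughly twenty times as many observations to match the fixed-$n$ Wald half-width, again because of its loose $\log(2/\alpha)$ term.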

Implementation Note

The confseq Python package (Howard, Ramdas, McAuliffe, Sekhon 2021) ships ready-made empirical-Bernstein CSs for bounded outcomes and mixture-method CSs for Gaussian outcomes:

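import numpy as np
# Import path and function name as given in this note; verify both against
# the installed confseq version before relying on them.
from confseq.boundaries import empirical_bernstein_ci

# Placeholder data: simulated bounded outcomes in [0, 1].
x_samples = np.random.default_rng(0).binomial(1, 0.1, size=1000)
lower, upper = empirical_bernstein_ci(
    x_samples,           # observed bounded outcomes in [0, 1]
    alpha=0.05,
)
print(f"At t={len(x_samples)}, CS = [{lower:.4f}, {upper:.4f}]")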

For the bounded-outcome betting construction (Waudby-Smith-Ramdas 2024), the same package exposes betting_ci. The betting fraction is chosen adaptively from the running estimate; the user supplies only the data stream and $\alpha$ .

Operational rules:

The CS must be recomputed from all data seen so far after each new observation. Displaying cached bounds from an earlier time as if they were current is a common bug.
Width contracts monotonically only in expectation; on a given run, a new observation can widen the CS slightly if it moves the running variance estimate. Plot the CS as a band over time; with probability $\geq 1 - \alpha$, the band contains the true mean at every time simultaneously.
For two-sample tests (treatment vs control), build a CS on the difference of means rather than two separate CSs on each.
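Regarding the second rule above: if a band that never widens is preferred for display, a standard device is the running intersection of the freshly recomputed intervals. On the event that every $\mathrm{CI}_t$ covers the truth, so does their intersection, so the $1 - \alpha$ guarantee is unchanged. The helper name `running_intersection` and the toy bounds below are illustrative.

```python
import numpy as np

def running_intersection(lower, upper):
    """Intersect each interval with all earlier ones:
    lower bounds become a running max, upper bounds a running min."""
    lower = np.maximum.accumulate(np.asarray(lower, dtype=float))
    upper = np.minimum.accumulate(np.asarray(upper, dtype=float))
    return lower, upper

# Toy raw CS bounds in which the third interval is wider than the second.
raw_lower = np.array([0.05, 0.08, 0.07, 0.09])
raw_upper = np.array([0.30, 0.22, 0.24, 0.20])
lo, hi = running_intersection(raw_lower, raw_upper)
print(lo)
print(hi)
```

If the intersected lower bound ever exceeds the upper bound, the intersection is empty; that can only happen on the low-probability failure event or when a modeling assumption is violated, so it is worth logging rather than hiding.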

Practical Example: LLM Benchmark Accuracy Tracking

An LLM evaluation pipeline runs an MMLU-style benchmark on a stream of items. After each item, the system computes a running CS on the model's accuracy. The pipeline stops when the CS half-width drops below a target precision (say, $\pm 1\%$) or when the budget is exhausted.

The procedure:

Each item produces a binary correctness outcome $X_t \in \{0, 1\}$ .
The empirical-Bernstein CS gives $[\hat p_t - r_t, \hat p_t + r_t]$ where $r_t$ shrinks at the LIL rate.
The dashboard displays $\hat p_t \pm r_t$ live and the stopping rule "stop when $r_t < 0.01$ or $t = 10000$ " is anytime-valid.
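The procedure above can be sketched as a short loop. As a stand-in for the empirical-Bernstein radius, this example uses the Gaussian mixture radius from the Sequential Mean Estimation section with $\sigma = 1/2$, which is valid for outcomes in $[0, 1]$ because such outcomes are $1/2$-sub-Gaussian; that substitution, the function name `run_until_precise`, and the simulated accuracy of $0.7$ are assumptions made for the illustration.

```python
import numpy as np

def run_until_precise(outcomes, target=0.01, budget=10_000, alpha=0.05, sigma=0.5):
    """Consume binary outcomes until the CS radius is below `target`
    or `budget` items have been scored. Returns (t, lower, upper)."""
    total, t, radius = 0.0, 0, float("inf")
    for t, x in enumerate(outcomes, start=1):
        total += x
        # Sub-Gaussian mixture radius (stand-in for empirical-Bernstein).
        radius = np.sqrt(2 * sigma**2 * (t + 1) / t**2 * np.log(np.sqrt(t + 1) / alpha))
        if radius < target or t >= budget:
            break
    if t == 0:
        raise ValueError("no outcomes were provided")
    p_hat = total / t
    return t, max(0.0, p_hat - radius), min(1.0, p_hat + radius)

rng = np.random.default_rng(0)
stream = rng.binomial(1, 0.7, size=10_000)   # simulated per-item correctness

t_loose, lo_loose, hi_loose = run_until_precise(stream, target=0.05)
t_tight, lo_tight, hi_tight = run_until_precise(stream, target=0.01)
print(f"target 0.05: stopped at t={t_loose}, CS = [{lo_loose:.3f}, {hi_loose:.3f}]")
print(f"target 0.01: stopped at t={t_tight}, CS = [{lo_tight:.3f}, {hi_tight:.3f}]")
```

With this variance-free stand-in the radius at $t = 10{,}000$ is still about $0.019$, so the $\pm 1\%$ target is not reached and the budget clause ends the run. A variance-adaptive radius such as empirical-Bernstein can be narrower when the accuracy is far from $0.5$.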

The OpenAI Evals project and the EleutherAI evaluation pipeline have both begun adopting confidence-sequence-based stopping for adaptive benchmark sizing. The classical alternative is to pre-specify a sample size based on a power calculation; the CS-based approach uses fewer samples for easy benchmarks (early stop) and more for hard ones, all at a uniform $95\%$ coverage level.

References

Canonical:

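Howard, S. R., Ramdas, A., McAuliffe, J., and Sekhon, J. (2021). "Time-uniform, nonparametric, nonasymptotic confidence sequences." Annals of Statistics 49(2), pp. 1055-1080. The reference paper for empirical-Bernstein confidence sequences, with explicit constants and practical bounded-outcome CSs. The underlying supermartingale framework is developed in the same authors' "Time-uniform Chernoff bounds via nonnegative supermartingales," Probability Surveys 17 (2020), pp. 257-317.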
Waudby-Smith, I. and Ramdas, A. (2024). "Estimating means of bounded random variables by betting." Journal of the Royal Statistical Society, Series B 86(1), pp. 1-27. The betting construction with the tightest known intervals for bounded outcomes.
Ramdas, A., Grünwald, P., Vovk, V., and Shafer, G. (2023). "Game-theoretic statistics and safe anytime-valid inference." Statistical Science 38(4), pp. 576-601. Section 5 covers confidence sequences in the broader anytime-valid framework.

Historical:

Robbins, H. (1970). "Statistical methods related to the law of the iterated logarithm." Annals of Mathematical Statistics 41(5), pp. 1397-1409. The original mixture-method confidence sequences for Gaussian means.
Darling, D. A. and Robbins, H. (1967). "Confidence sequences for mean, variance, and median." Proceedings of the National Academy of Sciences 58(1). The first appearance of the term "confidence sequence."

Current applications:

Johari, R., Koomen, P., Pekelis, L., and Walsh, D. (2017). "Peeking at A/B Tests: Why it matters, and what to do about it." Proceedings of KDD 2017. Optimizely's adoption of always-valid p-values and confidence sequences for industry A/B testing.
Maharaj, K., Williamson, R. J., Mathieu, A., and Williamson, T. (2023). "Anytime-valid inference for multinomial counts and stratified means via betting." Preprint arXiv:2310.19527. Multinomial CSs.

Next Topics

Anytime-valid inference: the framing of inference under continuous monitoring.
Safe testing: the test-side framework that complements confidence sequences.
E-processes: the underlying object that powers every CS via inversion.
E-values: the single-shot version of the evidence statistic.
E-values and anytime-valid inference: the umbrella reference with proofs and applications.

Last reviewed: May 13, 2026
