Sequential Inference

Confidence Sequences

A sequence of intervals for a parameter that contain the true value at all times simultaneously, with high probability. Built by inverting an e-process: at each time, the interval is the set of parameter values whose e-process is still below the rejection threshold. The result is a live interval that narrows as data accumulate and is valid at every sample size, with no need to pre-specify the stopping rule.


Why This Matters

A classical $(1 - \alpha)$ confidence interval at a fixed sample size $n$ guarantees coverage $\Pr(\theta \in \mathrm{CI}_n) \geq 1 - \alpha$. The guarantee holds only when $n$ was pre-specified. If the analyst recomputes the interval after each new observation and stops as soon as it excludes a target value, the coverage in repeated sampling drops below $1 - \alpha$.
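The coverage loss from peeking can be checked by simulation. The sketch below is illustrative only: the function name `wald_peeking_miscoverage`, the choice $p = 0.5$, the horizon of 500 observations, and the minimum look size of 10 are assumptions made for this example. It compares how often a Wald interval misses the truth at the final sample size with how often it misses at least once across all looks.

```python
import math
import random


def wald_peeking_miscoverage(p=0.5, n_max=500, n_trials=2000,
                             z=1.96, min_n=10, seed=0):
    """Return (miss rate at the final n, miss rate at any look) for Wald CIs."""
    rng = random.Random(seed)
    missed_final = 0
    missed_any = 0
    for _ in range(n_trials):
        successes = 0
        any_miss = False
        miss = False
        for t in range(1, n_max + 1):
            successes += rng.random() < p
            if t >= min_n:
                p_hat = successes / t
                half = z * math.sqrt(p_hat * (1 - p_hat) / t)
                miss = abs(p_hat - p) > half   # truth outside the interval now
                any_miss = any_miss or miss    # truth outside at some look so far
        missed_final += miss
        missed_any += any_miss
    return missed_final / n_trials, missed_any / n_trials


final_rate, any_rate = wald_peeking_miscoverage()
print(f"miss at final n: {final_rate:.3f}, miss at some look: {any_rate:.3f}")
```

The first rate stays near the nominal level, while the second is far above it; that gap is what a confidence sequence is designed to close.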

A confidence sequence (CS) is the anytime-valid replacement: a sequence of intervals $(\mathrm{CI}_t)_{t \geq 1}$ such that the true parameter lies in $\mathrm{CI}_t$ for every $t$ simultaneously with probability at least $1 - \alpha$:
$$\Pr_\theta\!\left[\theta \in \mathrm{CI}_t \text{ for all } t \geq 1\right] \geq 1 - \alpha.$$
Equivalently, the event "the interval misses the truth at some time" has probability at most $\alpha$.

The construction inverts an anytime-valid test. For each candidate $\theta$, build an e-process $E_t(\theta)$ for the null $H_\theta: \mu = \theta$; the CS at time $t$ is the set of $\theta$ for which $E_t(\theta) < 1/\alpha$. Ville's inequality guarantees the time-uniform coverage. The result is a live interval that narrows as data accumulate (monotonically, if one takes the running intersection), and the analyst can read it off a dashboard at any time without invalidating the guarantee.

Applications arise wherever continuous monitoring of an estimate matters: rolling A/B-test point estimates, online conversion-rate tracking, real-time RLHF reward calibration, sequential clinical-trial effect-size estimates, and live LLM-benchmark accuracy intervals.

Formal Setup

Definition

Confidence Sequence

Let $\theta$ be a parameter of the distribution generating the data $X_1, X_2, \ldots$. A sequence of (data-dependent) sets $(\mathrm{CI}_t)_{t \geq 1}$ is a $(1 - \alpha)$-confidence sequence for $\theta$ if
$$\inf_\theta \Pr_\theta\!\left[\theta \in \mathrm{CI}_t \text{ for every } t \geq 1\right] \geq 1 - \alpha.$$
Equivalently, $\Pr_\theta[\exists t : \theta \notin \mathrm{CI}_t] \leq \alpha$ for every $\theta$. By Ville's inequality, this holds when $\mathrm{CI}_t$ is the inversion of anytime-valid level-$\alpha$ tests of the singleton nulls $H_\theta: \mu = \theta$.

Definition

Time-uniform coverage

The defining property: a single random event "the interval covers the truth at every time $t$" has probability at least $1 - \alpha$. This is stronger than per-time coverage ($\Pr(\theta \in \mathrm{CI}_t) \geq 1 - \alpha$ for each $t$), which can hold for sequences that nonetheless fail to cover at some time with high probability.

Definition

Empirical Bernstein construction

A confidence sequence for the mean of a bounded random variable built from a running empirical-Bernstein concentration inequality (Howard, Ramdas, McAuliffe, Sekhon 2021). The radius shrinks like $\sqrt{(\log\log t)/t}$, the law-of-the-iterated-logarithm rate, which is optimal up to constants for distribution-free constructions.

Construction by Test Inversion

The standard construction inverts an anytime-valid test.

Theorem

Confidence Sequence via E-Process Inversion

Statement

Let $\{H_\theta : \theta \in \Theta\}$ be a family of point hypotheses indexed by parameter values. For each $\theta$, let $(E_t(\theta))_{t \geq 1}$ be an e-process for $H_\theta$. Define
$$\mathrm{CI}_t = \{\theta \in \Theta : E_t(\theta) < 1/\alpha\}.$$
Then $(\mathrm{CI}_t)$ is a $(1 - \alpha)$-confidence sequence for the true parameter:
$$\Pr_{\theta^*}[\theta^* \in \mathrm{CI}_t \text{ for every } t \geq 1] \geq 1 - \alpha,$$
where $\theta^*$ denotes the true parameter value.

Intuition

For each parameter value $\theta$, the e-process reaches $1/\alpha$ at some time with probability at most $\alpha$ under $H_\theta$, by Ville's inequality. The set of $\theta$ for which the e-process has not yet reached $1/\alpha$ therefore contains the true parameter $\theta^*$ at every time, except on a failure event of probability at most $\alpha$.

Proof Sketch

Under $\theta = \theta^*$ (the true parameter), the process $(E_t(\theta^*))$ is an e-process for $H_{\theta^*}$. By Ville's inequality, $\Pr_{\theta^*}(\sup_t E_t(\theta^*) \geq 1/\alpha) \leq \alpha$. The complementary event is $\sup_t E_t(\theta^*) < 1/\alpha$, which means $\theta^* \in \mathrm{CI}_t$ for every $t$. Hence the probability of "$\theta^* \in \mathrm{CI}_t$ for every $t$" is at least $1 - \alpha$.

Why It Matters

Every modern confidence sequence is some form of e-process inversion. The choice of e-process determines the width: empirical-Bernstein gives the law-of-iterated-logarithm rate, mixture methods (Robbins 1970) give parameter-adaptive rates, betting constructions (Waudby-Smith-Ramdas 2024) give the tightest practical intervals for bounded outcomes.

Failure Mode

The construction requires an e-process for every candidate $\theta$, not just for the true one. In practice this is achieved by parameterizing a family of betting strategies indexed by $\theta$. If the e-process family is not jointly valid (for example, if the bets depend on the true $\theta^*$ rather than on the candidate $\theta$ under test), the inversion is incorrect.
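The inversion step can be sketched on a grid of candidate values. The example below is a minimal illustration, not a recommended construction: it assumes outcomes in $[0, 1]$, uses a two-sided Hoeffding-type e-process with a fixed tuning parameter `lam` (valid by Hoeffding's lemma, but not tuned for width), and the names `hoeffding_e_value` and `invert_e_process` are hypothetical.

```python
import math


def hoeffding_e_value(xs, theta, lam=0.5):
    """E_t(theta) for H_theta: mean = theta, assuming each x lies in [0, 1].

    Averages the supermartingales exp(+lam*S - t*lam^2/8) and
    exp(-lam*S - t*lam^2/8), where S is the sum of (x - theta).
    """
    t = len(xs)
    s = sum(x - theta for x in xs)
    penalty = t * lam ** 2 / 8.0
    return 0.5 * (math.exp(lam * s - penalty) + math.exp(-lam * s - penalty))


def invert_e_process(xs, alpha=0.05, grid_size=101, lam=0.5):
    """CI_t = {theta on the grid : E_t(theta) < 1/alpha}."""
    grid = [i / (grid_size - 1) for i in range(grid_size)]
    return [th for th in grid if hoeffding_e_value(xs, th, lam) < 1 / alpha]


data = [1, 0] * 100            # deterministic toy stream with mean 0.5
cs = invert_e_process(data)
print(f"t={len(data)}: CS on grid = [{min(cs):.2f}, {max(cs):.2f}]")
```

Each candidate is tested with its own e-process, which is exactly the "for every $\theta$" requirement in the theorem; a fixed `lam` keeps the example short at the cost of a wider set than adaptive methods give.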

Width and the Law of the Iterated Logarithm

The width of a confidence sequence at time $t$ scales differently from a classical CI:

| Construction | Width at time $t$ |
| --- | --- |
| Classical $z$-test CI (fixed $n = t$) | $\Theta(1/\sqrt{t})$ |
| Hoeffding CS (anytime, bounded) | $\Theta(\sqrt{\log(t/\alpha)/t})$ |
| Empirical-Bernstein CS | $\Theta(\sigma\sqrt{\log\log t/t})$ |
| Robbins mixture CS (Gaussian) | $\Theta(\sqrt{\log\log t/t})$ |

The $\log\log t$ rate matches the law of the iterated logarithm, the asymptotic rate at which the running sample mean fluctuates around the true mean. It is the best possible for a distribution-free anytime-valid construction.

Quantitatively, for Bernoulli outcomes with $p = 0.5$, an empirical-Bernstein CS at $\alpha = 0.05$ has half-width approximately $1.7\sqrt{p(1-p)/t}\,\sqrt{\log\log(et)}$ for large $t$. Against the classical $z$ half-width $1.96\sqrt{p(1-p)/t}$, this leading-order formula gives an inflation factor of $1.7\sqrt{\log\log(et)}/1.96$: about $1.14$ at $t = 100$ and about $1.32$ at $t = 10{,}000$. The factor grows with $t$, but slowly, because $\log\log$ grows slowly. Lower-order terms in $\log(1/\alpha)$, omitted from this approximation, widen practical CSs further at small $t$.
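The inflation factor implied by the leading-order approximation above can be evaluated directly. This is plain arithmetic on the stated formula, with the constant $1.7$ taken from the text; the function name `leading_order_inflation` is a hypothetical helper.

```python
import math


def leading_order_inflation(t, constant=1.7, z=1.96):
    """Ratio of constant*sqrt(log log(e*t)) to the classical z multiplier."""
    return constant * math.sqrt(math.log(math.log(math.e * t))) / z


for t in (100, 10_000, 1_000_000):
    print(f"t={t:>9,}: inflation factor = {leading_order_inflation(t):.3f}")
```

The common factor $\sqrt{p(1-p)/t}$ cancels in the ratio, so the result does not depend on $p$.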

Canonical Example: Live Conversion-Rate Interval

An A/B platform monitors conversion rate $p$ for a treatment variant. Outcomes arrive every minute as $\mathrm{Bernoulli}(p)$ samples. The platform displays a running point estimate $\hat p_t$ and a $95\%$ confidence sequence around it.

The empirical-Bernstein construction (Howard-Ramdas-McAuliffe-Sekhon 2021, simplified):
$$\mathrm{CI}_t = \hat p_t \pm \sqrt{\frac{2 \hat\sigma_t^2 \log\log(et) + 3 \log(2/\alpha)}{t}}$$
where $\hat\sigma_t^2$ is the running sample variance. The interval is wider than the classical Wald interval $\hat p_t \pm 1.96 \sqrt{\hat p_t (1 - \hat p_t)/t}$ by a factor that depends on $\log\log t$ and $\log(1/\alpha)$.

Live example. At $t = 1000$ with $\hat p = 0.10$, $\hat\sigma^2 \approx 0.09$:

  • Classical Wald CI: $0.10 \pm 1.96 \sqrt{0.09/1000} = [0.0814, 0.1186]$, width $0.037$.
  • Empirical-Bernstein CS (simplified formula): $0.10 \pm \sqrt{(2 \cdot 0.09 \cdot \log\log(e \cdot 1000) + 3 \log 40)/1000} = 0.10 \pm \sqrt{(0.37 + 11.07)/1000} \approx 0.10 \pm 0.107$, which is far too wide to be useful.

The construction in the paper has tighter constants and additional optimization; the simplified formula here illustrates the structure of the radius but should not be used numerically. A correctly tuned empirical-Bernstein CS at $t = 1000$, $\hat p = 0.10$, $\alpha = 0.05$ gives a half-width around $0.022$ to $0.028$, a $20\%$ to $50\%$ inflation over the Wald half-width of about $0.019$. That inflation is the price of the anytime guarantee.
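The arithmetic in the live example can be reproduced in a few lines. The sketch evaluates only the simplified radius formula quoted in this section, not the tuned construction from the paper; `wald_half_width` and `simplified_eb_half_width` are hypothetical helper names.

```python
import math


def wald_half_width(var_hat, n, z=1.96):
    """Half-width of the fixed-n Wald interval."""
    return z * math.sqrt(var_hat / n)


def simplified_eb_half_width(var_hat, t, alpha=0.05):
    """Half-width from the simplified empirical-Bernstein formula in the text."""
    loglog = math.log(math.log(math.e * t))
    return math.sqrt((2 * var_hat * loglog + 3 * math.log(2 / alpha)) / t)


t, var_hat = 1000, 0.09
wald = wald_half_width(var_hat, t)
eb = simplified_eb_half_width(var_hat, t)
print(f"Wald half-width:          {wald:.4f}")
print(f"simplified EB half-width: {eb:.4f}")
print(f"ratio:                    {eb / wald:.2f}")
```

The $3\log(2/\alpha)$ term dominates the numerator at this sample size, which is why the simplified formula is so much wider than the Wald interval here.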

Sequential Mean Estimation

For iid bounded outcomes $X_t \in [a, b]$, the canonical CS is the betting construction (Waudby-Smith and Ramdas 2024). At time $t$, the CS is the set of $\mu$ for which the running e-process $E_t(\mu)$ has not yet exceeded $1/\alpha$. After rescaling the outcomes to $[0, 1]$, the e-process is parameterized by a predictable betting fraction $\lambda_t(\mu) \in [-1, 1]$, a range that keeps every factor nonnegative:
$$E_t(\mu) = \prod_{s \leq t} \bigl(1 + \lambda_s(\mu)(X_s - \mu)\bigr).$$
Choosing $\lambda_s$ to maximize the expected log-payoff against a small neighborhood of the true mean gives the tightest intervals. Modern implementations choose $\lambda_s$ adaptively from running variance estimates.
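A grid-based sketch of the betting construction follows. It is a simplified illustration under stated assumptions, not the tuned method from the paper: outcomes are assumed to lie in $[0, 1]$, the bet size is a variance-adaptive rule chosen here for illustration and truncated at `lam_max`, the same bet is used for every candidate, and the two directions are hedged with equal weight. The name `grid_betting_cs` is hypothetical.

```python
import math


def grid_betting_cs(xs, alpha=0.05, grid_size=201, lam_max=0.5):
    """Return grid points m whose hedged betting wealth is below 1/alpha."""
    grid = [i / (grid_size - 1) for i in range(grid_size)]
    log_up = [0.0] * grid_size     # log-wealth when betting that the mean exceeds m
    log_down = [0.0] * grid_size   # log-wealth when betting that the mean is below m
    n, mean, var_sum = 1, 0.5, 0.25   # one pseudo-observation regularizes the variance
    for x in xs:
        # Predictable bet: computed from past observations only.
        var_hat = var_sum / n
        lam = min(lam_max, math.sqrt(
            2 * math.log(2 / alpha) / (var_hat * n * math.log(n + 1))))
        for i, m in enumerate(grid):
            log_up[i] += math.log1p(lam * (x - m))
            log_down[i] += math.log1p(-lam * (x - m))
        # Update running statistics after the bet has been placed.
        n += 1
        mean += (x - mean) / n
        var_sum += (x - mean) ** 2
    threshold = math.log(1 / alpha)
    kept = []
    for i, m in enumerate(grid):
        top = max(log_up[i], log_down[i])
        log_wealth = top + math.log(
            0.5 * math.exp(log_up[i] - top) + 0.5 * math.exp(log_down[i] - top))
        if log_wealth < threshold:
            kept.append(m)
    return kept


cs = grid_betting_cs([1, 0] * 250)
print(f"CS on grid: [{min(cs):.3f}, {max(cs):.3f}]")
```

Truncating the bet at `lam_max = 0.5` keeps every factor $1 \pm \lambda(X_s - m)$ at or above $0.5$, so the wealth stays positive for every candidate; working in log-wealth avoids overflow on long streams.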

For Gaussian outcomes, the mixture construction (Robbins 1970, Howard et al. 2021) gives closed-form CSs of the form
$$\mathrm{CI}_t = \bar X_t \pm \sqrt{\frac{2 \sigma^2 (t + 1)}{t^2} \log\left(\frac{\sqrt{t + 1}}{\alpha}\right)}.$$
The radius of this normal-mixture boundary is of order $\sqrt{\log(t)/t}$; mixtures that achieve the $\log\log t$ rate use other mixing distributions. For unknown $\sigma$, plugging in the running sample SD is common in practice, but the time-uniform guarantee then requires a width correction.
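The closed-form Gaussian radius is straightforward to evaluate. The sketch below codes the displayed formula as written, assuming a known $\sigma$; the function name `gaussian_mixture_radius` is hypothetical.

```python
import math


def gaussian_mixture_radius(t, sigma, alpha=0.05):
    """Radius of the closed-form Gaussian mixture CS at time t (known sigma)."""
    return math.sqrt(
        2 * sigma ** 2 * (t + 1) / t ** 2 * math.log(math.sqrt(t + 1) / alpha))


for t in (10, 100, 1000, 10_000):
    cs_radius = gaussian_mixture_radius(t, sigma=1.0)
    z_radius = 1.96 / math.sqrt(t)          # fixed-n z-interval, for comparison
    print(f"t={t:>6}: CS radius {cs_radius:.4f}, z radius {z_radius:.4f}")
```

Because the radius depends on the data only through $\bar X_t$, a live dashboard needs to track just the running mean and the sample count.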

Worked Exercise

ExerciseAdvanced

Problem

A live conversion-rate experiment has Bernoulli outcomes $X_1, X_2, \ldots$ iid with unknown $p$. After $t = 200$ observations the running mean is $\hat p_t = 0.18$ with running sample variance $\hat\sigma_t^2 = 0.148$. Compute a $95\%$ Wald confidence interval and contrast with a (simplified) empirical-Bernstein confidence sequence. At what time $t$ does the empirical-Bernstein half-width equal the fixed-$n$ Wald half-width at $n = 200$?
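One way to check an answer numerically is sketched below. It assumes the simplified empirical-Bernstein radius from the conversion-rate example and holds the variance estimate fixed at $0.148$ for all $t$, which is a simplification; the helper names are hypothetical.

```python
import math


def wald_half_width(var_hat, n, z=1.96):
    return z * math.sqrt(var_hat / n)


def simplified_eb_half_width(var_hat, t, alpha=0.05):
    loglog = math.log(math.log(math.e * t))
    return math.sqrt((2 * var_hat * loglog + 3 * math.log(2 / alpha)) / t)


def first_time_below(target, var_hat, alpha=0.05, t_max=10 ** 7):
    """Smallest t whose simplified EB half-width is at most target (bisection)."""
    lo, hi = 1, t_max            # half-width is above target at lo, below at hi
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if simplified_eb_half_width(var_hat, mid, alpha) <= target:
            hi = mid
        else:
            lo = mid
    return hi


var_hat = 0.148
wald = wald_half_width(var_hat, 200)
eb_200 = simplified_eb_half_width(var_hat, 200)
t_star = first_time_below(wald, var_hat)
print(f"Wald CI at n=200: 0.18 +/- {wald:.4f}")
print(f"simplified EB half-width at t=200: {eb_200:.4f}")
print(f"EB half-width matches the Wald half-width near t = {t_star}")
```

The bisection relies on the simplified half-width decreasing in $t$, which holds for this formula because the $1/t$ factor outweighs the slow growth of $\log\log(et)$.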

Implementation Note

The confseq Python package (Howard, Ramdas, McAuliffe, Sekhon 2021) ships ready-made empirical-Bernstein CSs for bounded outcomes and mixture-method CSs for Gaussian outcomes:

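```python
import numpy as np
# Import path and function name are as given in this note; check them against the installed confseq version.
from confseq.boundaries import empirical_bernstein_ci

rng = np.random.default_rng(0)
x_samples = rng.binomial(1, 0.1, size=1000)  # illustrative Bernoulli(0.1) stream

lower, upper = empirical_bernstein_ci(
    x_samples,           # observed bounded outcomes in [0, 1]
    alpha=0.05,
)
print(f"At t={len(x_samples)}, CS = [{lower:.4f}, {upper:.4f}]")
```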

For the bounded-outcome betting construction (Waudby-Smith-Ramdas 2024), the same package exposes betting_ci. The betting fraction is chosen adaptively from the running estimate; the user supplies only the data stream and $\alpha$.

Operational rules:

  • The CS must be recomputed from scratch after each new observation, not incrementally updated by a one-line patch. Caching the previous CS bounds is a common bug.
  • Width contracts monotonically only in expectation; on a given run, a new observation can widen the CS slightly if it moves the running variance estimate. Plot the CS as a band over time; with probability $\geq 1 - \alpha$, the entire band contains the true mean.
  • For two-sample tests (treatment vs control), build a CS on the difference of means rather than two separate CSs on each.

Practical Example: LLM Benchmark Accuracy Tracking

An LLM evaluation pipeline runs an MMLU-style benchmark on a stream of items. After each item, the system computes a running CS on the model's accuracy. The pipeline stops when the CS half-width drops below a target precision (say, $\pm 1\%$) or when the budget is exhausted.

The procedure:

  1. Each item produces a binary correctness outcome $X_t \in \{0, 1\}$.
  2. The empirical-Bernstein CS gives $[\hat p_t - r_t, \hat p_t + r_t]$ where $r_t$ shrinks at the LIL rate.
  3. The dashboard displays $\hat p_t \pm r_t$ live and the stopping rule "stop when $r_t < 0.01$ or $t = 10000$" is anytime-valid.
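The three steps above can be sketched as a monitoring loop. This is a minimal illustration under stated assumptions: the radius is the simplified empirical-Bernstein formula from the conversion-rate example (illustrative, not the tuned bound), the item stream is a deterministic toy stream, and the names `simplified_eb_radius` and `monitor` are hypothetical.

```python
import math
from itertools import cycle


def simplified_eb_radius(t, var_hat, alpha=0.05):
    """Simplified empirical-Bernstein radius (illustrative constants)."""
    loglog = math.log(math.log(math.e * t))
    return math.sqrt((2 * var_hat * loglog + 3 * math.log(2 / alpha)) / t)


def monitor(stream, target_radius, budget, alpha=0.05):
    """Consume 0/1 outcomes until the radius is small enough or the budget is hit.

    Returns (t, p_hat, radius) at the stopping time.
    """
    t, correct = 0, 0
    p_hat, radius = float("nan"), float("inf")
    for x in stream:
        t += 1
        correct += x
        p_hat = correct / t
        radius = simplified_eb_radius(t, p_hat * (1 - p_hat), alpha)
        if radius < target_radius or t >= budget:
            break
    return t, p_hat, radius


# Toy stream alternating correct/incorrect answers (accuracy 0.5).
t_stop, p_hat, r = monitor(cycle([1, 0]), target_radius=0.05, budget=10_000)
print(f"stopped at t={t_stop}: {p_hat:.3f} +/- {r:.3f}")
```

With the conservative constants of the simplified formula, a target of $\pm 0.01$ is not reached within $10{,}000$ items on this stream, so the loop would stop at the budget; a tighter radius function can be passed through the same loop structure.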

The OpenAI Evals project and the EleutherAI evaluation pipeline have both begun adopting confidence-sequence-based stopping for adaptive benchmark sizing. The classical alternative is to pre-specify a sample size based on a power calculation; the CS-based approach uses fewer samples for easy benchmarks (early stop) and more for hard ones, all at uniform $95\%$ coverage.

References

Canonical:

  • Howard, S. R., Ramdas, A., McAuliffe, J., and Sekhon, J. (2021). "Time-uniform, nonparametric, nonasymptotic confidence sequences." Annals of Statistics 49(2), pp. 1055-1080. The reference paper for empirical-Bernstein confidence sequences, with explicit constants and practical bounded-outcome CSs.
  • Waudby-Smith, I. and Ramdas, A. (2024). "Estimating means of bounded random variables by betting." Journal of the Royal Statistical Society, Series B 86(1), pp. 1-27. The betting construction with the tightest known intervals for bounded outcomes.
  • Ramdas, A., Grünwald, P., Vovk, V., and Shafer, G. (2023). "Game-theoretic statistics and safe anytime-valid inference." Statistical Science 38(4), pp. 576-601. Section 5 covers confidence sequences in the broader anytime-valid framework.

Historical:

  • Robbins, H. (1970). "Statistical methods related to the law of the iterated logarithm." Annals of Mathematical Statistics 41(5), pp. 1397-1409. The original mixture-method confidence sequences for Gaussian means.
  • Darling, D. A. and Robbins, H. (1967). "Confidence sequences for mean, variance, and median." Proceedings of the National Academy of Sciences 58(1). The first appearance of the term "confidence sequence."

Current applications:

  • Johari, R., Koomen, P., Pekelis, L., and Walsh, D. (2017). "Peeking at A/B Tests: Why it matters, and what to do about it." Proceedings of KDD 2017. Optimizely's adoption of always-valid p-values and confidence sequences for industry A/B testing.
  • Maharaj, K., Williamson, R. J., Mathieu, A., and Williamson, T. (2023). "Anytime-valid inference for multinomial counts and stratified means via betting." Preprint arXiv:2310.19527. Multinomial CSs.

Last reviewed: May 13, 2026
