Sequential Inference
Confidence Sequences
A sequence of intervals on a parameter that contain the true value uniformly over time. Built by inverting an e-process: the interval is the set of parameter values for which the e-process never crosses the rejection threshold. The result is a live interval, valid at every sample size and narrowing as data accrue, with no need to pre-specify the stopping rule.
Why This Matters
A classical confidence interval at a fixed sample size $n$ guarantees coverage $P(\theta \in C_n) \ge 1 - \alpha$. The guarantee holds only when $n$ was pre-specified. Recompute the interval after each new observation and stop when it excludes a target value, and the empirical coverage drops below $1 - \alpha$ in repeated sampling.
A confidence sequence (CS) is the anytime-valid replacement: a sequence of intervals $(C_t)_{t \ge 1}$ such that the true parameter lies in $C_t$ for every $t$ simultaneously with probability at least $1 - \alpha$: $P(\forall t \ge 1 : \theta \in C_t) \ge 1 - \alpha$. The dual statement is that the event "the interval misses the truth at any time" has probability at most $\alpha$.
The construction is the test-inversion of an anytime-valid test. For each candidate $\theta_0$, build an e-process $E_t(\theta_0)$ for the null $H_{\theta_0} : \theta = \theta_0$; the CS at time $t$ is the set of $\theta_0$ for which $E_t(\theta_0) < 1/\alpha$. Ville's inequality guarantees the time-uniform coverage. The result is a live interval, narrowing as data accrue, that the analyst can read off a dashboard at any time without invalidating the guarantee.
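As a concrete sketch, the inversion can be run over a grid of candidate means for bounded data, using a simple fixed-bet Hoeffding e-process. Everything here — the grid size, the bet $\lambda = 0.8$, the averaging of the up/down bets (which keeps the mixture a single e-process at level $\alpha$) — is an illustrative choice, not the optimized construction from the literature:

```python
import math

def log_e_process(xs, m, lam=0.8):
    # Average of two Hoeffding e-processes (betting up and down against
    # candidate mean m).  Each factor exp(lam*(x - m) - lam^2/8) has
    # expectation <= 1 for X in [0, 1] with mean m (Hoeffding's lemma),
    # so the average is an e-process and Ville's inequality applies.
    up = dn = 0.0
    for x in xs:
        up += lam * (x - m) - lam**2 / 8
        dn += -lam * (x - m) - lam**2 / 8
    top = max(up, dn)
    return top + math.log((math.exp(up - top) + math.exp(dn - top)) / 2)

def confidence_set(xs, alpha=0.05, grid=200):
    # Invert the test: keep every candidate mean whose e-process is
    # still below 1/alpha.  (We report the hull of the kept grid.)
    kept = [k / grid for k in range(grid + 1)
            if log_e_process(xs, k / grid) < math.log(1 / alpha)]
    return min(kept), max(kept)

xs = [1, 0, 1, 1, 0, 1, 1, 0, 1, 1] * 20    # 200 outcomes with mean 0.7
lo, hi = confidence_set(xs)
print(f"CS at t=200: [{lo:.3f}, {hi:.3f}]")
```

Tighter e-processes (empirical-Bernstein, betting) plug into the same inversion loop unchanged; only `log_e_process` changes.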
Applications are everywhere continuous monitoring of an estimate matters: rolling A/B-test point estimates, online conversion-rate tracking, real-time RLHF reward calibration, sequential clinical-trial effect-size estimates, and live LLM-benchmark accuracy intervals.
Formal Setup
Confidence Sequence
Let $\theta \in \Theta$ be a parameter of interest and let $X_1, X_2, \dots$ be the data stream. A sequence of (data-dependent) sets $C_t = C_t(X_1, \dots, X_t) \subseteq \Theta$ is a $(1-\alpha)$-confidence sequence for $\theta$ if $P(\forall t \ge 1 : \theta \in C_t) \ge 1 - \alpha$. Equivalently, $P(\exists t \ge 1 : \theta \notin C_t) \le \alpha$. By Ville's inequality, this holds when $C_t$ is the inversion of an anytime-valid level-$\alpha$ test for the singleton null $H_{\theta_0} : \theta = \theta_0$.
Time-uniform coverage
The defining property: a single random event "the interval covers the truth at every time $t$" has probability at least $1 - \alpha$. Stronger than per-time coverage ($P(\theta \in C_t) \ge 1 - \alpha$ for each fixed $t$), which can hold for sequences that nonetheless fail to cover at some time with high probability.
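A small Monte Carlo makes the distinction concrete: recompute a fixed-$n$ Wald interval after every observation (the "peeking" protocol) and the per-time 95% coverage does not prevent the interval from missing the truth at some time. The simulation below is a sketch under N(0,1) data with known variance:

```python
import math, random

random.seed(0)

def ever_misses(n_steps, z=1.96):
    # One run: stream N(0,1) data with true mean 0, recompute the
    # fixed-n 95% Wald CI after every observation from t=10 on, and
    # record whether the CI EVER excludes the true mean.  Per-time
    # coverage says nothing about this time-uniform event.
    s = 0.0
    for t in range(1, n_steps + 1):
        s += random.gauss(0, 1)
        if t >= 10 and abs(s / t) > z / math.sqrt(t):
            return True
    return False

runs = 400
miss = sum(ever_misses(1000) for _ in range(runs)) / runs
print(f"fraction of runs where the Wald CI ever misses by t=1000: {miss:.2f}")
print("nominal per-time miss rate: 0.05")
```

The time-uniform miss rate comes out several times larger than the nominal 5%; a confidence sequence caps it at 5% by construction.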
Empirical Bernstein construction
A confidence sequence for the mean of a bounded random variable built from a running empirical-Bernstein concentration inequality (Howard, Ramdas, McAuliffe, Sekhon 2021). The radius shrinks like $O\!\left(\sqrt{\log\log t / t}\right)$, the law-of-the-iterated-logarithm rate, which is optimal up to constants for distribution-free constructions.
Construction by Test Inversion
The standard construction inverts an anytime-valid test.
Confidence Sequence via E-Process Inversion
Statement
Let $\{H_\theta : \theta \in \Theta\}$ be a family of point hypotheses indexed by parameter values. For each $\theta$, let $E_t(\theta)$ be an e-process for $H_\theta$. Define $C_t = \{\theta \in \Theta : E_t(\theta) < 1/\alpha\}$. Then $(C_t)_{t \ge 1}$ is a $(1-\alpha)$-confidence sequence for the true parameter: $P(\forall t \ge 1 : \theta^* \in C_t) \ge 1 - \alpha$, where $\theta^*$ denotes the true parameter value.
Intuition
For each parameter value $\theta$, the e-process $E_t(\theta)$ exceeds $1/\alpha$ at some time with probability at most $\alpha$ under $H_\theta$, by Ville's inequality. The set of $\theta$ for which the e-process has not yet exceeded $1/\alpha$ contains the true parameter at every time, except on the (probability at most $\alpha$) failure event.
Proof Sketch
Under $H_{\theta^*}$ (the true parameter), the process $E_t(\theta^*)$ is an e-process for $H_{\theta^*}$. By Ville's inequality, $P(\exists t : E_t(\theta^*) \ge 1/\alpha) \le \alpha$. The complementary event is $\{\forall t : E_t(\theta^*) < 1/\alpha\}$, which means $\theta^* \in C_t$ for every $t$. Hence the probability of "$\theta^* \in C_t$ for every $t$" is at least $1 - \alpha$.
Why It Matters
Every modern confidence sequence is some form of e-process inversion. The choice of e-process determines the width: empirical-Bernstein gives the law-of-iterated-logarithm rate, mixture methods (Robbins 1970) give parameter-adaptive rates, betting constructions (Waudby-Smith-Ramdas 2024) give the tightest practical intervals for bounded outcomes.
Failure Mode
The construction requires an e-process for every candidate $\theta$, not just for the true one. In practice this is achieved by parameterizing a family of betting strategies indexed by $\theta$. If the e-process family is not jointly valid (for example, if the bets depend on the true $\theta^*$ rather than the test candidate $\theta$), the inversion is incorrect.
Width and the Law of the Iterated Logarithm
The width of a confidence sequence at time $t$ scales differently from a classical CI:

| Construction | Width at time $t$ |
|---|---|
| Classical $z$-test CI (fixed $n = t$) | $O\!\left(\sqrt{1/t}\right)$ |
| Hoeffding CS (anytime, bounded) | $O\!\left(\sqrt{\log(t/\alpha)/t}\right)$ |
| Empirical-Bernstein CS | $O\!\left(\sqrt{\log\log t / t}\right)$ |
| Robbins mixture CS (Gaussian) | $O\!\left(\sqrt{\log t / t}\right)$ |
The $\sqrt{\log\log t / t}$ rate matches the law of the iterated logarithm, the asymptotic rate at which the running sample mean fluctuates around the true mean. It is the best possible for a distribution-free anytime-valid construction.
Quantitatively, for Bernoulli outcomes with $p = 1/2$ (so $\sigma^2 = 1/4$), an empirical-Bernstein CS at $\alpha = 0.05$ has half-width approximately $\sqrt{2\sigma^2\,(\log\log t + \log(1/\alpha))/t}$ for large $t$. At $t = 10^4$, the inflation over the classical $z$-CI is roughly a factor of $1.6$; at $t = 10^6$, the factor is about $1.7$; the factor changes only slowly because $\log\log t$ grows slowly.
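The rates in the table can be compared numerically. The constants below are illustrative placeholders, not the optimized ones from the cited papers; they are chosen only to show how the three rates separate as $t$ grows:

```python
import math

def half_widths(t, sigma2=0.25, alpha=0.05):
    # Order-of-magnitude half-widths for the constructions in the table.
    # Constants are illustrative, not the papers' optimized ones.
    wald = 1.96 * math.sqrt(sigma2 / t)                              # ~sqrt(1/t)
    hoeffding = math.sqrt(2 * sigma2 * math.log(2 * t / alpha) / t)  # ~sqrt(log t / t)
    eb = math.sqrt(2 * sigma2 *
                   (math.log(math.log(t)) + math.log(1 / alpha)) / t)  # ~sqrt(loglog t / t)
    return wald, hoeffding, eb

for t in (10**3, 10**4, 10**6):
    w, h, e = half_widths(t)
    print(f"t={t:>9,}  fixed-n={w:.5f}  hoeffding-CS={h:.5f}  "
          f"EB-CS={e:.5f}  EB/fixed={e/w:.2f}")
```

The $\log t$ versus $\log\log t$ gap is visible already at moderate $t$: the Hoeffding-style width stays well above the empirical-Bernstein width, which in turn pays only a modest constant factor over the fixed-$n$ interval.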
Canonical Example: Live Conversion-Rate Interval
An A/B platform monitors conversion rate for a treatment variant. Outcomes arrive every minute as Bernoulli samples. The platform displays a running point estimate and a confidence sequence around it.
The empirical-Bernstein construction (Howard-Ramdas-McAuliffe-Sekhon 2021), in simplified form: $C_t = \hat\mu_t \pm \sqrt{\frac{2\hat\sigma_t^2\,(\log\log t + \log(1/\alpha))}{t}}$, where $\hat\sigma_t^2$ is the running sample variance. The interval is wider than the classical Wald interval by a factor that depends on $t$ and $\alpha$.
Live example. At $t = 5000$ with $\hat\mu_t = 0.04$, $\hat\sigma_t^2 = \hat\mu_t(1 - \hat\mu_t) \approx 0.038$:
- Classical Wald CI: $0.040 \pm 0.0054$, width $\approx 0.011$.
- Simplified empirical-Bernstein CS: $0.040 \pm 0.0089$, width $\approx 0.018$.
The construction in the paper has tighter constants and additional optimization; the simplified bound here illustrates the structure rather than the exact numbers. A correctly tuned empirical-Bernstein CS at $t = 5000$, $\alpha = 0.05$ gives a half-width around $1.5\times$ to $2\times$ the Wald half-width. The trade is the anytime guarantee.
Sequential Mean Estimation
For iid bounded outcomes $X_i \in [0, 1]$, the canonical CS is the betting construction (Waudby-Smith and Ramdas 2024). At time $t$, the CS is the set of candidate means $m$ for which the running e-process has not yet exceeded $1/\alpha$. The e-process is parameterized by a predictable betting fraction $\lambda_i(m)$: $E_t(m) = \prod_{i=1}^{t}\bigl(1 + \lambda_i(m)\,(X_i - m)\bigr)$. Choosing $\lambda$ to maximize the expected log-payoff against a small neighborhood of the true mean gives the tightest intervals. Modern implementations choose $\lambda_i$ adaptively from running variance estimates.
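A minimal sketch of the betting construction, using a fixed, conservative bet rather than the paper's adaptive $\lambda$; the grid, the clipping constants, and the bet size are all illustrative assumptions:

```python
import math

def betting_cs(xs, alpha=0.05, grid=100):
    # Sketch of a betting confidence sequence for a bounded mean.  For
    # each candidate mean m, wealth_t(m) = prod(1 + lam*(x_i - m)); m
    # stays in the CS while the wealth has never reached 1/alpha.
    # (Testing both directions at 1/alpha gives total error <= 2*alpha;
    # use alpha/2 per side for an exact level-alpha set.)
    log_thresh = math.log(1 / alpha)
    alive = []
    for k in range(grid + 1):
        m = k / grid
        # fixed predictable bets, clipped so 1 + lam*(x - m) stays positive
        lam_up = min(0.5, 0.5 / max(m, 1e-3))      # bet the mean exceeds m
        lam_dn = min(0.5, 0.5 / max(1 - m, 1e-3))  # bet the mean is below m
        log_up = log_dn = 0.0
        crossed = False
        for x in xs:
            log_up += math.log(1 + lam_up * (x - m))
            log_dn += math.log(1 - lam_dn * (x - m))
            if max(log_up, log_dn) >= log_thresh:
                crossed = True
                break
        if not crossed:
            alive.append(m)
    # report the hull; the exact set may have small gaps on a coarse grid
    return min(alive), max(alive)

xs = ([1] * 7 + [0] * 3) * 30              # 300 outcomes with mean 0.7
lo, hi = betting_cs(xs)
print(f"betting CS at t=300: [{lo:.2f}, {hi:.2f}]")
```

Replacing the fixed bet with a variance-adaptive $\lambda_i$ is what tightens this toward the published intervals.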
For Gaussian outcomes, the mixture construction (Robbins 1970, Howard et al. 2021) gives closed-form CSs of the form $\hat\mu_t \pm \frac{\sigma}{t}\sqrt{(t + \rho)\,\log\!\left(\frac{t + \rho}{\rho\,\alpha^2}\right)}$ for a tuning parameter $\rho > 0$. For unknown $\sigma$, plug in the running sample SD; the time-uniform validity holds with a small width correction.
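A sketch of the resulting radius, assuming known $\sigma$ and using one common form of the two-sided normal-mixture boundary; the choice $\rho = 10$ is an illustrative assumption:

```python
import math

def normal_mixture_radius(t, sigma=1.0, rho=10.0, alpha=0.05):
    # One common form of the two-sided normal-mixture radius for the
    # mean of N(mu, sigma^2) data; rho > 0 tunes where the boundary is
    # tightest (small rho favors small t, large rho favors large t):
    #   (sigma / t) * sqrt((t + rho) * log((t + rho) / (rho * alpha^2)))
    return (sigma / t) * math.sqrt(
        (t + rho) * math.log((t + rho) / (rho * alpha**2)))

for t in (100, 1000, 10000):
    print(f"t={t:>6}  mixture CS ±{normal_mixture_radius(t):.4f}"
          f"   fixed-n CI ±{1.96 / math.sqrt(t):.4f}")
```

The mixture radius is uniformly wider than the fixed-$n$ interval — the price of validity at every $t$ simultaneously.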
Worked Exercise
Problem
A live conversion-rate experiment has Bernoulli outcomes $X_i \overset{iid}{\sim} \mathrm{Bernoulli}(p)$ with unknown $p$. After $t = 5000$ observations the running mean is $\hat\mu_t = 0.04$ with running sample variance $\hat\sigma_t^2 \approx 0.038$. Compute a 95% Wald confidence interval and contrast with a (simplified) empirical-Bernstein confidence sequence. At what time $t$ does the empirical-Bernstein half-width equal the fixed-$n$ Wald half-width at $n = 5000$?
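A solution sketch under the simplified half-width $\sqrt{2\hat\sigma_t^2(\log\log t + \log(1/\alpha))/t}$; the exact crossover time depends on the constants of the bound actually used:

```python
import math

# Exercise solution sketch.  Numbers (t=5000, muhat=0.04) follow the
# running conversion-rate example; the simplified EB half-width below is
# illustrative, not the paper's optimized bound.
alpha = 0.05
var = 0.04 * (1 - 0.04)                     # running Bernoulli variance

def eb_hw(t):
    return math.sqrt(2 * var * (math.log(math.log(t)) + math.log(1 / alpha)) / t)

wald_hw = 1.96 * math.sqrt(var / 5000)      # fixed-n Wald half-width
print(f"Wald half-width at n=5000: {wald_hw:.4f}")
print(f"EB   half-width at t=5000: {eb_hw(5000):.4f}")

# eb_hw is decreasing on this range, so bisect for eb_hw(t) = wald_hw:
# the CS needs roughly 2.7x the samples to match the fixed-n width.
lo, hi = 5000, 100_000
while hi - lo > 1:
    mid = (lo + hi) // 2
    if eb_hw(mid) > wald_hw:
        lo = mid
    else:
        hi = mid
print(f"EB half-width matches the n=5000 Wald half-width at t ≈ {hi:,}")
```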
Implementation Note
The confseq Python package (Howard, Ramdas, McAuliffe, Sekhon 2021) ships ready-made empirical-Bernstein CSs for bounded outcomes and mixture-method CSs for Gaussian outcomes:
```python
from confseq.boundaries import empirical_bernstein_ci

lower, upper = empirical_bernstein_ci(
    x_samples,   # observed bounded outcomes in [0, 1]
    alpha=0.05,
)
print(f"At t={len(x_samples)}, CS = [{lower:.4f}, {upper:.4f}]")
```
For the bounded-outcome betting construction (Waudby-Smith-Ramdas 2024), the same package exposes betting_ci. The betting fraction $\lambda_t$ is chosen adaptively from the running variance estimate; the user supplies only the data stream and $\alpha$.
Operational rules:
- The CS must be recomputed from scratch after each new observation, not incrementally updated by a one-line patch. Caching the previous CS bounds is a common bug.
- Width contracts monotonically only in expectation; per-trial, a new observation can widen the CS slightly if it moves the running variance estimate. Plot the CS as a band over time; the band contains the true mean at every time with probability at least $1 - \alpha$.
- For two-sample tests (treatment vs control), build a CS on the difference of means rather than two separate CSs on each.
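For the paired case, the last rule reduces to a one-dimensional problem: form per-pair differences $D_i = T_i - C_i \in [-1, 1]$, rescale to $[0, 1]$, and run any bounded-mean CS on the result. A sketch using the simplified empirical-Bernstein radius (constants illustrative, not the optimized bound):

```python
import math

def diff_cs(treat, ctrl, alpha=0.05):
    # CS on the treatment-minus-control mean difference via pairing:
    # D_i = T_i - C_i lies in [-1, 1]; rescale to [0, 1], run a
    # (simplified) empirical-Bernstein CS on the mean, un-rescale.
    d = [(t - c + 1) / 2 for t, c in zip(treat, ctrl)]  # now in [0, 1]
    n = len(d)
    mu = sum(d) / n
    var = sum((x - mu) ** 2 for x in d) / n             # running variance
    r = math.sqrt(2 * var * (math.log(math.log(n)) + math.log(1 / alpha)) / n)
    return 2 * (mu - r) - 1, 2 * (mu + r) - 1           # back to diff scale

treat = [1, 1, 0, 1, 1, 0, 1, 1] * 250    # 2000 outcomes, mean 0.75
ctrl  = [1, 0, 0, 1, 0, 0, 1, 0] * 250    # 2000 outcomes, mean 0.375
lo, hi = diff_cs(treat, ctrl)
print(f"CS on the lift: [{lo:.3f}, {hi:.3f}]")
```

A single CS on the difference avoids the union-bound widening that two separate per-arm CSs would require.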
Practical Example: LLM Benchmark Accuracy Tracking
An LLM evaluation pipeline runs an MMLU-style benchmark on a stream of items. After each item, the system computes a running CS on the model's accuracy. The pipeline stops when the CS half-width drops below a target precision (say, $\pm 0.01$) or when the budget is exhausted.
The procedure:
- Each item produces a binary correctness outcome $X_i \in \{0, 1\}$.
- The empirical-Bernstein CS gives $\hat\mu_t \pm r_t$, where the radius $r_t$ shrinks at the LIL rate.
- The dashboard displays $\hat\mu_t \pm r_t$ live, and the stopping rule "stop when $r_t \le \epsilon$ or $t \ge t_{\max}$" is anytime-valid.
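The stopping loop above can be sketched as follows; the eb_radius helper, the 100-item warm-up, and the hypothetical 85%-accurate model are all illustrative assumptions:

```python
import math, random

random.seed(7)

def eb_radius(t, var, alpha=0.05):
    # simplified empirical-Bernstein radius; constants are illustrative
    return math.sqrt(
        2 * var * (math.log(math.log(max(t, 3))) + math.log(1 / alpha)) / t)

def run_benchmark(item_stream, eps=0.01, t_max=50_000):
    # Stream binary correctness outcomes and stop as soon as the CS
    # half-width drops below eps (or the budget t_max is exhausted).
    # Because the CS is time-uniform, peeking at it after every item to
    # decide when to stop does not break the coverage guarantee.
    s = 0.0
    for t, x in enumerate(item_stream, start=1):
        s += x
        mu = s / t
        var = max(mu * (1 - mu), 1e-4)      # running Bernoulli variance
        r = eb_radius(t, var)
        if t >= 100 and (r <= eps or t >= t_max):
            return t, mu, r
    return t, mu, r

# hypothetical model that answers items correctly 85% of the time
stream = (1 if random.random() < 0.85 else 0 for _ in range(50_000))
t_stop, acc, r = run_benchmark(stream)
print(f"stopped at t={t_stop}, accuracy ≈ {acc:.3f} ± {r:.3f}")
```

An easier benchmark (accuracy nearer 0 or 1) has smaller running variance and stops sooner, which is exactly the adaptive-sample-size behavior described below.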
The OpenAI Evals project and the EleutherAI evaluation pipeline have both begun adopting confidence-sequence-based stopping for adaptive benchmark sizing. The classical alternative is to pre-specify a sample size based on a power calculation; the CS-based approach uses fewer samples for easy benchmarks (early stop) and more for hard ones, all at uniform coverage.
References
Canonical:
- Howard, S. R., Ramdas, A., McAuliffe, J., and Sekhon, J. (2021). "Time-uniform, nonparametric, nonasymptotic confidence sequences." Annals of Statistics 49(2), pp. 1055-1080. The reference paper for empirical-Bernstein confidence sequences, with explicit constants and practical bounded-outcome CSs.
- Waudby-Smith, I. and Ramdas, A. (2024). "Estimating means of bounded random variables by betting." Journal of the Royal Statistical Society, Series B 86(1), pp. 1-27. The betting construction with the tightest known intervals for bounded outcomes.
- Ramdas, A., Grünwald, P., Vovk, V., and Shafer, G. (2023). "Game-theoretic statistics and safe anytime-valid inference." Statistical Science 38(4), pp. 576-601. Section 5 covers confidence sequences in the broader anytime-valid framework.
Historical:
- Robbins, H. (1970). "Statistical methods related to the law of the iterated logarithm." Annals of Mathematical Statistics 41(5), pp. 1397-1409. The original mixture-method confidence sequences for Gaussian means.
- Darling, D. A. and Robbins, H. (1967). "Confidence sequences for mean, variance, and median." Proceedings of the National Academy of Sciences 58(1). The first appearance of the term "confidence sequence."
Current applications:
- Johari, R., Koomen, P., Pekelis, L., and Walsh, D. (2017). "Peeking at A/B Tests: Why it matters, and what to do about it." Proceedings of KDD 2017. Optimizely's adoption of always-valid p-values and confidence sequences for industry A/B testing.
- Maharaj, K., Williamson, R. J., Mathieu, A., and Williamson, T. (2023). "Anytime-valid inference for multinomial counts and stratified means via betting." Preprint arXiv:2310.19527. Multinomial CSs.
Next Topics
- Anytime-valid inference: the framing of inference under continuous monitoring.
- Safe testing: the test-side framework that complements confidence sequences.
- E-processes: the underlying object that powers every CS via inversion.
- E-values: the single-shot version of the evidence statistic.
- E-values and anytime-valid inference: the umbrella reference with proofs and applications.
Last reviewed: May 13, 2026
Canonical graph
Required before and derived from this topic
These links come from prerequisite edges in the curriculum graph. Editorial suggestions are shown here only when the target page also cites this page as a prerequisite.
Required prerequisites
- Modes of Convergence of Random Variables (layer 0B, tier 1)
- Concentration Inequalities (layer 1, tier 1)
- Bernstein Inequality (layer 2, tier 1)
- e-values (layer 2, tier 1)
- Anytime-Valid Inference (layer 3, tier 1)
Derived topics
- No published topic currently declares this as a prerequisite.