
Sequential Inference

Anytime-Valid Inference

A framework where statistical guarantees hold simultaneously at every stopping time, not just at a pre-specified sample size. Built on e-processes and Ville's inequality. The decision rule and the stopping rule can both depend on data without inflating Type I error. The technical setting behind continuous A/B-test monitoring, adaptive clinical trials, and rolling LLM evaluations.


Why This Matters

The dominant pattern of modern statistical practice is continuous monitoring. An A/B testing platform shows interim results every hour. A clinical trial has a Data Safety Monitoring Board reviewing accumulating outcomes. An LLM evaluation pipeline ingests new benchmark items as they become available and updates a dashboard. In each case, an analyst is looking at the data many times and is allowed (encouraged, often) to stop early when the evidence is convincing.

Classical fixed-$n$ tests offer no guarantee under this pattern. The $p$-value from a $z$-test at $n = 1000$ controls Type I error only if $n$ was chosen before any data were seen. Peeking at $n = 100, 200, \ldots$ and stopping at the first $n$ where $p \leq 0.05$ inflates the false-positive rate well past $5\%$. The classical fix is to pre-register the look schedule and use Pocock or O'Brien-Fleming boundaries, but those are rigid and not robust to the actual decision rule used by practitioners.
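The inflation is easy to verify by simulation. A minimal sketch (all parameters illustrative): null data, a two-sided $z$-test recomputed after every batch, and a stop at the first look with $p \leq 0.05$.

```python
# Monte Carlo check (illustrative): peeking at a z-test after every batch
# of 100 null observations and stopping at the first p <= 0.05 inflates
# the false-positive rate well past the nominal 5%.
import math
import random

random.seed(0)

def z_p_value(mean, n):
    # Two-sided p-value for H0: mu = 0 with known unit variance.
    z = abs(mean) * math.sqrt(n)
    return 2 * (1 - 0.5 * (1 + math.erf(z / math.sqrt(2))))

def peeks_reject(looks=10, batch=100):
    total, n = 0.0, 0
    for _ in range(looks):
        for _ in range(batch):
            total += random.gauss(0, 1)   # data generated under the null
            n += 1
        if z_p_value(total / n, n) <= 0.05:
            return True                   # stop at the first "significant" look
    return False

trials = 2000
rate = sum(peeks_reject() for _ in range(trials)) / trials
print(f"false-positive rate with 10 peeks: {rate:.3f}")  # well above 0.05
```

With ten looks the realized Type I error is typically three to four times the nominal level; more looks inflate it further.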

Anytime-valid inference is the formal framework that handles this case. It guarantees a single Type I error level $\alpha$ that holds at every stopping time simultaneously, including stopping times that depend on the running data. Built on e-processes and Ville's inequality, the framework recovers classical results in the fixed-$n$ regime and degrades gracefully (with a small power loss) when used at a single fixed sample size.

Formal Setup

Definition

Anytime-valid test

A sequential test of $H_0$ is anytime-valid at level $\alpha$ if there exists a rejection rule based on the running data such that, for every stopping time $\tau$ (possibly random and possibly $\infty$),
$$\sup_{P \in H_0} \Pr_P[\text{reject at } \tau,\ \tau < \infty] \leq \alpha.$$
The standard construction is to reject the first time the e-process $E_t$ exceeds $1/\alpha$.

Definition

Anytime-valid confidence sequence

A sequence of intervals $(\mathrm{CI}_t)$ on a parameter $\theta$ is a $(1 - \alpha)$-anytime-valid confidence sequence (CS) if
$$\Pr_\theta\!\left[\theta \in \mathrm{CI}_t \text{ for every } t\right] \geq 1 - \alpha.$$
Equivalent dual statement: $\Pr_\theta[\exists t : \theta \notin \mathrm{CI}_t] \leq \alpha$. The interval is the inversion of an anytime-valid test of the singleton null $H_0: \theta = \theta_0$, run for every candidate value $\theta_0$.
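The inversion can be made concrete. The sketch below is illustrative, not one of the constructions cited later: it builds a two-sided confidence sequence for a Bernoulli mean by running a simple betting e-process at each candidate value on a grid; the grid, bet size, and data-generating rate are all assumptions.

```python
# Sketch (illustrative): a two-sided (1 - alpha) confidence sequence for a
# Bernoulli mean, obtained by inverting betting e-processes over a grid of
# candidate values theta. A value theta leaves the CS forever once its
# e-process crosses 1/alpha.
import math
import random

random.seed(1)
alpha, lam = 0.05, 0.5                 # level and fixed predictable bet
grid = [i / 100 for i in range(101)]   # candidate values of theta
e_up = {th: 1.0 for th in grid}        # bets that theta is too low
e_dn = {th: 1.0 for th in grid}        # bets that theta is too high

for _ in range(2000):
    x = float(random.random() < 0.3)   # Bernoulli(0.3) observation
    for th in grid:
        e_up[th] *= 1 + lam * (x - th)
        e_dn[th] *= 1 - lam * (x - th)

# Keep every theta whose averaged e-process has not crossed 1/alpha; by
# Ville's inequality the CS covers the true mean at every t simultaneously.
cs = [th for th in grid if 0.5 * (e_up[th] + e_dn[th]) < 1 / alpha]
print(f"CS after 2000 samples: [{min(cs):.2f}, {max(cs):.2f}]")
```

Averaging the two one-sided bets keeps the combined process an e-process (each factor has conditional mean one under $\theta$), which is what yields a two-sided interval.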

Definition

Optional continuation

The dual of optional stopping. The analyst is allowed to continue collecting data past a planned $n$ if the evidence is inconclusive. Like optional stopping, this breaks classical $p$-values but is preserved by e-processes.

The Core Guarantee

The technical content of anytime-valid inference is one application of Ville's inequality.

Theorem

Anytime-Valid Type I Error Control

Statement

Let $(E_t)_{t \geq 0}$ be an e-process for $H_0$, and let $\alpha \in (0, 1)$. The test that rejects $H_0$ at the first time $E_t$ exceeds $1/\alpha$ (and never rejects if no such time exists) has Type I error at most $\alpha$, uniformly over all stopping rules: for every stopping time $\tau$ adapted to the filtration of $(E_t)$,
$$\sup_{P \in H_0} \Pr_P\!\left[\sup_{s \leq \tau} E_s \geq \tfrac{1}{\alpha}\right] \leq \alpha.$$

Intuition

The e-process is a nonnegative supermartingale under the null with $\mathbb{E}_{H_0}[E_0] \leq 1$. Ville's inequality says the probability that such a process ever exceeds a level $c$ is at most $1/c$. Setting $c = 1/\alpha$ delivers the guarantee.

Proof Sketch

Ville's inequality applied to $(E_t)$ under $H_0$: for any $c > 0$, $\Pr_{H_0}(\sup_t E_t \geq c) \leq 1/c$. The supremum over all times dominates the supremum up to any stopping time, so $\Pr_{H_0}(\sup_{s \leq \tau} E_s \geq c) \leq \Pr_{H_0}(\sup_s E_s \geq c) \leq 1/c$. Choose $c = 1/\alpha$.
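The bound can be checked numerically. The sketch below (illustrative throughout) uses the simplest e-process, a likelihood ratio: testing $H_0: X_i \sim N(0,1)$ with the alternative $N(0.5, 1)$ gives the per-step log increment $0.5 X_i - 0.125$, a nonnegative martingale with $E_0 = 1$ under the null.

```python
# Numerical check (illustrative): for a likelihood-ratio e-process under
# the null, the fraction of sample paths that EVER exceed c = 1/alpha
# should stay below alpha, no matter how long we watch each path.
import math
import random

random.seed(2)
alpha, steps, paths = 0.05, 1500, 2000
threshold = math.log(1 / alpha)
crossed = 0
for _ in range(paths):
    log_e = 0.0
    for _ in range(steps):
        x = random.gauss(0, 1)        # data generated under the null
        log_e += 0.5 * x - 0.125      # log dN(0.5,1)/dN(0,1)(x)
        if log_e >= threshold:        # e-process crossed 1/alpha
            crossed += 1
            break
print(f"crossing frequency: {crossed / paths:.3f} (Ville bound: {alpha})")
```

For this continuous-path-like martingale the crossing probability is close to the bound $1/c$, which shows Ville's inequality is essentially tight rather than loose.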

Why It Matters

The supremum is over all stopping times, including stopping times that depend on the running e-process value, on auxiliary signals, on calendar dates, or on the analyst's mood. Type I error control is robust to the entire decision rule, which is what makes the framework useful in messy production settings.

Failure Mode

The e-process must be built from conditional bets adapted to the filtration. Recomputing the bets retroactively, or constructing bets that look ahead, breaks the supermartingale property and invalidates the bound. The user does not have to pre-specify the stopping rule, but they do have to pre-specify the betting strategy (or the alternative model in the likelihood-ratio case).

Stopping Times and Optional Stopping

A stopping time $\tau$ is a random time such that the decision to stop by time $t$ depends only on observations available by time $t$. Formally, $\{\tau \leq t\} \in \mathcal{F}_t$ for every $t$. Examples:

  • Fixed sample size: $\tau = n_0$.
  • First time the running mean crosses a threshold: $\tau = \inf\{t : |\bar X_t - \mu_0| \geq \delta\}$.
  • First time the e-process exceeds the rejection threshold: $\tau = \inf\{t : E_t \geq 1/\alpha\}$.
  • Calendar-based: stop on the last business day of the quarter.

The classical optional stopping theorem (Doob) for nonnegative supermartingales says $\mathbb{E}[E_\tau \mathbf{1}\{\tau < \infty\}] \leq \mathbb{E}[E_0]$ for every stopping time, given enough integrability; for nonnegative supermartingales the integrability is automatic.
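A quick Monte Carlo illustration of the theorem (code and stopping rule are illustrative assumptions): the same kind of likelihood-ratio e-process, evaluated at an aggressive data-dependent stopping time, still has expectation at most one under the null.

```python
# Illustrative check of Doob's optional stopping bound: for a
# likelihood-ratio e-process (N(0.5,1) vs N(0,1)), E[E_tau] <= E[E_0] = 1
# even under an aggressive data-dependent stopping rule.
import math
import random

random.seed(3)

def e_at_stop():
    log_e, total = 0.0, 0.0
    for t in range(1, 501):
        x = random.gauss(0, 1)         # null data
        total += x
        log_e += 0.5 * x - 0.125       # log-likelihood-ratio increment
        if total / t >= 0.3:           # stop when the running mean looks big
            break
    return math.exp(log_e)             # E_tau (tau capped at 500)

paths = 4000
avg = sum(e_at_stop() for _ in range(paths)) / paths
print(f"Monte Carlo E[E_tau] ~= {avg:.3f} (theory: <= 1)")
```

The stopping rule deliberately chases promising-looking data; the expectation bound holds anyway, which is the whole point.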

A common confusion: "optional stopping" in the anytime-valid literature refers to the freedom to stop the experiment whenever you want without inflating Type I error. In classical $p$-value-based testing, optional stopping is a bug. With e-processes, it is a feature.

Optional Continuation, Dashboards, and Online FDR

The dual freedom is the ability to keep collecting data past a planned stop. Three operational consequences:

Dashboards. A dashboard that updates the test statistic and rejection indicator every hour is anytime-valid as long as the underlying object is an e-process. The user can refresh the dashboard as often as they like.

Adaptive sample-size choice. "Continue until the e-value crosses $20$ or two weeks elapse, whichever comes first" is a valid stopping rule. The choice between budget exhaustion and evidence threshold can itself depend on the running e-process.

Online FDR. When a stream of hypothesis tests arrives sequentially and the analyst must decide which to reject without seeing future tests, the LORD (Levels based On Recent Discovery) procedure and its variants (Javanmard-Montanari 2018, Aharoni-Rosset 2014, Tian-Ramdas 2019) control false discovery rate under arbitrary stopping. The e-value version (Wang-Ramdas 2022) handles arbitrary dependence between tests.

Canonical Example: A/B Test With Continuous Peeking

A product team runs a two-arm A/B test with binary conversion outcomes. The platform exposes hourly interim results. The team wants to stop early if the treatment looks decisively better and reject the null hypothesis $H_0: p_T - p_C \leq 0$.

Step 1: Construct an e-process $E_t$ for $H_0$ based on the running conversion-rate differential. Empirical-Bernstein constructions (Howard-Ramdas-McAuliffe-Sekhon 2021) give explicit formulas for bounded outcomes.

Step 2: Define the rejection rule "stop and reject at the first $t$ where $E_t \geq 20$." For $\alpha = 0.05$, $1/\alpha = 20$.

Step 3: Define the budget rule "stop without rejection at $t = T_{\max}$ if the e-process has not crossed $20$ by then."

The combined stopping rule $\tau = \min(\inf\{t : E_t \geq 20\},\, T_{\max})$ is a stopping time. Under $H_0$, Ville's inequality gives Type I error at most $5\%$, uniformly across all possible decision rules consistent with the stopping rule. The analyst can refresh the dashboard arbitrarily often without inflating the false-positive rate.
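The three steps can be sketched end to end. This simulation uses a plain betting e-process rather than the empirical-Bernstein construction cited in Step 1, and every name, rate, and parameter is an illustrative assumption:

```python
# End-to-end sketch of Steps 1-3 with a simple betting e-process on paired
# treatment/control conversions (illustrative parameters throughout).
import random

random.seed(4)
ALPHA, T_MAX = 0.05, 5000
LAM = 0.2                             # betting fraction, fixed in advance

def run_test(p_treat, p_ctrl):
    """Bet on each paired observation; reject when E_t >= 1/ALPHA."""
    e = 1.0
    for t in range(1, T_MAX + 1):
        y_t = float(random.random() < p_treat)   # treatment conversion
        y_c = float(random.random() < p_ctrl)    # control conversion
        d = y_t - y_c                 # in [-1, 1]; E[d] <= 0 under H0
        e *= 1 + LAM * d              # one predictable betting step
        if e >= 1 / ALPHA:            # Step 2: evidence threshold (= 20)
            return "reject", t
    return "no decision", T_MAX       # Step 3: budget exhausted

print(run_test(0.15, 0.10))           # real effect: tends to reject early
print(run_test(0.10, 0.10))           # null: rejects with prob <= 5%
```

The stopping rule here combines the evidence threshold and the budget cap exactly as in the text, and the betting fraction `LAM` is fixed before any data arrive, which is the discipline the guarantee requires.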

The trade-off: at any fixed $t$, the e-value-based interval is wider than the classical $z$-test interval. Typical width inflation is a factor of $\sqrt{\log(1/\alpha)}$ to $\sqrt{\log\log t}$ depending on the construction, the latter being the law-of-the-iterated-logarithm rate. For long experiments, the extra width is small relative to the convenience of monitoring.

Common Misconceptions

Watch Out

Anytime-valid is not always-correct

The guarantee is on the Type I error rate, not on individual decisions. A specific run can produce a Type I error; the framework controls the long-run frequency of Type I errors at level $\alpha$, just like classical hypothesis testing. The improvement over classical $p$-values is robustness to the stopping rule, not to individual-decision randomness.

Watch Out

Power is not lost in general

Anytime-valid tests are conservative at any fixed $n$ relative to the optimal Neyman-Pearson test at that $n$. They are more powerful than fixed-$n$ tests in the practical comparison, where the alternative is "any procedure that needs to monitor." Sample-size-to-detection ratios under standard alternatives are within a $\log$ factor of the Neyman-Pearson optimum, often better than corrected group-sequential boundaries.

Watch Out

Stopping rule does not need to be pre-specified

Group-sequential tests require pre-specified look schedules. Anytime-valid tests do not: the analyst can stop at any time for any reason, as long as the betting strategy (the conditional e-values) is fixed in advance. Pre-registering the betting strategy is the modern analog of pre-registering the analysis.

Worked Exercise

ExerciseAdvanced

Problem

Suppose you run an A/B test with binary outcomes. A simplified betting e-process for testing $H_0: p_T \leq p_C$ with bounded outcomes takes the form
$$E_t = \prod_{s \leq t} (1 + \lambda_s Y_s),$$
where $Y_s \in [-1, 1]$ is the running treatment-control difference of conversion indicators at time $s$, and $\lambda_s \in [0, 1]$ is a predictable betting fraction chosen before observing $Y_s$. Show that this is an e-process if $\lambda_s$ is fixed conditional on $\mathcal{F}_{s-1}$ and $\mathbb{E}_{H_0}[Y_s \mid \mathcal{F}_{s-1}] \leq 0$.
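A solution sketch (assuming the betting fractions are nonnegative, which the final inequality needs): the claim reduces to nonnegativity of each factor plus a one-step conditional bound.

```latex
% Nonnegativity: \lambda_t \in [0,1] and Y_t \in [-1,1] give
% 1 + \lambda_t Y_t \ge 1 - \lambda_t \ge 0, so E_t \ge 0 for all t.
% Supermartingale step (\lambda_t is \mathcal{F}_{t-1}-measurable):
\begin{aligned}
\mathbb{E}_{H_0}\!\left[E_t \mid \mathcal{F}_{t-1}\right]
  &= E_{t-1}\,\mathbb{E}_{H_0}\!\left[1 + \lambda_t Y_t \mid \mathcal{F}_{t-1}\right] \\
  &= E_{t-1}\left(1 + \lambda_t\,\mathbb{E}_{H_0}\!\left[Y_t \mid \mathcal{F}_{t-1}\right]\right)
  \;\le\; E_{t-1},
\end{aligned}
```

since $\lambda_t \geq 0$ and $\mathbb{E}_{H_0}[Y_t \mid \mathcal{F}_{t-1}] \leq 0$. Iterating and taking expectations gives $\mathbb{E}_{H_0}[E_t] \leq \mathbb{E}_{H_0}[E_0] = 1$ for every $t$, so $(E_t)$ is a nonnegative supermartingale starting at one: an e-process.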

Implementation Note

The confseq Python package (companion to Howard et al. 2021) implements these mixture boundaries. The snippet below sketches the intended shape of the API; exact function names and signatures vary by version, so check the package documentation:

from confseq.boundaries import gamma_exponential_log_mixture
e_process = gamma_exponential_log_mixture(
    x_samples,           # observed sample-mean differences
    v_samples,           # cumulative variance estimates
    alpha=0.05,
)
if e_process[-1] >= 1 / 0.05:
    print("Reject H_0 at time", len(x_samples))

For ML evaluation, the peeky and evalsync open-source packages wrap empirical-Bernstein e-processes for benchmark scoring. Both report the running e-value, the calibrated pp-value, and the current confidence sequence after each new item.

The discipline rule that matters most: the betting strategy (or alternative model) must be fixed before each observation. The stopping rule can be anything. In production, this typically means: pre-register the betting fractions or model parameters in code; let the analyst choose when to inspect the dashboard.

References

Canonical:

  • Ramdas, A., Grünwald, P., Vovk, V., and Shafer, G. (2023). "Game-theoretic statistics and safe anytime-valid inference." Statistical Science 38(4), pp. 576-601. Survey paper with anytime-valid inference as the unifying theme.
  • Howard, S. R., Ramdas, A., McAuliffe, J., and Sekhon, J. (2021). "Time-uniform Chernoff bounds via nonnegative supermartingales." Probability Surveys 18, pp. 257-317. Quantitative time-uniform concentration that produces the practical confidence sequences.
  • Johari, R., Koomen, P., Pekelis, L., and Walsh, D. (2017). "Peeking at A/B Tests: Why it matters, and what to do about it." Proceedings of KDD 2017. The Optimizely paper that popularized the framing for industry A/B testing.

Current:

  • Waudby-Smith, I. and Ramdas, A. (2024). "Estimating means of bounded random variables by betting." Journal of the Royal Statistical Society, Series B 86(1), pp. 1-27. State-of-the-art betting construction.
  • Maharaj, K., Williamson, R. J., Mathieu, A., and Williamson, T. (2023). "Anytime-valid inference for multinomial counts and stratified means via betting." Preprint arXiv:2310.19527. Multinomial extensions.
  • Javanmard, A. and Montanari, A. (2018). "Online rules for control of false discovery rate and false discovery exceedance." Annals of Statistics 46(2). LORD-style online FDR with anytime guarantees.

Historical:

  • Ville, J. (1939). Étude critique de la notion de collectif (Gauthier-Villars). The original supermartingale inequality.
  • Robbins, H. (1970). "Statistical methods related to the law of the iterated logarithm." Annals of Mathematical Statistics 41(5). The mixture-method confidence sequences that pre-date the modern e-process literature.


Last reviewed: May 13, 2026
