
Sequential Inference

Anytime-Valid Inference

A framework where statistical guarantees hold simultaneously at every stopping time, not just at a pre-specified sample size. Built on e-processes and Ville's inequality. The decision rule and the stopping rule can both depend on data without inflating Type I error. The technical setting behind continuous A/B-test monitoring, adaptive clinical trials, and rolling LLM evaluations.


Why This Matters

The dominant pattern of modern statistical practice is continuous monitoring. An A/B testing platform shows interim results every hour. A clinical trial has a Data Safety Monitoring Board reviewing accumulating outcomes. An LLM evaluation pipeline ingests new benchmark items as they become available and updates a dashboard. In each case, an analyst is looking at the data many times and is allowed (encouraged, often) to stop early when the evidence is convincing.

Classical fixed-$n$ tests offer no guarantee under this pattern. The $p$-value from a $z$-test at $n = 1000$ controls Type I error only if $n$ was chosen before any data were seen. Peeking at $n = 100, 200, \ldots$ and stopping at the first $n$ where $p \leq 0.05$ inflates the false-positive rate well past $5\%$. The classical fix is to pre-register the look schedule and use Pocock or O'Brien-Fleming boundaries, but those are rigid and not robust to the actual decision rule used by practitioners.
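The inflation is easy to verify by simulation. A minimal sketch (all parameters illustrative): null data, a two-sided $z$-test recomputed after every batch, and a stop at the first look with $p \leq 0.05$.

```python
# Monte Carlo check (illustrative): peeking at a z-test after every batch
# of 100 null observations and stopping at the first p <= 0.05 inflates
# the false-positive rate well past the nominal 5%.
import math
import random

random.seed(0)

def z_p_value(mean, n):
    # Two-sided p-value for H0: mu = 0 with known unit variance.
    z = abs(mean) * math.sqrt(n)
    return 2 * (1 - 0.5 * (1 + math.erf(z / math.sqrt(2))))

def peeks_reject(looks=10, batch=100):
    total, n = 0.0, 0
    for _ in range(looks):
        for _ in range(batch):
            total += random.gauss(0, 1)   # data generated under the null
            n += 1
        if z_p_value(total / n, n) <= 0.05:
            return True                   # stop at the first "significant" look
    return False

trials = 2000
rate = sum(peeks_reject() for _ in range(trials)) / trials
print(f"false-positive rate with 10 peeks: {rate:.3f}")  # well above 0.05
```

With ten looks the realized Type I error is typically three to four times the nominal level; more looks inflate it further.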

Anytime-valid inference is the formal framework that handles this case. It guarantees a single Type I error level $\alpha$ that holds at every stopping time simultaneously, including stopping times that depend on the running data. Built on e-processes and Ville's inequality, the framework recovers classical results in the fixed-$n$ regime and degrades gracefully (with a small power loss) when used at a single fixed sample size.

Formal Setup

Definition

Anytime-valid test

A sequential test of $H_0$ is anytime-valid at level $\alpha$ if there exists a rejection rule based on the running data such that, for every stopping time $\tau$ (possibly random and possibly $\infty$),
$$\sup_{P \in H_0} \Pr_P[\text{reject at } \tau,\ \tau < \infty] \leq \alpha.$$
The standard construction is to reject the first time the e-process $E_t$ exceeds $1/\alpha$.

Definition

Anytime-valid confidence sequence

A sequence of intervals $(\mathrm{CI}_t)$ on a parameter $\theta$ is a $(1 - \alpha)$-anytime-valid confidence sequence (CS) if
$$\Pr_\theta\!\left[\theta \in \mathrm{CI}_t \text{ for every } t\right] \geq 1 - \alpha.$$
Equivalent dual statement: $\Pr_\theta[\exists t : \theta \notin \mathrm{CI}_t] \leq \alpha$. The interval is the inversion of an anytime-valid test of the singleton null $H_0: \theta = \theta_0$, run for every candidate value $\theta_0$.
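The inversion can be made concrete. The sketch below is illustrative, not one of the constructions cited later: it builds a two-sided confidence sequence for a Bernoulli mean by running a simple betting e-process at each candidate value on a grid; the grid, bet size, and data-generating rate are all assumptions.

```python
# Sketch (illustrative): a two-sided (1 - alpha) confidence sequence for a
# Bernoulli mean, obtained by inverting betting e-processes over a grid of
# candidate values theta. A value theta leaves the CS forever once its
# e-process crosses 1/alpha.
import math
import random

random.seed(1)
alpha, lam = 0.05, 0.5                 # level and fixed predictable bet
grid = [i / 100 for i in range(101)]   # candidate values of theta
e_up = {th: 1.0 for th in grid}        # bets that theta is too low
e_dn = {th: 1.0 for th in grid}        # bets that theta is too high

for _ in range(2000):
    x = float(random.random() < 0.3)   # Bernoulli(0.3) observation
    for th in grid:
        e_up[th] *= 1 + lam * (x - th)
        e_dn[th] *= 1 - lam * (x - th)

# Keep every theta whose averaged e-process has not crossed 1/alpha; by
# Ville's inequality the CS covers the true mean at every t simultaneously.
cs = [th for th in grid if 0.5 * (e_up[th] + e_dn[th]) < 1 / alpha]
print(f"CS after 2000 samples: [{min(cs):.2f}, {max(cs):.2f}]")
```

Averaging the two one-sided bets keeps the combined process an e-process (each factor has conditional mean one under $\theta$), which is what yields a two-sided interval.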

Definition

Optional continuation

The dual of optional stopping. The analyst is allowed to continue collecting data past a planned $n$ if the evidence is inconclusive. Like optional stopping, this breaks classical $p$-values but is preserved by e-processes.

The Core Guarantee

The technical content of anytime-valid inference is one application of Ville's inequality.

Theorem

Anytime-Valid Type I Error Control

Statement

Let $(E_t)_{t \geq 0}$ be an e-process for $H_0$, and let $\alpha \in (0, 1)$. The test that rejects $H_0$ at the first time $E_t$ exceeds $1/\alpha$ (and never rejects if no such time exists) has Type I error at most $\alpha$, uniformly over all stopping rules: for every stopping time $\tau$ adapted to the filtration of $(E_t)$,
$$\sup_{P \in H_0} \Pr_P\!\left[\sup_{s \leq \tau} E_s \geq \tfrac{1}{\alpha}\right] \leq \alpha.$$

Intuition

The e-process is a nonnegative supermartingale under the null with $\mathbb{E}_{H_0}[E_0] \leq 1$. Ville's inequality says the probability that such a process ever exceeds a level $c$ is at most $1/c$. Setting $c = 1/\alpha$ delivers the guarantee.

Proof Sketch

Ville's inequality applied to $(E_t)$ under $H_0$: for any $c > 0$, $\Pr_{H_0}(\sup_t E_t \geq c) \leq 1/c$. The supremum over all times dominates the supremum up to any stopping time, so $\Pr_{H_0}(\sup_{s \leq \tau} E_s \geq c) \leq \Pr_{H_0}(\sup_s E_s \geq c) \leq 1/c$. Choose $c = 1/\alpha$.
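The bound can be checked numerically. The sketch below (illustrative throughout) uses the simplest e-process, a likelihood ratio: testing $H_0: X_i \sim N(0,1)$ with the alternative $N(0.5, 1)$ gives the per-step log increment $0.5 X_i - 0.125$, a nonnegative martingale with $E_0 = 1$ under the null.

```python
# Numerical check (illustrative): for a likelihood-ratio e-process under
# the null, the fraction of sample paths that EVER exceed c = 1/alpha
# should stay below alpha, no matter how long we watch each path.
import math
import random

random.seed(2)
alpha, steps, paths = 0.05, 1500, 2000
threshold = math.log(1 / alpha)
crossed = 0
for _ in range(paths):
    log_e = 0.0
    for _ in range(steps):
        x = random.gauss(0, 1)        # data generated under the null
        log_e += 0.5 * x - 0.125      # log dN(0.5,1)/dN(0,1)(x)
        if log_e >= threshold:        # e-process crossed 1/alpha
            crossed += 1
            break
print(f"crossing frequency: {crossed / paths:.3f} (Ville bound: {alpha})")
```

For this continuous-path-like martingale the crossing probability is close to the bound $1/c$, which shows Ville's inequality is essentially tight rather than loose.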

Why It Matters

The supremum is over all stopping times, including stopping times that depend on the running e-process value, on auxiliary signals, on calendar dates, or on the analyst's mood. Type I error control is robust to the entire decision rule, which is what makes the framework useful in messy production settings.

Failure Mode

The e-process must be built from conditional bets adapted to the filtration. Recomputing the bets retroactively, or constructing bets that look ahead, breaks the supermartingale property and invalidates the bound. The user does not have to pre-specify the stopping rule, but they do have to pre-specify the betting strategy (or the alternative model in the likelihood-ratio case).

Stopping Times and Optional Stopping

A stopping time $\tau$ is a random time such that the decision to stop by time $t$ depends only on observations available by time $t$. Formally, $\{\tau \leq t\} \in \mathcal{F}_t$ for every $t$. Examples:

  • Fixed sample size: $\tau = n_0$.
  • First time the running mean crosses a threshold: $\tau = \inf\{t : |\bar X_t - \mu_0| \geq \delta\}$.
  • First time the e-process exceeds the rejection threshold: $\tau = \inf\{t : E_t \geq 1/\alpha\}$.
  • Calendar-based: stop on the last business day of the quarter.

The classical optional stopping theorem (Doob) for nonnegative supermartingales says $\mathbb{E}[E_\tau \mathbf{1}\{\tau < \infty\}] \leq \mathbb{E}[E_0]$ for every stopping time, given enough integrability; for nonnegative supermartingales the integrability is automatic.
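A quick Monte Carlo illustration of the theorem (code and stopping rule are illustrative assumptions): the same kind of likelihood-ratio e-process, evaluated at an aggressive data-dependent stopping time, still has expectation at most one under the null.

```python
# Illustrative check of Doob's optional stopping bound: for a
# likelihood-ratio e-process (N(0.5,1) vs N(0,1)), E[E_tau] <= E[E_0] = 1
# even under an aggressive data-dependent stopping rule.
import math
import random

random.seed(3)

def e_at_stop():
    log_e, total = 0.0, 0.0
    for t in range(1, 501):
        x = random.gauss(0, 1)         # null data
        total += x
        log_e += 0.5 * x - 0.125       # log-likelihood-ratio increment
        if total / t >= 0.3:           # stop when the running mean looks big
            break
    return math.exp(log_e)             # E_tau (tau capped at 500)

paths = 4000
avg = sum(e_at_stop() for _ in range(paths)) / paths
print(f"Monte Carlo E[E_tau] ~= {avg:.3f} (theory: <= 1)")
```

The stopping rule deliberately chases promising-looking data; the expectation bound holds anyway, which is the whole point.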

A common confusion: "optional stopping" in the anytime-valid literature refers to the freedom to stop the experiment whenever you want without inflating Type I error. In classical $p$-value-based testing, optional stopping is a bug. With e-processes, it is a feature.

Optional Continuation, Dashboards, and Online FDR

The dual freedom is the ability to keep collecting data past a planned stop. Three operational consequences:

Dashboards. A dashboard that updates the test statistic and rejection indicator every hour is anytime-valid as long as the underlying object is an e-process. The user can refresh the dashboard as often as they like.

Adaptive sample-size choice. "Continue until the e-value crosses $20$ or two weeks elapse, whichever comes first" is a valid stopping rule. The choice between budget exhaustion and evidence threshold can itself depend on the running e-process.

Online FDR. When a stream of hypothesis tests arrives sequentially and the analyst must decide which to reject without seeing future tests, the LORD (Levels based On Recent Discovery) procedure and its variants (Javanmard-Montanari 2018, Aharoni-Rosset 2014, Tian-Ramdas 2019) control false discovery rate under arbitrary stopping. The e-value version (Wang-Ramdas 2022) handles arbitrary dependence between tests.

Canonical Example: A/B Test With Continuous Peeking

A product team runs a two-arm A/B test with binary conversion outcomes. The platform exposes hourly interim results. The team wants to stop early if the treatment looks decisively better and reject the null hypothesis $H_0: p_T - p_C \leq 0$.

Step 1: Construct an e-process $E_t$ for $H_0$ based on the running conversion-rate differential. Empirical-Bernstein constructions (Howard-Ramdas-McAuliffe-Sekhon 2021) give explicit formulas for bounded outcomes.

Step 2: Define the rejection rule "stop and reject at the first $t$ where $E_t \geq 20$." For $\alpha = 0.05$, $1/\alpha = 20$.

Step 3: Define the budget rule "stop without rejection at $t = T_{\max}$ if the e-process has not crossed $20$ by then."

The combined stopping rule $\tau = \min(\inf\{t : E_t \geq 20\},\, T_{\max})$ is a stopping time. Under $H_0$, Ville's inequality gives Type I error at most $5\%$, uniformly across all possible decision rules consistent with the stopping rule. The analyst can refresh the dashboard arbitrarily often without inflating the false-positive rate.
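The three steps can be sketched end to end. This simulation uses a plain betting e-process rather than the empirical-Bernstein construction cited in Step 1, and every name, rate, and parameter is an illustrative assumption:

```python
# End-to-end sketch of Steps 1-3 with a simple betting e-process on paired
# treatment/control conversions (illustrative parameters throughout).
import random

random.seed(4)
ALPHA, T_MAX = 0.05, 5000
LAM = 0.2                             # betting fraction, fixed in advance

def run_test(p_treat, p_ctrl):
    """Bet on each paired observation; reject when E_t >= 1/ALPHA."""
    e = 1.0
    for t in range(1, T_MAX + 1):
        y_t = float(random.random() < p_treat)   # treatment conversion
        y_c = float(random.random() < p_ctrl)    # control conversion
        d = y_t - y_c                 # in [-1, 1]; E[d] <= 0 under H0
        e *= 1 + LAM * d              # one predictable betting step
        if e >= 1 / ALPHA:            # Step 2: evidence threshold (= 20)
            return "reject", t
    return "no decision", T_MAX       # Step 3: budget exhausted

print(run_test(0.15, 0.10))           # real effect: tends to reject early
print(run_test(0.10, 0.10))           # null: rejects with prob <= 5%
```

The stopping rule here combines the evidence threshold and the budget cap exactly as in the text, and the betting fraction `LAM` is fixed before any data arrive, which is the discipline the guarantee requires.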

The trade-off: at any fixed $t$, the e-value-based interval is wider than the classical $z$-test interval. Typical width inflation is a factor of $\sqrt{\log(1/\alpha)}$ to $\sqrt{\log\log t}$ depending on the construction, the latter being the law-of-the-iterated-logarithm rate. For long experiments, the extra width is small relative to the convenience of monitoring.

Common Misconceptions

Watch Out

Anytime-valid is not always-correct

The guarantee is on the Type I error rate, not on individual decisions. A specific run can produce a Type I error; the framework controls the long-run frequency of Type I errors at level $\alpha$, just like classical hypothesis testing. The improvement over classical $p$-values is robustness to the stopping rule, not to individual-decision randomness.

Watch Out

Power is not lost in general

Anytime-valid tests are conservative at any fixed $n$ relative to the optimal Neyman-Pearson test at that $n$. They are more powerful than fixed-$n$ tests in the practical comparison, where the alternative is "any procedure that needs to monitor." Sample-size-to-detection ratios under standard alternatives are within a $\log$ factor of the Neyman-Pearson optimum, often better than corrected group-sequential boundaries.

Watch Out

Stopping rule does not need to be pre-specified

Group-sequential tests require pre-specified look schedules. Anytime-valid tests do not: the analyst can stop at any time for any reason, as long as the betting strategy (the conditional e-values) is fixed in advance. Pre-registering the betting strategy is the modern analog of pre-registering the analysis.

Worked Exercise

ExerciseAdvanced

Problem

Suppose you run an A/B test with binary outcomes. A simplified betting e-process for testing $H_0: p_T \leq p_C$ with bounded outcomes takes the form
$$E_t = \prod_{s \leq t} (1 + \lambda_s Y_s),$$
where $Y_s \in [-1, 1]$ is the running treatment-control difference of conversion indicators at time $s$, and $\lambda_s \in [0, 1]$ is a predictable betting fraction chosen before observing $Y_s$. Show that this is an e-process if $\lambda_s$ is fixed conditional on $\mathcal{F}_{s-1}$ and $\mathbb{E}_{H_0}[Y_s \mid \mathcal{F}_{s-1}] \leq 0$.
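A solution sketch (assuming the betting fractions are nonnegative, which the final inequality needs): the claim reduces to nonnegativity of each factor plus a one-step conditional bound.

```latex
% Nonnegativity: \lambda_t \in [0,1] and Y_t \in [-1,1] give
% 1 + \lambda_t Y_t \ge 1 - \lambda_t \ge 0, so E_t \ge 0 for all t.
% Supermartingale step (\lambda_t is \mathcal{F}_{t-1}-measurable):
\begin{aligned}
\mathbb{E}_{H_0}\!\left[E_t \mid \mathcal{F}_{t-1}\right]
  &= E_{t-1}\,\mathbb{E}_{H_0}\!\left[1 + \lambda_t Y_t \mid \mathcal{F}_{t-1}\right] \\
  &= E_{t-1}\left(1 + \lambda_t\,\mathbb{E}_{H_0}\!\left[Y_t \mid \mathcal{F}_{t-1}\right]\right)
  \;\le\; E_{t-1},
\end{aligned}
```

since $\lambda_t \geq 0$ and $\mathbb{E}_{H_0}[Y_t \mid \mathcal{F}_{t-1}] \leq 0$. Iterating and taking expectations gives $\mathbb{E}_{H_0}[E_t] \leq \mathbb{E}_{H_0}[E_0] = 1$ for every $t$, so $(E_t)$ is a nonnegative supermartingale starting at one: an e-process.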

Implementation Note

The confseq Python package (companion to Howard et al. 2021) implements these mixture boundaries. The snippet below sketches the intended shape of the API; exact function names and signatures vary by version, so check the package documentation:

from confseq.boundaries import gamma_exponential_log_mixture
e_process = gamma_exponential_log_mixture(
    x_samples,           # observed sample-mean differences
    v_samples,           # cumulative variance estimates
    alpha=0.05,
)
if e_process[-1] >= 1 / 0.05:
    print("Reject H_0 at time", len(x_samples))

For ML evaluation, the peeky and evalsync open-source packages wrap empirical-Bernstein e-processes for benchmark scoring. Both report the running e-value, the calibrated pp-value, and the current confidence sequence after each new item.

The discipline rule that matters most: the betting strategy (or alternative model) must be fixed before each observation. The stopping rule can be anything. In production, this typically means: pre-register the betting fractions or model parameters in code; let the analyst choose when to inspect the dashboard.

References

Canonical:

  • Ramdas, A., Grünwald, P., Vovk, V., and Shafer, G. (2023). "Game-theoretic statistics and safe anytime-valid inference." Statistical Science 38(4), pp. 576-601. Survey paper with anytime-valid inference as the unifying theme.
  • Howard, S. R., Ramdas, A., McAuliffe, J., and Sekhon, J. (2021). "Time-uniform Chernoff bounds via nonnegative supermartingales." Probability Surveys 18, pp. 257-317. Quantitative time-uniform concentration that produces the practical confidence sequences.
  • Johari, R., Koomen, P., Pekelis, L., and Walsh, D. (2017). "Peeking at A/B Tests: Why it matters, and what to do about it." Proceedings of KDD 2017. The Optimizely paper that popularized the framing for industry A/B testing.

Current:

  • Waudby-Smith, I. and Ramdas, A. (2024). "Estimating means of bounded random variables by betting." Journal of the Royal Statistical Society, Series B 86(1), pp. 1-27. State-of-the-art betting construction.
  • Maharaj, K., Williamson, R. J., Mathieu, A., and Williamson, T. (2023). "Anytime-valid inference for multinomial counts and stratified means via betting." Preprint arXiv:2310.19527. Multinomial extensions.
  • Javanmard, A. and Montanari, A. (2018). "Online rules for control of false discovery rate and false discovery exceedance." Annals of Statistics 46(2). LORD-style online FDR with anytime guarantees.

Historical:

  • Ville, J. (1939). Étude critique de la notion de collectif (Gauthier-Villars). The original supermartingale inequality.
  • Robbins, H. (1970). "Statistical methods related to the law of the iterated logarithm." Annals of Mathematical Statistics 41(5). The mixture-method confidence sequences that pre-date the modern e-process literature.


Last reviewed: May 13, 2026
