Sequential Inference
Anytime-Valid Inference
A framework where statistical guarantees hold simultaneously at every stopping time, not just at a pre-specified sample size. Built on e-processes and Ville's inequality. The decision rule and the stopping rule can both depend on data without inflating Type I error. The technical setting behind continuous A/B-test monitoring, adaptive clinical trials, and rolling LLM evaluations.
Why This Matters
The dominant pattern of modern statistical practice is continuous monitoring. An A/B testing platform shows interim results every hour. A clinical trial has a Data Safety Monitoring Board reviewing accumulating outcomes. An LLM evaluation pipeline ingests new benchmark items as they become available and updates a dashboard. In each case, an analyst is looking at the data many times and is allowed (encouraged, often) to stop early when the evidence is convincing.
Classical fixed-$n$ tests offer no guarantee under this pattern. The $p$-value from a $z$-test at sample size $n$ controls Type I error only if $n$ was chosen before any data were seen. Peeking at $n = 1, 2, \ldots$ and stopping at the first $n$ where $p_n \le \alpha$ inflates the false-positive rate well past $\alpha$. The classical fix is to pre-register the look schedule and use Pocock or O'Brien-Fleming boundaries, but those are rigid and not robust to the actual decision rule used by practitioners.
Anytime-valid inference is the formal framework that handles this case. It guarantees a single Type I error level $\alpha$ that holds at every stopping time simultaneously, including stopping times that depend on the running data. Built on e-processes and Ville's inequality, the framework recovers classical results in the fixed-$n$ regime and degrades gracefully (with a small power loss) when used at a single fixed sample size.
Formal Setup
Anytime-valid test
A sequential test of $H_0$ is anytime-valid at level $\alpha$ if there exists a rejection rule based on the running data such that, for every stopping time $\tau$ (possibly random and possibly infinite), $\mathbb{P}_{H_0}(\text{reject at time } \tau) \le \alpha$. The standard construction is to reject the first time the e-process exceeds $1/\alpha$.
Anytime-valid confidence sequence
A sequence of intervals $(C_t)_{t \ge 1}$ on a parameter $\theta$ is a $(1-\alpha)$-anytime-valid confidence sequence (CS) if $\mathbb{P}(\theta \in C_t \text{ for all } t \ge 1) \ge 1 - \alpha$. Equivalent dual statement: $\mathbb{P}(\exists\, t \ge 1 : \theta \notin C_t) \le \alpha$. The interval is the inversion of an anytime-valid test of the singleton null $H_0 : \theta = \theta_0$.
Optional continuation
The dual of optional stopping. The analyst is allowed to continue collecting data past a planned sample size $n$ if the evidence is inconclusive. Like optional stopping, this breaks classical $p$-values but is preserved by e-processes.
The Core Guarantee
The technical content of anytime-valid inference is one application of Ville's inequality.
Anytime-Valid Type I Error Control
Statement
Let $(E_t)_{t \ge 0}$ be an e-process for $H_0$, and let $\alpha \in (0, 1)$. The test that rejects at the first time $E_t$ exceeds $1/\alpha$ (and never rejects if no such time exists) has Type I error at most $\alpha$, uniformly over all stopping rules: for every stopping time $\tau$ adapted to the filtration of $(E_t)$, $\mathbb{P}_{H_0}(E_\tau \ge 1/\alpha) \le \alpha$.
Intuition
The e-process $(E_t)$ is a nonnegative supermartingale under the null with $E_0 = 1$. Ville's inequality says the probability that such a process ever exceeds a level $u > 0$ is at most $1/u$. Setting $u = 1/\alpha$ delivers the guarantee.
Proof Sketch
Ville's inequality applied to $(E_t)$ under $H_0$: for any $u > 0$, $\mathbb{P}_{H_0}(\sup_t E_t \ge u) \le \mathbb{E}[E_0]/u \le 1/u$. The supremum dominates the value at any stopping time $\tau$, so $\mathbb{P}_{H_0}(E_\tau \ge u) \le 1/u$. Choose $u = 1/\alpha$.
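The bound is easy to check empirically. A minimal Monte Carlo sketch, assuming nothing beyond NumPy (the fixed alternative $p = 0.6$, the seed, and the run counts are arbitrary choices): simulate fair-coin data under the null, track the likelihood-ratio process against the alternative, and count how often it ever crosses $1/\alpha = 20$.

```python
import numpy as np

rng = np.random.default_rng(42)
alpha = 0.05
n_runs, n_steps = 2000, 1000

ever_crossed = 0
for _ in range(n_runs):
    x = rng.binomial(1, 0.5, size=n_steps)  # data generated under H_0
    # Log of the likelihood-ratio process for Bernoulli(0.6) vs. Bernoulli(0.5):
    # a nonnegative martingale with E_0 = 1 under the null p = 0.5.
    log_lr = np.cumsum(np.where(x == 1, np.log(0.6 / 0.5), np.log(0.4 / 0.5)))
    if log_lr.max() >= np.log(1 / alpha):   # did sup_t E_t ever reach 20?
        ever_crossed += 1

# Ville's inequality guarantees the crossing probability is at most alpha.
print("empirical crossing rate:", ever_crossed / n_runs)
```

The empirical rate stays below $\alpha$ even though the maximum is taken over the whole trajectory, which is exactly the time-uniform content of the inequality.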
Why It Matters
The supremum is over all stopping times, including stopping times that depend on the running e-process value, on auxiliary signals, on calendar dates, or on the analyst's mood. Type I error control is robust to the entire decision rule, which is what makes the framework useful in messy production settings.
Failure Mode
The e-process must be built from conditional bets adapted to the filtration. Recomputing the bets retroactively, or constructing bets that look ahead, breaks the supermartingale property and invalidates the bound. The user does not have to pre-specify the stopping rule, but they do have to pre-specify the betting strategy (or the alternative model in the likelihood-ratio case).
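A minimal sketch of this failure mode under a fair-coin null (the betting fractions 0.2 and 0.5 are hypothetical): an honest bet fixed in advance stays a fair game, while a "bet" that inspects the outcome before staking grows deterministically and crosses any threshold.

```python
import numpy as np

rng = np.random.default_rng(0)
x = 2 * rng.binomial(1, 0.5, size=200) - 1      # fair +/-1 coin: H_0 is true

# Honest predictable bet: lambda = 0.2 fixed before the data arrive.
honest = np.cumprod(1 + 0.2 * x)

# Look-ahead bet that inspects x_s before staking on it: every factor is
# 1 + 0.5 * sign(x_s) * x_s = 1.5, so the process grows deterministically
# under the null -- it is no longer a supermartingale.
cheat = np.cumprod(1 + 0.5 * np.sign(x) * x)

print("honest wealth:", honest[-1])             # a fair game: E[honest_t] = 1
print("cheating wealth:", cheat[-1])            # 1.5**200, astronomically large
```

The cheating process rejects with probability one under the null, which is why predictability of the bets is non-negotiable.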
Stopping Times and Optional Stopping
A stopping time $\tau$ is a time variable whose decision at time $t$ depends only on observations available by time $t$. Formally, $\{\tau \le t\} \in \mathcal{F}_t$ for every $t$. Examples:
- Fixed sample size: $\tau = n$.
- First time the running mean crosses a threshold: $\tau = \inf\{t : \bar{X}_t \ge c\}$.
- First time the e-process exceeds the rejection threshold: $\tau = \inf\{t : E_t \ge 1/\alpha\}$.
- Calendar-based: stop on the last business day of the quarter.
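The defining property $\{\tau \le t\} \in \mathcal{F}_t$ is easy to check in code: each rule below inspects only the prefix of the data. A small NumPy sketch (the thresholds and the toy e-process are arbitrary illustrations):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.binomial(1, 0.5, size=500)            # the data stream
e = np.cumprod(1 + 0.1 * (2 * x - 1))         # a toy e-process on that stream

def first_time(condition):
    """1-based index of the first True entry, or None if none.
    A stopping time: condition[t] uses only data up to time t."""
    hits = np.flatnonzero(condition)
    return int(hits[0]) + 1 if hits.size else None

tau_fixed = 200                                             # fixed sample size
running_mean = np.cumsum(x) / np.arange(1, x.size + 1)
tau_mean = first_time(running_mean >= 0.6)                  # threshold crossing
tau_reject = first_time(e >= 1 / 0.05)                      # e-process crossing
```

A rule such as "stop at the time where the e-process was largest" is not expressible this way, because identifying that time requires looking at the future.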
The classical optional stopping theorem (Doob) says that a supermartingale satisfies $\mathbb{E}[E_\tau] \le \mathbb{E}[E_0]$ for every stopping time $\tau$, given enough integrability. For nonnegative supermartingales the integrability is automatic.
A common confusion: "optional stopping" in the anytime-valid literature refers to the freedom to stop the experiment whenever you want without inflating Type I error. In classical $p$-value-based testing, optional stopping is a bug. With e-processes, it is a feature.
Optional Continuation, Dashboards, and Online FDR
The dual freedom is the ability to keep collecting data past a planned stop. Three operational consequences:
Dashboards. A dashboard that updates the test statistic and rejection indicator every hour is anytime-valid as long as the underlying object is an e-process. The user can refresh the dashboard as often as they like.
Adaptive sample-size choice. "Continue until the e-value crosses $1/\alpha$ or two weeks elapse, whichever comes first" is a valid stopping rule. The choice between budget exhaustion and evidence threshold can itself depend on the running e-process.
Online FDR. When a stream of hypothesis tests arrives sequentially and the analyst must decide which to reject without seeing future tests, the LORD (Levels based On Recent Discovery) procedure and its variants (Javanmard-Montanari 2018, Aharoni-Rosset 2014, Tian-Ramdas 2019) control false discovery rate under arbitrary stopping. The e-value version (Wang-Ramdas 2022) handles arbitrary dependence between tests.
Canonical Example: A/B Test With Continuous Peeking
A product team runs a two-arm A/B test with binary conversion outcomes. The platform exposes hourly interim results. The team wants to stop early if the treatment looks decisively better and reject the null hypothesis $H_0 : p_{\mathrm{T}} = p_{\mathrm{C}}$ of equal conversion rates.
Step 1: Construct an e-process $(E_t)$ for $H_0$ based on the running conversion-rate differential. Empirical-Bernstein constructions (Howard-Ramdas-McAuliffe-Sekhon 2021) give explicit formulas for bounded outcomes.
Step 2: Define the rejection rule: stop and reject at the first $t$ where $E_t \ge 1/\alpha$. For $\alpha = 0.05$, the threshold is $1/\alpha = 20$.
Step 3: Define the budget rule: stop without rejection at a maximum horizon $T$ if the e-process has not crossed $1/\alpha$ by then.
The combined stopping rule is a stopping time. Under $H_0$, Ville's inequality gives Type I error at most $\alpha$, uniformly across all possible decision rules consistent with the stopping rule. The analyst can refresh the dashboard arbitrarily often without inflating the false-positive rate.
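The three steps can be sketched end to end with a simple betting e-process of the form $E_t = \prod_s (1 + \lambda X_s)$, a simplified stand-in for the full empirical-Bernstein construction; the conversion rates, the betting fraction $\lambda = 0.2$, and the budget below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(7)
alpha, lam, budget = 0.05, 0.2, 2000
p_treat, p_control = 0.60, 0.40        # true rates: the alternative holds

tau, e = None, 1.0
for t in range(1, budget + 1):
    # Step 1: betting e-process on the paired difference X_t in {-1, 0, 1},
    # which has mean zero under H_0. lam is constant, hence predictable.
    x = int(rng.binomial(1, p_treat)) - int(rng.binomial(1, p_control))
    e *= 1 + lam * x
    if e >= 1 / alpha:                  # Step 2: reject when E_t >= 1/alpha
        tau = t
        break
# Step 3: if the loop finishes, stop without rejection at the budget.

if tau is not None:
    print("rejected at t =", tau)
else:
    print("no rejection by t =", budget)
```

The inner loop can be interrupted and resumed at any point without affecting validity; only the constant bet had to be fixed up front.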
The trade-off: at any fixed $n$, the e-value-based interval is wider than the classical $z$-test-based interval. Typical width inflation ranges from a small constant factor up to order $\sqrt{\log \log n}$ depending on the construction, the latter being the rate appearing in the law of the iterated logarithm. For long experiments, the extra width is small relative to the convenience of monitoring.
Common Misconceptions
Anytime-valid is not always-correct
The guarantee is on the Type I error rate, not on individual decisions. A specific run can still produce a Type I error; the framework controls the long-run frequency of Type I errors at level $\alpha$, just like classical hypothesis testing. The improvement over classical $p$-values is robustness to the stopping rule, not to individual-decision randomness.
Power is not lost in general
Anytime-valid tests are conservative at any fixed $n$ relative to the optimal Neyman-Pearson test at that $n$. But they are more powerful than fixed-$n$ tests in the practical comparison, where the alternative is any procedure that needs to monitor continuously. Sample-size-to-detection ratios under standard alternatives are within a small constant factor of the Neyman-Pearson optimum, often better than corrected group-sequential boundaries.
Stopping rule does not need to be pre-specified
Group-sequential tests require pre-specified look schedules. Anytime-valid tests do not: the analyst can stop at any time for any reason, as long as the betting strategy (the conditional e-values) is fixed in advance. Pre-registering the betting strategy is the modern analog of pre-registering the analysis.
Worked Exercise
Problem
Suppose you run an A/B test with binary outcomes. A betting-style empirical-Bernstein e-process for testing $H_0 : \mathbb{E}[X_s \mid \mathcal{F}_{s-1}] = 0$ with bounded outcomes takes the form (in a simplified version) $E_t = \prod_{s=1}^{t} (1 + \lambda_s X_s)$, where $X_s \in \{-1, 0, 1\}$ is the running treatment-control difference of conversion indicators at time $s$, and $\lambda_s$ is a predictable betting fraction chosen before observing $X_s$. Show that this is an e-process if $\lambda_s$ is fixed conditional on $X_1, \ldots, X_{s-1}$ and $\lambda_s \in [0, 1)$.
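One route to the solution, sketched in two steps (nonnegativity, then the supermartingale property under the null):

```latex
% Nonnegativity: since |X_s| \le 1 and \lambda_s \in [0, 1),
1 + \lambda_s X_s \;\ge\; 1 - \lambda_s \;>\; 0
\quad\Longrightarrow\quad E_t = \prod_{s=1}^{t} (1 + \lambda_s X_s) > 0.

% Supermartingale property: \lambda_t is \mathcal{F}_{t-1}-measurable, so
\mathbb{E}[E_t \mid \mathcal{F}_{t-1}]
  = E_{t-1}\,\mathbb{E}\!\left[1 + \lambda_t X_t \mid \mathcal{F}_{t-1}\right]
  = E_{t-1}\bigl(1 + \lambda_t\,\mathbb{E}[X_t \mid \mathcal{F}_{t-1}]\bigr)
  = E_{t-1} \quad \text{under } H_0.

% With E_0 = 1, optional stopping gives \mathbb{E}[E_\tau] \le 1 for every
% stopping time \tau, the defining property of an e-process.
```

Predictability of $\lambda_t$ is exactly what lets it be pulled out of the conditional expectation in the middle step.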
Implementation Note
The confseq Python package (companion to Howard et al. 2021) gives a clean API:
```python
import math

from confseq.boundaries import gamma_exponential_log_mixture

# The function name indicates it returns the running *log* of the mixture
# e-process, so the threshold is log(1 / alpha) rather than 1 / alpha.
log_e_process = gamma_exponential_log_mixture(
    x_samples,  # observed sample-mean differences
    v_samples,  # cumulative variance estimates
    alpha=0.05,
)
if log_e_process[-1] >= math.log(1 / 0.05):
    print("Reject H_0 at time", len(x_samples))
```
For ML evaluation, the peeky and evalsync open-source packages wrap empirical-Bernstein e-processes for benchmark scoring. Both report the running e-value, the calibrated $p$-value, and the current confidence sequence after each new item.
The discipline rule that matters most: the betting strategy (or alternative model) must be fixed before each observation. The stopping rule can be anything. In production, this typically means: pre-register the betting fractions or model parameters in code; let the analyst choose when to inspect the dashboard.
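A minimal pattern for "pre-register the bets, free the stopping rule". The schedule below is a hypothetical Kelly-style plug-in, clipped for safety, not the Waudby-Smith-Ramdas formula; the point is only that each bet is a function of the past alone.

```python
import numpy as np

def next_bet(history, clip=0.5):
    """Predictable betting fraction: depends only on past observations,
    so the whole schedule can be committed to code before launch."""
    if len(history) < 2:
        return 0.1                                   # fixed initial bet
    mean = float(np.mean(history))
    var = float(np.var(history)) + 1e-6              # avoid division by zero
    return float(np.clip(mean / var, 0.0, clip))     # Kelly-style plug-in

# The analyst may stop or inspect whenever; only the bets are pre-committed.
rng = np.random.default_rng(3)
history, e = [], 1.0
for _ in range(300):
    lam = next_bet(history)                          # chosen BEFORE seeing x
    x = int(rng.binomial(1, 0.55)) - int(rng.binomial(1, 0.45))
    e *= 1 + lam * x
    history.append(x)
```

Because `next_bet` never touches the current observation, the resulting wealth process keeps the supermartingale property under the null no matter when the loop is interrupted.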
References
Canonical:
- Ramdas, A., Grünwald, P., Vovk, V., and Shafer, G. (2023). "Game-theoretic statistics and safe anytime-valid inference." Statistical Science 38(4), pp. 576-601. Survey paper with anytime-valid inference as the unifying theme.
- Howard, S. R., Ramdas, A., McAuliffe, J., and Sekhon, J. (2021). "Time-uniform, nonparametric, nonasymptotic confidence sequences." Annals of Statistics 49(2), pp. 1055-1080. Quantitative time-uniform concentration that produces the practical confidence sequences.
- Johari, R., Koomen, P., Pekelis, L., and Walsh, D. (2017). "Peeking at A/B Tests: Why it matters, and what to do about it." Proceedings of KDD 2017. The Optimizely paper that popularized the framing for industry A/B testing.
Current:
- Waudby-Smith, I. and Ramdas, A. (2024). "Estimating means of bounded random variables by betting." Journal of the Royal Statistical Society, Series B 86(1), pp. 1-27. State-of-the-art betting construction.
- Maharaj, K., Williamson, R. J., Mathieu, A., and Williamson, T. (2023). "Anytime-valid inference for multinomial counts and stratified means via betting." Preprint arXiv:2310.19527. Multinomial extensions.
- Javanmard, A. and Montanari, A. (2018). "Online rules for control of false discovery rate and false discovery exceedance." Annals of Statistics 46(2). LORD-style online FDR with anytime guarantees.
Historical:
- Ville, J. (1939). Étude critique de la notion de collectif (Gauthier-Villars). The original supermartingale inequality.
- Robbins, H. (1970). "Statistical methods related to the law of the iterated logarithm." Annals of Mathematical Statistics 41(5). The mixture-method confidence sequences that pre-date the modern e-process literature.
Next Topics
- Confidence sequences: the interval estimates of anytime-valid inference.
- Safe testing: the formal framework built directly on e-values.
- E-processes: the underlying object that powers all anytime-valid guarantees.
- E-values and anytime-valid inference: the umbrella page with proofs, multiple-testing applications, and the e-BH procedure.
- p-hacking and multiple testing: the classical pathologies that motivate the framework.
Last reviewed: May 13, 2026
Canonical graph
Required before and derived from this topic
These links come from prerequisite edges in the curriculum graph.
Required prerequisites
- Concentration Inequalities (layer 1, tier 1)
- e-values (layer 2, tier 1)
- p-values (layer 2, tier 1)
- e-processes (layer 3, tier 1)
- Martingale Theory (layer 0B, tier 2)
Derived topics
- Confidence Sequences (layer 2, tier 1)