What Each Theorem States
Both theorems say that a normalized sum converges in distribution to a Gaussian. They differ in what conditions the summands must satisfy.
Classical CLT: the summands are independent and identically distributed. No dependence is allowed.
Martingale CLT: the summands form a martingale difference sequence. They can be dependent, but each increment must have conditional mean zero given the past.
Side-by-Side Statement
Classical CLT (Lindeberg-Levy)
Let $X_1, X_2, \ldots$ be i.i.d. with $\mathbb{E}[X_i] = \mu$ and $\mathrm{Var}(X_i) = \sigma^2 < \infty$. Then:
$$\sqrt{n}\,(\bar{X}_n - \mu) \xrightarrow{d} \mathcal{N}(0, \sigma^2),$$
where $\bar{X}_n = \frac{1}{n}\sum_{i=1}^n X_i$.
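As a quick sanity check, a Monte Carlo simulation (a sketch using NumPy; the Exp(1) population and the sample sizes are arbitrary illustrative choices) shows the normalized sample mean behaving like a standard Gaussian:

```python
import numpy as np

rng = np.random.default_rng(0)
n, reps = 2_000, 5_000
mu, sigma = 1.0, 1.0  # Exp(1) has mean 1 and variance 1

# Draw `reps` independent samples of size n and form sqrt(n) * (sample mean - mu)
samples = rng.exponential(scale=1.0, size=(reps, n))
z = np.sqrt(n) * (samples.mean(axis=1) - mu)

# The CLT predicts z is approximately N(0, sigma^2) = N(0, 1)
print(round(z.mean(), 2), round(z.std(), 2))
```

The empirical mean and standard deviation of `z` should be close to 0 and 1, even though the Exp(1) population itself is skewed.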
Martingale CLT
Let $(D_i)_{i \ge 1}$ be a martingale difference sequence with respect to a filtration $(\mathcal{F}_i)$: $\mathbb{E}[D_i \mid \mathcal{F}_{i-1}] = 0$ for all $i$. Let $\sigma_i^2 = \mathbb{E}[D_i^2 \mid \mathcal{F}_{i-1}]$ be the conditional variance. Define $V_n = \sum_{i=1}^n \sigma_i^2$. If:
- $V_n / s_n^2 \xrightarrow{p} 1$ for some deterministic sequence $s_n^2 \to \infty$
- The Lindeberg condition holds: for all $\varepsilon > 0$, $\frac{1}{s_n^2} \sum_{i=1}^n \mathbb{E}\big[D_i^2 \,\mathbf{1}\{|D_i| > \varepsilon s_n\} \,\big|\, \mathcal{F}_{i-1}\big] \xrightarrow{p} 0$
Then:
$$\frac{1}{s_n} \sum_{i=1}^n D_i \xrightarrow{d} \mathcal{N}(0, 1).$$
Where Each Is Stronger
Classical CLT is simpler to verify
You check two things: are the variables i.i.d.? Is the variance finite? If yes to both, the CLT holds. No filtration, no conditional variance, no Lindeberg condition. For most basic statistical applications (sample means, confidence intervals), the classical CLT suffices.
Martingale CLT handles dependent data
In any setting where the data is generated adaptively (the distribution of $X_t$ depends on $X_1, \ldots, X_{t-1}$), the classical CLT does not apply. The martingale CLT covers these cases, which include most of the interesting settings in modern ML.
Where Each Fails
Classical CLT fails for adaptive/sequential data
In stochastic gradient descent, the gradient at step $t$ depends on the parameter $\theta_t$, which depends on all previous gradients, so the gradient noise terms are not independent. In multi-armed bandits, the reward at time $t$ depends on the arm selected, which depends on all previous rewards. In either case, the classical CLT does not apply.
Martingale CLT fails without mean-zero increments
The martingale difference condition is restrictive. Not all dependent sequences satisfy it. For example, if observations have a non-zero conditional mean that depends on the past (e.g., a stationary ergodic process with positive autocorrelation), you need different tools (mixing conditions, or the ergodic theorem combined with a CLT for mixing sequences).
Both fail for heavy-tailed distributions
Both CLTs require finite variance ($\sigma^2 < \infty$, or the corresponding conditional variance condition). For heavy-tailed distributions where the variance is infinite (e.g., Cauchy, or stable distributions with index $\alpha < 2$), the appropriately normalized sum converges to a stable distribution, not a Gaussian.
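The Cauchy case is easy to see numerically (a sketch: the sample mean of standard Cauchy draws is itself standard Cauchy, so its spread never shrinks as $n$ grows):

```python
import numpy as np

rng = np.random.default_rng(2)
reps = 1_000

# For a finite-variance distribution the sample mean concentrates as n grows;
# for Cauchy it does not: the sample mean of n Cauchy draws is again Cauchy.
iqrs = {}
for n in (100, 10_000):
    means = rng.standard_cauchy(size=(reps, n)).mean(axis=1)
    q25, q75 = np.percentile(means, [25, 75])
    iqrs[n] = q75 - q25
    print(n, round(iqrs[n], 2))  # IQR stays near 2 (the IQR of standard Cauchy)
```

Increasing $n$ by a factor of 100 leaves the interquartile range of the sample mean essentially unchanged, in contrast to the $1/\sqrt{n}$ shrinkage the CLT would give under finite variance.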
Key Assumptions That Differ
| | Classical CLT | Martingale CLT |
|---|---|---|
| Independence | Required (i.i.d.) | Not required |
| Identical distribution | Required | Not required |
| Mean-zero condition | $\mu$ is subtracted | $\mathbb{E}[D_i \mid \mathcal{F}_{i-1}] = 0$ |
| Variance condition | $\sigma^2 < \infty$ | Conditional variance stabilizes |
| Lindeberg condition | Automatic for i.i.d. | Must be verified |
| Applies to SGD noise | No | Yes |
| Applies to bandit regret | No | Yes |
When a Researcher Would Use Each
Confidence interval for a population mean
Given $n$ i.i.d. samples $X_1, \ldots, X_n$ from a distribution with unknown mean $\mu$ and finite variance $\sigma^2$, the classical CLT gives $\bar{X}_n \approx \mathcal{N}(\mu, \sigma^2/n)$ in distribution for large $n$. This yields the standard confidence interval $\bar{X}_n \pm z_{\alpha/2}\, \hat{\sigma} / \sqrt{n}$. The classical CLT is the right tool here.
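The interval and its coverage can be checked in a few lines (a sketch; the normal population, sample size, and 95% level are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(3)
mu, reps, n = 5.0, 2_000, 200

# Fraction of CLT-based 95% intervals xbar +/- 1.96 * s / sqrt(n) that cover mu
cover = 0
for _ in range(reps):
    x = rng.normal(loc=mu, scale=2.0, size=n)
    half = 1.96 * x.std(ddof=1) / np.sqrt(n)
    cover += (x.mean() - half <= mu <= x.mean() + half)
print(round(cover / reps, 2))  # should be close to the nominal 0.95
```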
Asymptotic normality of SGD iterates
In SGD, $\theta_{t+1} = \theta_t - \eta_t g_t$, where $g_t = \nabla f(\theta_t) + \xi_t$ is the stochastic gradient. The noise $\xi_t$ is a martingale difference (conditional mean zero given $\mathcal{F}_t$, the history up to step $t$), but it is not independent of the past (since $\theta_t$ depends on all previous noise terms). The martingale CLT is needed to establish that $\sqrt{n}\,(\bar{\theta}_n - \theta^*)$, for the averaged iterate $\bar{\theta}_n$, converges to a Gaussian, which is the foundation of statistical inference for SGD.
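A toy simulation illustrates the Gaussian limit (a sketch under strong assumptions: a one-dimensional quadratic $f(\theta) = \theta^2/2$ with minimizer $\theta^* = 0$, unit-variance Gaussian gradient noise, step size $\eta_t = t^{-0.7}$, and Polyak-Ruppert iterate averaging; for this toy problem the averaged iterate has asymptotic variance $\sigma^2/n$):

```python
import numpy as np

rng = np.random.default_rng(4)
n, reps, sigma = 20_000, 500, 1.0

theta = np.ones(reps)  # one SGD run per entry, each started at theta_0 = 1
avg = np.zeros(reps)   # running sum of iterates for Polyak-Ruppert averaging
for t in range(1, n + 1):
    g = theta + sigma * rng.standard_normal(reps)  # unbiased gradient of theta^2/2
    theta -= t ** -0.7 * g                         # decaying step size
    avg += theta

# sqrt(n) * (averaged iterate - theta*) should be approximately N(0, sigma^2)
z = np.sqrt(n) * (avg / n)
print(round(z.mean(), 2), round(z.std(), 2))
```

The raw iterates $\theta_t$ are dependent across time, yet the normalized averaged iterate comes out approximately Gaussian with the predicted scale.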
Regret analysis in bandits
In a multi-armed bandit, the reward at time $t$ depends on the arm chosen, which depends on all previous rewards. The centered reward $D_t = R_t - \mu_{A_t}$ (where $A_t$ is the chosen arm and $\mu_{A_t}$ is its mean) forms a martingale difference sequence. The martingale CLT gives the asymptotic distribution of the cumulative regret, enabling construction of confidence intervals for the regret.
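A sketch with a hypothetical two-armed Bernoulli bandit run under $\varepsilon$-greedy (all parameters are illustrative): normalizing the sum of centered rewards by the accumulated conditional variance recovers a standard Gaussian, even though arm choices depend on the entire past.

```python
import numpy as np

rng = np.random.default_rng(5)
mu_arms = np.array([0.3, 0.6])  # Bernoulli arm means (illustrative)
n, reps, eps = 1_000, 500, 0.1

z = np.empty(reps)
for r in range(reps):
    counts, sums = np.zeros(2), np.zeros(2)
    s, vn = 0.0, 0.0
    for t in range(n):
        if rng.random() < eps or counts.min() == 0:
            a = int(rng.integers(2))               # explore
        else:
            a = int(np.argmax(sums / counts))      # exploit
        reward = float(rng.random() < mu_arms[a])  # Bernoulli reward
        counts[a] += 1
        sums[a] += reward
        s += reward - mu_arms[a]                   # centered reward: E[. | past] = 0
        vn += mu_arms[a] * (1.0 - mu_arms[a])      # its conditional variance
    z[r] = s / np.sqrt(vn)

print(round(z.mean(), 2), round(z.std(), 2))
```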
Common Confusions
Martingale CLT does not require stationarity
The classical CLT uses identical distributions (stationarity). The martingale CLT allows the conditional variance to change over time. The key requirement is that the sum of conditional variances grows at a predictable rate. This flexibility is what makes it applicable to non-stationary settings like SGD with decaying learning rates.
The Lindeberg condition is not always easy to check
For the classical CLT with i.i.d. variables, the Lindeberg condition is automatic (it follows from finite variance). For martingale differences, verifying it requires bounding the conditional contribution of large increments. In practice, if the martingale differences are uniformly bounded ($|D_i| \le C$ almost surely), the Lindeberg condition holds trivially. For unbounded increments, it requires more work.
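The bounded-increment case is worth writing out, since the argument is one line. If $|D_i| \le C$ almost surely and $s_n \to \infty$, then for every $\varepsilon > 0$ there is an $N$ with $\varepsilon s_n > C$ for all $n \ge N$; beyond that point every indicator $\mathbf{1}\{|D_i| > \varepsilon s_n\}$ is zero, so
$$\frac{1}{s_n^2} \sum_{i=1}^{n} \mathbb{E}\big[D_i^2 \,\mathbf{1}\{|D_i| > \varepsilon s_n\} \,\big|\, \mathcal{F}_{i-1}\big] = 0 \quad \text{for all } n \ge N,$$
and the Lindeberg condition holds.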
The martingale CLT is not a strict generalization of the classical CLT
While every i.i.d. mean-zero sequence is a martingale difference sequence, the conditions of the martingale CLT (stabilization of conditional variance, Lindeberg condition) are not automatically implied by the i.i.d. assumption in exactly the same way. In practice, for i.i.d. sequences, the classical CLT is simpler and sharper. The martingale CLT is more general, but at the cost of more conditions to verify.
What to Memorize
- Classical CLT: i.i.d., finite variance, $\sqrt{n}\,(\bar{X}_n - \mu) \xrightarrow{d} \mathcal{N}(0, \sigma^2)$.
- Martingale CLT: martingale difference sequence, conditional variance stabilizes, Lindeberg condition, same Gaussian limit.
- When independence fails: adaptive algorithms (SGD, bandits, online learning) need the martingale CLT.
- The key structural requirement of the martingale CLT: conditional mean zero given the past. This is the "unbiased noise" condition.
- Practical check: if increments are uniformly bounded, the Lindeberg condition is free. Focus on verifying the conditional variance condition.