What Each Theorem States
Both theorems say that a normalized sum converges in distribution to a Gaussian. They differ in what conditions the summands must satisfy.
Classical CLT: the summands are independent and identically distributed. No dependence is allowed.
Martingale CLT: the summands form a martingale difference sequence. They can be dependent, but each increment must have conditional mean zero given the past.
Side-by-Side Statement
Classical CLT (Lindeberg-Levy)
Let $X_1, X_2, \ldots$ be i.i.d. with $\mathbb{E}[X_i] = \mu$ and $\mathrm{Var}(X_i) = \sigma^2 < \infty$. Then:
$$\sqrt{n}\,(\bar{X}_n - \mu) \xrightarrow{d} \mathcal{N}(0, \sigma^2),$$
where $\bar{X}_n = \frac{1}{n}\sum_{i=1}^n X_i$.
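As a quick sanity check, a Monte Carlo simulation (a sketch using NumPy; the Exp(1) population and the sample sizes are arbitrary illustrative choices) shows the normalized sample mean behaving like a standard Gaussian:

```python
import numpy as np

rng = np.random.default_rng(0)
n, reps = 2_000, 5_000
mu, sigma = 1.0, 1.0  # Exp(1) has mean 1 and variance 1

# Draw `reps` independent samples of size n and form sqrt(n) * (sample mean - mu)
samples = rng.exponential(scale=1.0, size=(reps, n))
z = np.sqrt(n) * (samples.mean(axis=1) - mu)

# The CLT predicts z is approximately N(0, sigma^2) = N(0, 1)
print(round(z.mean(), 2), round(z.std(), 2))
```

The empirical mean and standard deviation of `z` should be close to 0 and 1, even though the Exp(1) population itself is skewed.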
Martingale CLT
Let $(D_i)_{i \ge 1}$ be a martingale difference sequence with respect to a filtration $(\mathcal{F}_i)$: $\mathbb{E}[D_i \mid \mathcal{F}_{i-1}] = 0$ for all $i$. Let $\sigma_i^2 = \mathbb{E}[D_i^2 \mid \mathcal{F}_{i-1}]$ be the conditional variance. Define $V_n = \sum_{i=1}^n \sigma_i^2$. If:
- $V_n / s_n^2 \xrightarrow{p} 1$ for some deterministic sequence $s_n^2 \to \infty$
- The Lindeberg condition holds: for all $\varepsilon > 0$, $\frac{1}{s_n^2} \sum_{i=1}^n \mathbb{E}\big[D_i^2 \,\mathbf{1}\{|D_i| > \varepsilon s_n\} \,\big|\, \mathcal{F}_{i-1}\big] \xrightarrow{p} 0$
Then:
$$\frac{1}{s_n} \sum_{i=1}^n D_i \xrightarrow{d} \mathcal{N}(0, 1).$$
Where Each Is Stronger
Classical CLT is simpler to verify
You check two things: are the variables i.i.d.? Is the variance finite? If yes to both, the CLT holds. No filtration, no conditional variance, no Lindeberg condition. For most basic statistical applications (sample means, confidence intervals), the classical CLT suffices.
Martingale CLT handles dependent data
In any setting where the data is generated adaptively (the distribution of $X_t$ depends on $X_1, \ldots, X_{t-1}$), the classical CLT does not apply. The martingale CLT covers these cases, which include most of the interesting settings in modern ML.
Where Each Fails
Classical CLT fails for adaptive/sequential data
In stochastic gradient descent, the gradient at step $t$ depends on the parameter $\theta_t$, which depends on all previous gradients, so the gradient noise terms are not independent. In multi-armed bandits, the reward at time $t$ depends on the arm selected, which depends on all previous rewards. In either case, the classical CLT does not apply.
Martingale CLT fails without mean-zero increments
The martingale difference condition is restrictive. Not all dependent sequences satisfy it. For example, if observations have a non-zero conditional mean that depends on the past (e.g., a stationary ergodic process with positive autocorrelation), you need different tools (mixing conditions, or the ergodic theorem combined with a CLT for mixing sequences).
Both fail for heavy-tailed distributions
Both CLTs require finite variance ($\sigma^2 < \infty$, or the corresponding conditional variance condition). For heavy-tailed distributions where the variance is infinite (e.g., Cauchy, or stable distributions with index $\alpha < 2$), the appropriately normalized sum converges to a stable distribution, not a Gaussian.
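The Cauchy case is easy to see numerically (a sketch: the sample mean of standard Cauchy draws is itself standard Cauchy, so its spread never shrinks as $n$ grows):

```python
import numpy as np

rng = np.random.default_rng(2)
reps = 1_000

# For a finite-variance distribution the sample mean concentrates as n grows;
# for Cauchy it does not: the sample mean of n Cauchy draws is again Cauchy.
iqrs = {}
for n in (100, 10_000):
    means = rng.standard_cauchy(size=(reps, n)).mean(axis=1)
    q25, q75 = np.percentile(means, [25, 75])
    iqrs[n] = q75 - q25
    print(n, round(iqrs[n], 2))  # IQR stays near 2 (the IQR of standard Cauchy)
```

Increasing $n$ by a factor of 100 leaves the interquartile range of the sample mean essentially unchanged, in contrast to the $1/\sqrt{n}$ shrinkage the CLT would give under finite variance.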
Key Assumptions That Differ
| | Classical CLT | Martingale CLT |
|---|---|---|
| Independence | Required (i.i.d.) | Not required |
| Identical distribution | Required | Not required |
| Mean-zero condition | $\mu$ is subtracted | $\mathbb{E}[D_i \mid \mathcal{F}_{i-1}] = 0$ |
| Variance condition | $\sigma^2 < \infty$ | Conditional variance stabilizes |
| Lindeberg condition | Automatic for i.i.d. | Must be verified |
| Applies to SGD noise | No | Yes |
| Applies to bandit regret | No | Yes |
When a Researcher Would Use Each
Confidence interval for a population mean
Given $n$ i.i.d. samples $X_1, \ldots, X_n$ from a distribution with unknown mean $\mu$ and finite variance $\sigma^2$, the classical CLT gives $\bar{X}_n \approx \mathcal{N}(\mu, \sigma^2/n)$ in distribution for large $n$. This yields the standard confidence interval $\bar{X}_n \pm z_{\alpha/2}\, \hat{\sigma} / \sqrt{n}$. The classical CLT is the right tool here.
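The interval and its coverage can be checked in a few lines (a sketch; the normal population, sample size, and 95% level are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(3)
mu, reps, n = 5.0, 2_000, 200

# Fraction of CLT-based 95% intervals xbar +/- 1.96 * s / sqrt(n) that cover mu
cover = 0
for _ in range(reps):
    x = rng.normal(loc=mu, scale=2.0, size=n)
    half = 1.96 * x.std(ddof=1) / np.sqrt(n)
    cover += (x.mean() - half <= mu <= x.mean() + half)
print(round(cover / reps, 2))  # should be close to the nominal 0.95
```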
Asymptotic normality of SGD iterates
In SGD, $\theta_{t+1} = \theta_t - \eta_t g_t$, where $g_t = \nabla f(\theta_t) + \xi_t$ is the stochastic gradient. The noise $\xi_t$ is a martingale difference (conditional mean zero given $\mathcal{F}_t$, the history up to step $t$), but it is not independent of the past (since $\theta_t$ depends on all previous noise terms). The martingale CLT is needed to establish that $\sqrt{n}\,(\bar{\theta}_n - \theta^*)$, for the averaged iterate $\bar{\theta}_n$, converges to a Gaussian, which is the foundation of statistical inference for SGD.
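A toy simulation illustrates the Gaussian limit (a sketch under strong assumptions: a one-dimensional quadratic $f(\theta) = \theta^2/2$ with minimizer $\theta^* = 0$, unit-variance Gaussian gradient noise, step size $\eta_t = t^{-0.7}$, and Polyak-Ruppert iterate averaging; for this toy problem the averaged iterate has asymptotic variance $\sigma^2/n$):

```python
import numpy as np

rng = np.random.default_rng(4)
n, reps, sigma = 20_000, 500, 1.0

theta = np.ones(reps)  # one SGD run per entry, each started at theta_0 = 1
avg = np.zeros(reps)   # running sum of iterates for Polyak-Ruppert averaging
for t in range(1, n + 1):
    g = theta + sigma * rng.standard_normal(reps)  # unbiased gradient of theta^2/2
    theta -= t ** -0.7 * g                         # decaying step size
    avg += theta

# sqrt(n) * (averaged iterate - theta*) should be approximately N(0, sigma^2)
z = np.sqrt(n) * (avg / n)
print(round(z.mean(), 2), round(z.std(), 2))
```

The raw iterates $\theta_t$ are dependent across time, yet the normalized averaged iterate comes out approximately Gaussian with the predicted scale.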
Regret analysis in bandits
In a multi-armed bandit, the reward at time $t$ depends on the arm chosen, which depends on all previous rewards. The centered reward $D_t = R_t - \mu_{A_t}$ (where $A_t$ is the chosen arm and $\mu_{A_t}$ is its mean) forms a martingale difference sequence. The martingale CLT gives the asymptotic distribution of the cumulative regret, enabling construction of confidence intervals for the regret.
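A sketch with a hypothetical two-armed Bernoulli bandit run under $\varepsilon$-greedy (all parameters are illustrative): normalizing the sum of centered rewards by the accumulated conditional variance recovers a standard Gaussian, even though arm choices depend on the entire past.

```python
import numpy as np

rng = np.random.default_rng(5)
mu_arms = np.array([0.3, 0.6])  # Bernoulli arm means (illustrative)
n, reps, eps = 1_000, 500, 0.1

z = np.empty(reps)
for r in range(reps):
    counts, sums = np.zeros(2), np.zeros(2)
    s, vn = 0.0, 0.0
    for t in range(n):
        if rng.random() < eps or counts.min() == 0:
            a = int(rng.integers(2))               # explore
        else:
            a = int(np.argmax(sums / counts))      # exploit
        reward = float(rng.random() < mu_arms[a])  # Bernoulli reward
        counts[a] += 1
        sums[a] += reward
        s += reward - mu_arms[a]                   # centered reward: E[. | past] = 0
        vn += mu_arms[a] * (1.0 - mu_arms[a])      # its conditional variance
    z[r] = s / np.sqrt(vn)

print(round(z.mean(), 2), round(z.std(), 2))
```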
Common Confusions
Martingale CLT does not require stationarity
The classical CLT uses identical distributions (stationarity). The martingale CLT allows the conditional variance to change over time. The key requirement is that the sum of conditional variances grows at a predictable rate. This flexibility is what makes it applicable to non-stationary settings like SGD with decaying learning rates.
The Lindeberg condition is not always easy to check
For the classical CLT with i.i.d. variables, the Lindeberg condition is automatic (it follows from finite variance). For martingale differences, verifying it requires bounding the conditional contribution of large increments. In practice, if the martingale differences are uniformly bounded ($|D_i| \le C$ almost surely), the Lindeberg condition holds trivially. For unbounded increments, it requires more work.
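The bounded-increment case is worth writing out, since the argument is one line. If $|D_i| \le C$ almost surely and $s_n \to \infty$, then for every $\varepsilon > 0$ there is an $N$ with $\varepsilon s_n > C$ for all $n \ge N$; beyond that point every indicator $\mathbf{1}\{|D_i| > \varepsilon s_n\}$ is zero, so
$$\frac{1}{s_n^2} \sum_{i=1}^{n} \mathbb{E}\big[D_i^2 \,\mathbf{1}\{|D_i| > \varepsilon s_n\} \,\big|\, \mathcal{F}_{i-1}\big] = 0 \quad \text{for all } n \ge N,$$
and the Lindeberg condition holds.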
The martingale CLT is not a strict generalization of the classical CLT
While every i.i.d. mean-zero sequence is a martingale difference sequence, the conditions of the martingale CLT (stabilization of conditional variance, Lindeberg condition) are not automatically implied by the i.i.d. assumption in exactly the same way. In practice, for i.i.d. sequences, the classical CLT is simpler and sharper. The martingale CLT is more general, but at the cost of more conditions to verify.
What to Memorize
- Classical CLT: i.i.d., finite variance, $\sqrt{n}\,(\bar{X}_n - \mu) \xrightarrow{d} \mathcal{N}(0, \sigma^2)$.
- Martingale CLT: martingale difference sequence, conditional variance stabilizes, Lindeberg condition, same Gaussian limit.
- When independence fails: adaptive algorithms (SGD, bandits, online learning) need the martingale CLT.
- The key structural requirement of the martingale CLT: conditional mean zero given the past. This is the "unbiased noise" condition.
- Practical check: if increments are uniformly bounded, the Lindeberg condition is free. Focus on verifying the conditional variance condition.