

Bernstein Inequality

A variance-sensitive concentration inequality for independent bounded random variables. Bernstein sharpens Hoeffding when the variance is much smaller than the worst-case range.

Core · Tier 1 · Stable · Supporting · ~40 min

Why This Matters

Hoeffding's inequality uses only the range of each summand. That is safe, but it can be loose when the summands are usually small. Bernstein's inequality adds one more piece of information: the variance. The reward is a tighter tail bound in low-variance or sparse-loss regimes.

This is the concentration inequality behind many variance-sensitive learning bounds. If a classifier makes rare errors, or a loss is usually near zero, a range-only bound treats the problem as if worst-case losses happen often. Bernstein separates two facts:

  1. the variables are bounded, so very large jumps cannot happen;
  2. the variance is small, so moderate deviations are rarer than Hoeffding predicts.

Mental Model

Bernstein has two regimes.

| Deviation size | Dominant term | Tail shape | Interpretation |
| --- | --- | --- | --- |
| Small $t$ | variance $v$ | $\exp(-t^2/(2v))$ | Gaussian-like concentration with the true variance |
| Large $t$ | range $Mt$ | $\exp(-ct/M)$ | sub-exponential tail caused by bounded but nonzero jump size |

Hoeffding behaves as if the variance were always near the worst-case range variance. Bernstein uses the actual variance proxy.
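
A quick numeric sketch of the two regimes (the helper `bernstein_exponent` is ours, not from any library; $v$ and $M$ are the variance proxy and range bound defined in the next section):

```python
import math

# Bernstein exponent: (t^2/2) / (v + M*t/3), with v the variance proxy
# and M the almost-sure bound on each centered summand.
def bernstein_exponent(t, v, M):
    return (t**2 / 2) / (v + M * t / 3)

v, M = 0.01, 1.0  # small variance, unit range
for t in (0.01, 0.03, 0.3, 3.0):
    gaussian = t**2 / (2 * v)  # small-t reference: exp(-t^2/(2v))
    linear = 3 * t / (2 * M)   # large-t reference: exp(-3t/(2M))
    print(f"t={t:<5} bernstein={bernstein_exponent(t, v, M):.4f} "
          f"gaussian={gaussian:.4f} linear={linear:.4f}")
```

With these numbers the Bernstein exponent tracks the Gaussian reference for $t$ well below $3v/M = 0.03$ and the linear reference well above it.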

Formal Setup

Let $X_1,\ldots,X_n$ be independent real-valued random variables. Write $\mu_i = \mathbb{E}[X_i]$ and assume each centered variable is bounded by $M$:

$$|X_i - \mu_i| \leq M \quad \text{almost surely},$$

and define the variance proxy

$$v = \sum_{i=1}^n \mathrm{Var}(X_i).$$

The quantity of interest is the centered sum

$$S_n = \sum_{i=1}^n (X_i - \mu_i).$$

Theorem

Scalar Bernstein Inequality

Statement

If $X_1,\ldots,X_n$ are independent, $\mathbb{E}[X_i]=\mu_i$, $|X_i-\mu_i|\leq M$ almost surely, and $v=\sum_{i=1}^n \mathrm{Var}(X_i)$, then for every $t>0$:

$$\Pr\!\left[\sum_{i=1}^n (X_i-\mu_i) \geq t\right] \leq \exp\!\left(-\frac{t^2/2}{v + Mt/3}\right).$$

A two-sided version follows by applying the same bound to $-S_n$ and using a union bound:

$$\Pr[|S_n|\geq t] \leq 2\exp\!\left(-\frac{t^2/2}{v + Mt/3}\right).$$
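
The bound transcribes directly into code. A minimal sketch (the function name and interface are ours):

```python
import math

def bernstein_tail_bound(t, v, M, two_sided=True):
    """Bernstein upper bound on Pr[|S_n| >= t] (or Pr[S_n >= t] if
    two_sided=False), with v = sum of Var(X_i) and |X_i - mu_i| <= M."""
    if t <= 0:
        raise ValueError("t must be positive")
    bound = math.exp(-(t**2 / 2) / (v + M * t / 3))
    return min(1.0, (2.0 if two_sided else 1.0) * bound)

# Example: 1000 summands, each with variance 0.01 and centered range bound 1.
print(bernstein_tail_bound(t=50.0, v=1000 * 0.01, M=1.0))
```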

Intuition

The denominator has two terms. The variance term $v$ controls ordinary fluctuations near the mean. The bounded-jump term $Mt/3$ prevents the bound from pretending the tail is Gaussian forever. Far enough from the mean, a single large-but-bounded jump becomes the limiting obstruction.
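
A standard way to make the two regimes explicit (a routine consequence, not part of the statement above): since $v + Mt/3 \leq 2\max(v, Mt/3)$,

$$\frac{t^2/2}{v + Mt/3} \;\geq\; \min\!\left(\frac{t^2}{4v},\, \frac{3t}{4M}\right), \qquad\text{so}\qquad \Pr[S_n\geq t] \leq \exp\!\left(-\min\!\left(\frac{t^2}{4v},\, \frac{3t}{4M}\right)\right),$$

with the regimes switching at $t = 3v/M$.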

Proof Sketch

Use the Chernoff method. For $\lambda>0$:

$$\Pr[S_n\geq t] \leq e^{-\lambda t}\,\mathbb{E}[e^{\lambda S_n}].$$

Independence factors the moment generating function into a product. A Bernstein MGF lemma bounds each centered bounded summand:

$$\mathbb{E}[e^{\lambda(X_i-\mu_i)}] \leq \exp\!\left(\frac{\lambda^2\,\mathrm{Var}(X_i)}{2(1-\lambda M/3)}\right)$$

for $0<\lambda<3/M$. Multiplying over $i$ gives:

$$\mathbb{E}[e^{\lambda S_n}] \leq \exp\!\left(\frac{\lambda^2 v}{2(1-\lambda M/3)}\right).$$

Optimizing the resulting exponent over $\lambda$ gives the displayed Bernstein bound.
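
For completeness (a routine calculation, sketched here): the optimizer is $\lambda = t/(v + Mt/3)$, which satisfies $0 < \lambda < 3/M$. Substituting, $1 - \lambda M/3 = v/(v + Mt/3)$, so the Chernoff exponent becomes

$$-\lambda t + \frac{\lambda^2 v}{2(1-\lambda M/3)} = -\frac{t^2}{v + Mt/3} + \frac{t^2/2}{v + Mt/3} = -\frac{t^2/2}{v + Mt/3},$$

which is exactly the exponent in the theorem.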

Failure Mode

The bounded-deviation assumption is doing real work. If the variables are unbounded but merely have finite variance, Bernstein does not apply. Use Chebyshev, a sub-exponential condition, or a truncation argument instead.

Corollary

Bernstein Bound for a Sample Mean

Statement

If $X_1,\ldots,X_n$ are i.i.d., $\mathbb{E}[X_i]=\mu$, $|X_i-\mu|\leq M$ almost surely, and $\mathrm{Var}(X_i)=\sigma^2$, then:

$$\Pr[|\bar{X}_n-\mu|\geq \epsilon] \leq 2\exp\!\left(-\frac{n\epsilon^2/2}{\sigma^2 + M\epsilon/3}\right).$$
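
Inverting the corollary for a sample size is a common use. Setting the right-hand side to $\delta$ and solving gives $n \geq 2(\sigma^2 + M\epsilon/3)\ln(2/\delta)/\epsilon^2$. A minimal sketch (the function name is ours):

```python
import math

def bernstein_sample_size(eps, delta, sigma2, M=1.0):
    """Smallest n with 2*exp(-(n*eps^2/2)/(sigma2 + M*eps/3)) <= delta,
    so the sample mean is within eps of mu with probability >= 1 - delta."""
    return math.ceil(2 * (sigma2 + M * eps / 3) * math.log(2 / delta) / eps**2)

# Low-variance regime: the required n tracks sigma^2, not the worst case.
print(bernstein_sample_size(eps=0.01, delta=0.05, sigma2=0.01))  # small variance
print(bernstein_sample_size(eps=0.01, delta=0.05, sigma2=0.25))  # worst case for [0,1]
```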

Why It Matters

This is the form used in statistics and learning theory. It says the sample mean concentrates at a rate controlled by the true variance for small $\epsilon$, while still respecting the bounded range for larger deviations.

Bernstein vs. Hoeffding

For $X_i\in[0,1]$, Hoeffding gives:

$$\Pr[|\bar{X}_n-\mu|\geq \epsilon] \leq 2e^{-2n\epsilon^2}.$$

Bernstein, with $M=1$, gives:

$$\Pr[|\bar{X}_n-\mu|\geq \epsilon] \leq 2\exp\!\left(-\frac{n\epsilon^2/2}{\sigma^2+\epsilon/3}\right).$$

If $\sigma^2$ is near $1/4$, the two bounds are comparable up to constants. If $\sigma^2\ll 1$, Bernstein can be much tighter.
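
The crossover can be made exact for this pair of exponents (a quick computation, not a claim from the references):

$$\frac{n\epsilon^2/2}{\sigma^2 + \epsilon/3} > 2n\epsilon^2 \iff \sigma^2 + \frac{\epsilon}{3} < \frac{1}{4} \iff \sigma^2 < \frac{1}{4} - \frac{\epsilon}{3},$$

so for small $\epsilon$, Bernstein's exponent wins exactly when $\sigma^2$ falls below roughly $1/4$, the largest possible variance of a $[0,1]$-valued variable.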

Example

Suppose $X_i\in[0,1]$, $\mathbb{E}[X_i]=0.01$, and $\mathrm{Var}(X_i)\approx 0.01$. To bound $\Pr[|\bar{X}_n-0.01|\geq 0.01]$, Hoeffding uses only the range and behaves like $\exp(-0.0002n)$. Bernstein uses the variance and behaves like $\exp(-0.00375n)$ under the common two-sided sample-mean form. The exponent is about 19 times larger, so the required sample size is far smaller.
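
The arithmetic is easy to check; this sketch just recomputes both per-sample exponents:

```python
eps, sigma2, M = 0.01, 0.01, 1.0

hoeffding = 2 * eps**2                             # per-sample exponent in 2*exp(-2*n*eps^2)
bernstein = (eps**2 / 2) / (sigma2 + M * eps / 3)  # per-sample exponent in the corollary

print(f"Hoeffding: {hoeffding:.6f}")             # 0.000200
print(f"Bernstein: {bernstein:.6f}")             # 0.003750
print(f"Ratio:     {bernstein / hoeffding:.2f}")  # 18.75, about 19x
```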

Common Confusions

Watch Out

Bernstein is not always smaller than Hoeffding

Bernstein is variance-sensitive, not magic. Constants matter. At small sample sizes or when the variance is not much smaller than the worst-case range variance, Hoeffding can be just as useful.

Watch Out

The variance must be controlled

The bound needs a variance proxy. If you estimate variance from the same sample, that estimate needs its own concentration argument. Plugging in an empirical variance without accounting for its error is a different theorem.

Watch Out

Bounded range and sub-exponential are related but not identical

The scalar Bernstein inequality above assumes bounded centered deviations. The sub-exponential Bernstein bound replaces boundedness with an MGF condition that holds near zero. The two statements have similar two-regime shapes, but their assumptions are not the same.

Exercises

Exercise · Core

Problem

Let $X_i\in[0,1]$ be i.i.d. with $\mathrm{Var}(X_i)=0.01$. Use the sample-mean Bernstein bound to upper-bound $\Pr[|\bar{X}_n-\mu|\geq 0.05]$.

Exercise · Advanced

Problem

Explain why Bernstein is often the right bound for finite-class generalization when the loss is bounded and usually near zero.


References

Canonical:

  • Bernstein, S. N. (1924). Original paper introducing Bernstein-type probability inequalities.
  • Boucheron, Lugosi, Massart, Concentration Inequalities: A Nonasymptotic Theory of Independence (2013), Chapters 2-3
  • Massart, Concentration Inequalities and Model Selection (2007), Chapter 2

Current:

  • Vershynin, High-Dimensional Probability (2018), Chapter 2
  • Wainwright, High-Dimensional Statistics (2019), Chapter 2
  • Mohri, Rostamizadeh, Talwalkar, Foundations of Machine Learning (2018), Chapters 2-3


Last reviewed: April 30, 2026
