

Concentration Inequalities

Bounds on how far random variables deviate from their expectations: Markov, Chebyshev, Hoeffding, and Bernstein. Used throughout generalization theory, bandits, and sample complexity.


Why This Matters

Concentration inequalities are the backbone of statistical learning theory. Every generalization bound, every sample complexity result, every PAC-learning theorem ultimately rests on a statement of the form: "the empirical average is close to the true expectation with high probability."

When you see a bound like \hat{R}_n(h) \approx R(h) in ERM theory, the proof always invokes a concentration inequality. Without these tools, you cannot prove anything about learning from finite data.

[Figure: tail probability \Pr(X > t) on a log scale (10^0 down to 10^{-6}) versus t in standard deviations from the mean (0 to 6), comparing Gaussian, sub-Gaussian, sub-exponential, and heavy-tailed decay.]

These are the four inequalities you will use most often, in order of increasing power and increasing assumptions:

  1. Markov: requires only non-negativity
  2. Chebyshev: requires finite variance
  3. Hoeffding: requires bounded random variables
  4. Bernstein: requires bounded range and uses variance information
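The trade-off in this list can be seen numerically. The sketch below (helper names are ours) evaluates the Chebyshev, Hoeffding, and Bernstein bounds, as stated later in this section, in a low-variance setting; Markov is omitted since it bounds a one-sided tail of a non-negative variable directly rather than a deviation of the mean.

```python
import math

# Illustrative sketch: three tail bounds on Pr[|X_bar_n - mu| >= t] for
# n i.i.d. samples in [0, 1]; M is the almost-sure bound on |X_i - mu|.

def chebyshev_bound(n, t, sigma2):
    # sigma^2 / (n t^2), capped at the trivial bound 1
    return min(1.0, sigma2 / (n * t * t))

def hoeffding_bound(n, t, a=0.0, b=1.0):
    # 2 exp(-2 n t^2 / (b - a)^2)
    return min(1.0, 2 * math.exp(-2 * n * t * t / (b - a) ** 2))

def bernstein_bound(n, t, sigma2, M=1.0):
    # 2 exp(-(n t^2 / 2) / (sigma^2 + M t / 3))
    return min(1.0, 2 * math.exp(-(n * t * t / 2) / (sigma2 + M * t / 3)))

# Low-variance setting: sigma^2 = 0.01, far below the worst case 1/4.
n, t = 500, 0.05
print(chebyshev_bound(n, t, 0.01))  # polynomial tail
print(hoeffding_bound(n, t))        # sees only the range [0, 1]
print(bernstein_bound(n, t, 0.01))  # exploits the small variance
```

With these parameters Bernstein is many orders of magnitude tighter than Hoeffding, which is the pattern the rest of the section explains.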

Mental Model

Think of concentration inequalities as answering one question: how likely is it that the sample average \bar{X}_n = \frac{1}{n}\sum_{i=1}^n X_i deviates from \mathbb{E}[X] by more than t?

Each inequality trades assumptions for power:

  • Markov/Chebyshev give polynomial tails (1/t^2): slow decay.
  • Hoeffding/Bernstein give exponential tails (e^{-cnt^2}): fast decay.

The exponential tail bounds are what make learning theory work. With polynomial tails, you need n = O(1/(\epsilon^2 \delta)) samples. With exponential tails, you need n = O(\log(1/\delta)/\epsilon^2): the dependence on the confidence \delta drops inside a logarithm.

Formal Setup and Notation

Let X_1, X_2, \ldots, X_n be independent random variables. We write \bar{X}_n = \frac{1}{n}\sum_{i=1}^n X_i for the sample mean and \mu = \mathbb{E}[\bar{X}_n] for its expectation (which equals \mathbb{E}[X_1] when the X_i are identically distributed).

We seek upper bounds on the tail probability:

\Pr[|\bar{X}_n - \mu| \geq t]

or equivalently the one-sided tail \Pr[\bar{X}_n - \mu \geq t].

Core Definitions

Definition

Tail Probability

The tail probability of a random variable X at level t is \Pr[X \geq t]. Concentration inequalities provide upper bounds on tail probabilities, showing that X is unlikely to deviate far from its mean. A bound is useful when it decays rapidly in t.

Definition

Sub-Gaussian Behavior

A random variable exhibits sub-Gaussian behavior if its tail probability decays like e^{-ct^2} for some constant c > 0. Bounded random variables are sub-Gaussian (this is the content of Hoeffding's lemma). The Gaussian decay rate is the "gold standard" for tail bounds: it means large deviations are exponentially unlikely.

Definition

Moment Generating Function

The moment generating function (MGF) of X is M_X(\lambda) = \mathbb{E}[e^{\lambda X}]. The MGF method — bounding the MGF and then optimizing over \lambda — is the standard technique for deriving exponential concentration inequalities. It is also called the Chernoff bounding method (Chernoff 1952); Wainwright (2019) uses this terminology. Cramér's theorem is a distinct, asymptotic result about the large-deviations rate function.

Main Theorems

Theorem

Markov's Inequality

Statement

If X \geq 0, then for any t > 0:

\Pr[X \geq t] \leq \frac{\mathbb{E}[X]}{t}

Intuition

If the average value of X is small, then X cannot be large too often. If \mathbb{E}[X] = 1, then X can be \geq 100 at most 1% of the time; otherwise the average would be too high.

Proof Sketch

\mathbb{E}[X] = \mathbb{E}[X \cdot \mathbf{1}[X \geq t]] + \mathbb{E}[X \cdot \mathbf{1}[X < t]] \geq \mathbb{E}[X \cdot \mathbf{1}[X \geq t]] \geq t \cdot \Pr[X \geq t]. Divide both sides by t.
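The inequality can be sanity-checked empirically. The snippet below is an illustrative sketch (not from the original text): it draws Exp(1) samples, so \mathbb{E}[X] = 1, and confirms that the empirical tail never exceeds the Markov bound.

```python
import random

# Empirical sanity check of Markov's inequality (illustrative sketch):
# for a non-negative X, Markov guarantees Pr[X >= t] <= E[X] / t.
random.seed(0)
samples = [random.expovariate(1.0) for _ in range(100_000)]  # Exp(1), E[X] = 1
mean = sum(samples) / len(samples)
for t in (2, 5, 10):
    tail = sum(x >= t for x in samples) / len(samples)
    print(f"t={t}: empirical tail {tail:.5f} <= Markov bound {mean / t:.3f}")
```

For Exp(1), the true tail e^{-t} is far below the 1/t bound, illustrating how loose Markov is on well-behaved distributions.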

Why It Matters

Markov's inequality is the foundation that all other concentration inequalities build upon. Chebyshev applies Markov to (X - \mu)^2. Hoeffding and Bernstein apply Markov to e^{\lambda(X - \mu)}. It is the weakest bound but requires the fewest assumptions.

Failure Mode

The bound \Pr[X \geq t] \leq \mu/t decays only as 1/t, polynomially. This is far too slow for most applications. If \mathbb{E}[X] = 1, Markov gives \Pr[X \geq 10] \leq 1/10, independent of any variance or boundedness structure. For well-behaved distributions the true probability is astronomically smaller. Markov uses only \mathbb{E}[X]; it cannot see \sigma. Chebyshev is the first inequality that brings variance information into the picture. You almost always want Hoeffding or Bernstein when those assumptions hold.

Theorem

Chebyshev's Inequality

Statement

For any random variable X with mean \mu and variance \sigma^2, for any t > 0:

\Pr[|X - \mu| \geq t] \leq \frac{\sigma^2}{t^2}

For the sample mean of n i.i.d. variables with variance \sigma^2:

\Pr[|\bar{X}_n - \mu| \geq t] \leq \frac{\sigma^2}{nt^2}

Intuition

Chebyshev uses variance information: if the spread of X around its mean is small (\sigma^2 is small), then large deviations are unlikely. Applied to the sample mean, the variance decreases as 1/n, giving the familiar O(1/n) concentration.

Proof Sketch

Apply Markov's inequality to the non-negative random variable (X - \mu)^2:

\Pr[|X - \mu| \geq t] = \Pr[(X - \mu)^2 \geq t^2] \leq \frac{\mathbb{E}[(X - \mu)^2]}{t^2} = \frac{\sigma^2}{t^2}

Why It Matters

Chebyshev gives the first quantitative form of the law of large numbers: with probability \geq 1 - \delta, |\bar{X}_n - \mu| \leq \sigma/\sqrt{n\delta}. This requires n = O(\sigma^2/(\epsilon^2\delta)) for an \epsilon-accurate estimate. Note the 1/\delta dependence, not \log(1/\delta).
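The 1/\delta versus \log(1/\delta) gap can be made concrete. This sketch (constants chosen for illustration) computes both sample sizes for a [0, 1]-bounded variable with worst-case variance \sigma^2 = 1/4:

```python
import math

# Sketch of the delta-dependence gap for eps-accurate mean estimation:
# Chebyshev needs n >= sigma^2 / (eps^2 * delta), while Hoeffding (stated
# below in this section) needs only n >= log(2/delta) / (2 eps^2).
eps, sigma2 = 0.05, 0.25
for delta in (0.1, 0.01, 0.001):
    n_cheb = math.ceil(sigma2 / (eps ** 2 * delta))
    n_hoef = math.ceil(math.log(2 / delta) / (2 * eps ** 2))
    print(f"delta={delta}: Chebyshev n={n_cheb}, Hoeffding n={n_hoef}")
```

As \delta shrinks by factors of 10, Chebyshev's requirement grows by factors of 10 while Hoeffding's grows only by an additive constant.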

Failure Mode

The 1/t^2 decay is still polynomial. Chebyshev cannot give you the \log(1/\delta) dependence needed for efficient learning. For bounded random variables, Hoeffding gives exponentially better tails. Chebyshev is most useful when you know the variance but nothing about higher moments or boundedness.

Theorem

Hoeffding's Inequality

Statement

Let X_1, \ldots, X_n be independent random variables with a_i \leq X_i \leq b_i almost surely. Then for any t > 0:

\Pr\!\left[\bar{X}_n - \mu \geq t\right] \leq \exp\!\left(-\frac{2n^2t^2}{\sum_{i=1}^n (b_i - a_i)^2}\right)

For the two-sided version:

\Pr\!\left[|\bar{X}_n - \mu| \geq t\right] \leq 2\exp\!\left(-\frac{2n^2t^2}{\sum_{i=1}^n (b_i - a_i)^2}\right)

Special case: if all X_i \in [a, b] (identically bounded):

\Pr[|\bar{X}_n - \mu| \geq t] \leq 2\exp\!\left(-\frac{2nt^2}{(b-a)^2}\right)
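The special case can be sanity-checked against a simulation. This sketch (the helper name hoeffding_two_sided is our own) compares the bound with the empirical two-sided tail for Bernoulli(0.5) samples:

```python
import math
import random

# Sanity check (illustrative sketch): the special-case Hoeffding bound for
# [0, 1]-valued samples versus a Bernoulli(0.5) simulation.
def hoeffding_two_sided(n, t, a=0.0, b=1.0):
    return min(1.0, 2 * math.exp(-2 * n * t ** 2 / (b - a) ** 2))

random.seed(1)
n, t, trials = 100, 0.1, 20_000
hits = sum(
    abs(sum(random.random() < 0.5 for _ in range(n)) / n - 0.5) >= t
    for _ in range(trials)
)
print(hits / trials, "<=", hoeffding_two_sided(n, t))
```

The empirical frequency sits well below the bound, as it must: Hoeffding holds for every distribution on [0, 1], so it cannot be tight for any particular one.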

Intuition

Bounded random variables have sub-Gaussian tails. The width of the bounding interval (b_i - a_i) controls the "effective variance" of each variable. The bound says: the probability of a deviation of size t decays exponentially in nt^2, with the rate determined by how spread out the bounds are.

Proof Sketch

The proof uses the Chernoff/MGF method:

  1. For any \lambda > 0: \Pr[\bar{X}_n - \mu \geq t] = \Pr[e^{\lambda(\bar{X}_n - \mu)} \geq e^{\lambda t}] \leq e^{-\lambda t}\,\mathbb{E}[e^{\lambda(\bar{X}_n - \mu)}]

  2. By independence: \mathbb{E}[e^{\lambda(\bar{X}_n - \mu)}] = \prod_{i=1}^n \mathbb{E}[e^{\lambda(X_i - \mu_i)/n}]

  3. Hoeffding's lemma: if a \leq X \leq b, then \mathbb{E}[e^{\lambda(X - \mathbb{E}[X])}] \leq \exp(\lambda^2(b-a)^2/8)

  4. Combine and optimize over \lambda (set \lambda = 4n^2t/\sum_{i=1}^n(b_i - a_i)^2).

Why It Matters

Hoeffding is the workhorse of learning theory. The ERM generalization bound for finite classes, the uniform convergence bound, and many other results use Hoeffding at their core. The exponential tail gives n = O((b-a)^2 \log(1/\delta)/\epsilon^2). The \log(1/\delta) dependence is what makes high-confidence bounds feasible.

Failure Mode

Hoeffding uses only the range [a_i, b_i], not the actual variance. If the variance is much smaller than (b_i - a_i)^2/4 (e.g., a variable that is usually 0 but can be as large as 1), Hoeffding is loose. Bernstein fixes this by incorporating variance information.

Theorem

Bernstein's Inequality

Statement

Let X_1, \ldots, X_n be independent with \mathbb{E}[X_i] = \mu_i, \text{Var}(X_i) = \sigma_i^2, and |X_i - \mu_i| \leq M almost surely. Then for any t > 0:

\Pr\!\left[\sum_{i=1}^n (X_i - \mu_i) \geq t\right] \leq \exp\!\left(-\frac{t^2/2}{\sum_{i=1}^n \sigma_i^2 + Mt/3}\right)

For the sample mean with i.i.d. variables (\sigma^2 = \text{Var}(X_i)):

\Pr[|\bar{X}_n - \mu| \geq t] \leq 2\exp\!\left(-\frac{nt^2/2}{\sigma^2 + Mt/3}\right)

Intuition

Bernstein interpolates between two regimes. For small deviations (t \ll \sigma^2/M), the \sigma^2 term dominates the denominator and the bound looks like \exp(-nt^2/(2\sigma^2)), a variance-dependent Gaussian tail. For large deviations (t \gg \sigma^2/M), the Mt/3 term dominates and the bound looks like \exp(-3nt/(2M)), a sub-exponential tail. The variance term makes Bernstein much tighter than Hoeffding when the variance is small relative to the range.
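The two regimes can be seen numerically. In this sketch (the helper bernstein is our own, implementing the i.i.d. form stated above), the exponent is compared against the Gaussian-regime and sub-exponential-regime approximations at a small and a large t:

```python
import math

# Illustrative sketch of Bernstein's two regimes, with sigma^2 / M = 0.01.
def bernstein(n, t, sigma2, M):
    return 2 * math.exp(-(n * t ** 2 / 2) / (sigma2 + M * t / 3))

n, sigma2, M = 1000, 0.01, 1.0
for t in (0.001, 0.3):  # t << sigma^2/M, then t >> sigma^2/M
    exact = -(n * t ** 2 / 2) / (sigma2 + M * t / 3)
    gaussian_regime = -n * t ** 2 / (2 * sigma2)   # dominates for small t
    subexp_regime = -3 * n * t / (2 * M)           # dominates for large t
    print(f"t={t}: exponent {exact:.3f} "
          f"(Gaussian {gaussian_regime:.3f}, sub-exponential {subexp_regime:.3f})")
```

At t = 0.001 the exact exponent is within a few percent of the Gaussian approximation; at t = 0.3 it tracks the sub-exponential one instead.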

Proof Sketch

Like Hoeffding, use the Chernoff method. The key improvement is a tighter bound on the MGF:

  1. If |X - \mu| \leq M and \text{Var}(X) = \sigma^2, then \mathbb{E}[e^{\lambda(X-\mu)}] \leq \exp\!\left(\frac{\lambda^2 \sigma^2/2}{1 - \lambda M/3}\right) for 0 < \lambda < 3/M.

  2. This bound uses the variance \sigma^2 instead of the crude range bound (b-a)^2/4 that Hoeffding uses.

  3. Multiply over independent variables and optimize \lambda.

Why It Matters

Bernstein is the right inequality when you have variance information. In learning theory, this matters for sparse or low-noise settings. For example, if the loss is usually close to zero (low empirical risk, low variance), Bernstein gives much tighter bounds than Hoeffding, because Hoeffding only knows the loss is in [0, 1] while Bernstein knows the variance is small.

Failure Mode

Bernstein requires knowing (or bounding) the variance, which is not always available. If you only know the range [a, b], Hoeffding is simpler and sufficient. In practice, you can sometimes use an empirical estimate of the variance and apply Bernstein, but this introduces additional technical complications (you need a concentration bound on the variance estimate itself).

Proof Ideas and Templates Used

All four inequalities follow a common escalation pattern:

  1. Markov's trick (the universal first step): convert a tail probability into an expectation bound via \Pr[X \geq t] \leq \mathbb{E}[g(X)]/g(t) for any non-negative increasing g.

  2. Moment method: apply Markov to (X - \mu)^{2k} to get \Pr[|X - \mu| \geq t] \leq \mathbb{E}[(X-\mu)^{2k}]/t^{2k}. Chebyshev is the k = 1 case.

  3. MGF/Chernoff method: apply Markov to e^{\lambda X} for the optimal \lambda. This is exponentially stronger because e^{\lambda X} grows much faster than any polynomial. Both Hoeffding and Bernstein use this.

The key lemma in each exponential bound is controlling \mathbb{E}[e^{\lambda(X - \mu)}]:

  • Hoeffding's lemma bounds it using only boundedness: \leq e^{\lambda^2(b-a)^2/8}
  • Bernstein's condition bounds it using variance: \leq e^{\lambda^2\sigma^2/2 \cdot (1 - \lambda M/3)^{-1}}
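The optimization over \lambda can be checked numerically. This sketch minimizes the Chernoff objective e^{-\lambda t}\,e^{\lambda^2(b-a)^2/8} for a single variable bounded in [-1, 1] over a grid of \lambda and compares the result with the closed form e^{-2t^2/(b-a)^2} obtained analytically:

```python
import math

# Sketch of the Chernoff recipe for one variable bounded in [a, b]:
# minimize exp(-lambda * t) * exp(lambda^2 (b-a)^2 / 8), the objective built
# from Hoeffding's lemma, and compare with the analytically optimized bound.
a, b, t = -1.0, 1.0, 0.5

def chernoff_objective(lam):
    return math.exp(-lam * t) * math.exp(lam ** 2 * (b - a) ** 2 / 8)

best = min(chernoff_objective(k / 1000) for k in range(1, 5000))
closed_form = math.exp(-2 * t ** 2 / (b - a) ** 2)
print(best, closed_form)
```

The grid minimum lands at \lambda = 4t/(b-a)^2 and matches the closed form, which is exactly how the Hoeffding exponent arises.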

Canonical Examples

Example

Estimating a coin's bias with Hoeffding

Flip a coin n times. Each flip X_i \in \{0, 1\} with \mathbb{E}[X_i] = p. The sample mean \bar{X}_n estimates p. Hoeffding with a = 0, b = 1 gives:

\Pr[|\bar{X}_n - p| \geq \epsilon] \leq 2e^{-2n\epsilon^2}

For \epsilon = 0.01 and \delta = 0.05: n \geq \frac{\log(2/0.05)}{2(0.01)^2} = \frac{3.69}{0.0002} \approx 18{,}445.

If we know p \approx 0.01 (rare event), Bernstein with \sigma^2 = p(1-p) \approx 0.01 (and M = 1) gives roughly n \approx 980, nearly 20 times fewer samples. Bernstein exploits the small variance.
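These sample sizes follow from inverting each bound for n. The sketch below does the arithmetic (taking M = 1 as the crude almost-sure bound on |X_i - p|):

```python
import math

# Reproducing the coin-bias sample sizes (illustrative sketch).
eps, delta = 0.01, 0.05

# Hoeffding: 2 exp(-2 n eps^2) <= delta
n_hoef = math.ceil(math.log(2 / delta) / (2 * eps ** 2))

# Bernstein (i.i.d. form above): 2 exp(-(n eps^2 / 2) / (sigma^2 + M eps / 3)) <= delta
p = 0.01
sigma2 = p * (1 - p)
n_bern = math.ceil(2 * (sigma2 + eps / 3) * math.log(2 / delta) / eps ** 2)

print(n_hoef, n_bern)
```

The ratio is driven entirely by the denominator \sigma^2 + M\epsilon/3 replacing the worst-case (b-a)^2/4.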

Example

Comparing Markov, Chebyshev, and Hoeffding

Let X_1, \ldots, X_{100} be i.i.d. uniform on [0, 1]. We want \Pr[|\bar{X}_{100} - 0.5| \geq 0.1].

Markov (applied to |\bar{X} - 0.5|, crude): \leq \mathbb{E}[|\bar{X} - 0.5|]/0.1. Since \mathbb{E}[|\bar{X} - 0.5|] \leq \sqrt{\text{Var}(\bar{X})} = 1/\sqrt{1200} \approx 0.029, we get \leq 0.29.

Chebyshev: \leq \text{Var}(\bar{X})/0.01 = (1/12 \cdot 1/100)/0.01 \approx 0.083.

Hoeffding: \leq 2e^{-2 \cdot 100 \cdot 0.01} = 2e^{-2} \approx 0.27, which is worse than Chebyshev's 0.083.

Why: the Hoeffding exponent here is 2nt^2 = 2, so the two-sided bound is 2e^{-2} \approx 0.27, while Chebyshev's \sigma^2/(nt^2) = (1/12)/1 \approx 0.083 is tighter; at this (n, t) the exponent is simply not large enough for the exponential to beat the polynomial. Increase n and Hoeffding takes over: for n = 400 at the same t = 0.1, Hoeffding gives 2e^{-8} \approx 6.7 \times 10^{-4}, beating Chebyshev's 1/48 \approx 0.021.

No single inequality is universally best at all sample sizes.
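The crossover above can be reproduced in a few lines. This sketch (helper names are ours; \sigma^2 = 1/12 for Uniform[0, 1]) evaluates both bounds at n = 100 and n = 400:

```python
import math

# Reproducing the Uniform[0, 1] comparison (illustrative sketch).
def chebyshev(n, t, sigma2=1 / 12):
    return sigma2 / (n * t ** 2)

def hoeffding(n, t):
    return 2 * math.exp(-2 * n * t ** 2)

for n in (100, 400):
    print(f"n={n}: Chebyshev {chebyshev(n, 0.1):.4f}, Hoeffding {hoeffding(n, 0.1):.4f}")
# At n = 100 Chebyshev wins; by n = 400 Hoeffding has overtaken it.
```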

Example

Hoeffding in learning theory: the finite-class ERM bound

The ERM generalization bound for |\mathcal{H}| < \infty:

For each fixed h, the loss \ell(h(x_i), y_i) is a bounded random variable in [0, 1]. The empirical risk \hat{R}_n(h) is the sample mean. Hoeffding gives:

\Pr[|R(h) - \hat{R}_n(h)| > \epsilon] \leq 2e^{-2n\epsilon^2}

Union bound over all h \in \mathcal{H}:

\Pr[\sup_h |R(h) - \hat{R}_n(h)| > \epsilon] \leq 2|\mathcal{H}|e^{-2n\epsilon^2}

This is how Hoeffding feeds directly into generalization bounds.
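Inverting the union bound for n gives the finite-class sample complexity. The sketch below (the helper name erm_sample_size is our own) shows the logarithmic dependence on |\mathcal{H}|:

```python
import math

# Sketch: invert 2 |H| exp(-2 n eps^2) <= delta for n, the sample size that
# guarantees sup_h |R(h) - R_hat_n(h)| <= eps with probability >= 1 - delta.
def erm_sample_size(num_hypotheses, eps, delta):
    return math.ceil(math.log(2 * num_hypotheses / delta) / (2 * eps ** 2))

print(erm_sample_size(1000, 0.05, 0.05))
print(erm_sample_size(10 ** 6, 0.05, 0.05))  # |H| grows 1000x, n grows additively
```

Multiplying |\mathcal{H}| by 1000 adds only \log(1000)/(2\epsilon^2) samples, which is why finite-class bounds scale to large hypothesis classes.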

Common Confusions

Watch Out

Hoeffding requires bounded random variables, not just bounded variance

Hoeffding's inequality assumes a_i \leq X_i \leq b_i almost surely: the random variables must be literally bounded. It does not apply to Gaussian random variables (which are unbounded), even though they have finite variance. For Gaussians, you use the exact Gaussian tail \Pr[X - \mu \geq t] = \Phi(-t/\sigma), or the sub-Gaussian framework that generalizes Hoeffding. If someone applies Hoeffding to an unbounded variable, the result is invalid.

Watch Out

Bernstein is better than Hoeffding when variance is small, not always

Bernstein's bound is \exp(-nt^2/(2\sigma^2 + 2Mt/3)), while Hoeffding gives \exp(-2nt^2/(b-a)^2). If \sigma^2 \approx (b-a)^2/4 (i.e., the variance is as large as it can be for the given range), Bernstein offers little or no improvement. Bernstein shines when \sigma^2 \ll (b-a)^2: for example, when the random variable is usually near zero but has occasional large values. In the worst case over all distributions with a given range, Hoeffding and Bernstein are comparable.

Watch Out

The two-sided Hoeffding bound is 2x the one-sided bound, not squared

The correct two-sided bound is \Pr[|\bar{X} - \mu| \geq t] \leq 2e^{-2nt^2/(b-a)^2}. The factor of 2 comes from the union bound over the two tails: \Pr[|\bar{X} - \mu| \geq t] \leq \Pr[\bar{X} - \mu \geq t] + \Pr[\mu - \bar{X} \geq t]. A common error is to square the one-sided probability instead of doubling it.

Summary

  • Markov: \Pr[X \geq t] \leq \mathbb{E}[X]/t; weakest, needs only X \geq 0
  • Chebyshev: \Pr[|X - \mu| \geq t] \leq \sigma^2/t^2; uses variance, 1/t^2 decay
  • Hoeffding: \Pr[|\bar{X}_n - \mu| \geq t] \leq 2\exp(-2nt^2/(b-a)^2); exponential decay, needs boundedness
  • Bernstein: like Hoeffding but uses the variance \sigma^2; tighter when \sigma^2 \ll (b-a)^2
  • Exponential tails give \log(1/\delta) dependence in sample complexity; polynomial tails give 1/\delta
  • The Chernoff/MGF method is the universal technique for deriving exponential bounds
  • In learning theory, Hoeffding is the default; use Bernstein when you have variance information

Exercises

ExerciseCore

Problem

Using Hoeffding's inequality, how many fair coin flips do you need to estimate the probability of heads to within \pm 0.02 with probability at least 0.99?

ExerciseCore

Problem

Suppose X_1, \ldots, X_n are i.i.d. with X_i \in [0, 1], \mathbb{E}[X_i] = 0.01, and \text{Var}(X_i) = 0.01. Compare the sample sizes needed by Hoeffding and Bernstein to guarantee \Pr[|\bar{X}_n - 0.01| \geq 0.01] \leq 0.05.

ExerciseAdvanced

Problem

Prove Hoeffding's lemma: if a \leq X \leq b and \mathbb{E}[X] = 0, then for all \lambda > 0:

\mathbb{E}[e^{\lambda X}] \leq \exp\!\left(\frac{\lambda^2(b-a)^2}{8}\right)


References

Canonical:

  • Boucheron, Lugosi, Massart, Concentration Inequalities: A Nonasymptotic Theory of Independence (2013), Chapters 2-3
  • Shalev-Shwartz & Ben-David, Understanding Machine Learning, Appendix B

Current:

  • Vershynin, High-Dimensional Probability (2018), Chapters 2-3
  • Wainwright, High-Dimensional Statistics (2019), Chapter 2
  • van Handel, Probability in High Dimension (2016), Chapters 1-3


Last reviewed: April 2026
