

Concentration Inequalities

Bounds on how far random variables deviate from their expectations: Markov, Chebyshev, Hoeffding, and Bernstein. Used throughout generalization theory, bandits, and sample complexity.


Why This Matters

Concentration inequalities are the backbone of statistical learning theory. Every generalization bound, every sample complexity result, every PAC-learning theorem ultimately rests on a statement of the form: "the empirical average is close to the true expectation with high probability."

When you see a bound like \hat{R}_n(h) \approx R(h) in ERM theory, the proof always invokes a concentration inequality. Without these tools, you cannot prove anything about learning from finite data.

[Figure: tail probability \Pr(X > t) on a log scale (10^0 down to 10^{-6}) versus t in standard deviations from the mean (0 to 6), comparing Gaussian, sub-Gaussian, sub-exponential, and heavy-tailed decay.]

These are the four inequalities you will use most often, in order of increasing power and increasing assumptions:

  1. Markov: requires only non-negativity
  2. Chebyshev: requires finite variance
  3. Hoeffding: requires bounded random variables
  4. Bernstein: requires bounded range and uses variance information
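The trade-off in this list can be seen numerically. The sketch below (helper names are ours) evaluates the Chebyshev, Hoeffding, and Bernstein bounds, as stated later in this section, in a low-variance setting; Markov is omitted since it bounds a one-sided tail of a non-negative variable directly rather than a deviation of the mean.

```python
import math

# Illustrative sketch: three tail bounds on Pr[|X_bar_n - mu| >= t] for
# n i.i.d. samples in [0, 1]; M is the almost-sure bound on |X_i - mu|.

def chebyshev_bound(n, t, sigma2):
    # sigma^2 / (n t^2), capped at the trivial bound 1
    return min(1.0, sigma2 / (n * t * t))

def hoeffding_bound(n, t, a=0.0, b=1.0):
    # 2 exp(-2 n t^2 / (b - a)^2)
    return min(1.0, 2 * math.exp(-2 * n * t * t / (b - a) ** 2))

def bernstein_bound(n, t, sigma2, M=1.0):
    # 2 exp(-(n t^2 / 2) / (sigma^2 + M t / 3))
    return min(1.0, 2 * math.exp(-(n * t * t / 2) / (sigma2 + M * t / 3)))

# Low-variance setting: sigma^2 = 0.01, far below the worst case 1/4.
n, t = 500, 0.05
print(chebyshev_bound(n, t, 0.01))  # polynomial tail
print(hoeffding_bound(n, t))        # sees only the range [0, 1]
print(bernstein_bound(n, t, 0.01))  # exploits the small variance
```

With these parameters Bernstein is many orders of magnitude tighter than Hoeffding, which is the pattern the rest of the section explains.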

Mental Model

Think of concentration inequalities as answering one question: how likely is it that the sample average \bar{X}_n = \frac{1}{n}\sum_{i=1}^n X_i deviates from \mathbb{E}[X] by more than t?

Each inequality trades assumptions for power:

  • Markov/Chebyshev give polynomial tails (1/t^2): slow decay.
  • Hoeffding/Bernstein give exponential tails (e^{-cnt^2}): fast decay.

The exponential tail bounds are what make learning theory work. With polynomial tails, you need n = O(1/(\epsilon^2 \delta)) samples. With exponential tails, you need n = O(\log(1/\delta)/\epsilon^2): the dependence on the confidence \delta drops inside a logarithm.

Formal Setup and Notation

Let X_1, X_2, \ldots, X_n be independent random variables. We write \bar{X}_n = \frac{1}{n}\sum_{i=1}^n X_i for the sample mean and \mu = \mathbb{E}[\bar{X}_n] for its expectation (which equals \mathbb{E}[X_1] when the X_i are identically distributed).

We seek upper bounds on the tail probability:

\Pr[|\bar{X}_n - \mu| \geq t]

or equivalently the one-sided tail \Pr[\bar{X}_n - \mu \geq t].

Core Definitions

Definition

Tail Probability

The tail probability of a random variable X at level t is \Pr[X \geq t]. Concentration inequalities provide upper bounds on tail probabilities, showing that X is unlikely to deviate far from its mean. A bound is useful when it decays rapidly in t.

Definition

Sub-Gaussian Behavior

A random variable exhibits sub-Gaussian behavior if its tail probability decays like e^{-ct^2} for some constant c > 0. Bounded random variables are sub-Gaussian (this is the content of Hoeffding's lemma). The Gaussian decay rate is the "gold standard" for tail bounds: it means large deviations are exponentially unlikely.

Definition

Moment Generating Function

The moment generating function (MGF) of X is M_X(\lambda) = \mathbb{E}[e^{\lambda X}]. The MGF method — bounding the MGF and then optimizing over \lambda — is the standard technique for deriving exponential concentration inequalities. It is also called the Chernoff bounding method (Chernoff 1952); Wainwright (2019) uses this terminology. Cramér's theorem is a distinct, asymptotic result about the large-deviations rate function.

Main Theorems

Theorem

Markov's Inequality

Statement

If X \geq 0, then for any t > 0:

\Pr[X \geq t] \leq \frac{\mathbb{E}[X]}{t}

Intuition

If the average value of X is small, then X cannot be large too often. If \mathbb{E}[X] = 1, then X can be \geq 100 at most 1% of the time; otherwise the average would be too high.

Proof Sketch

\mathbb{E}[X] = \mathbb{E}[X \cdot \mathbf{1}[X \geq t]] + \mathbb{E}[X \cdot \mathbf{1}[X < t]] \geq \mathbb{E}[X \cdot \mathbf{1}[X \geq t]] \geq t \cdot \Pr[X \geq t]. Divide both sides by t.
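The inequality can be sanity-checked empirically. The snippet below is an illustrative sketch (not from the original text): it draws Exp(1) samples, so \mathbb{E}[X] = 1, and confirms that the empirical tail never exceeds the Markov bound.

```python
import random

# Empirical sanity check of Markov's inequality (illustrative sketch):
# for a non-negative X, Markov guarantees Pr[X >= t] <= E[X] / t.
random.seed(0)
samples = [random.expovariate(1.0) for _ in range(100_000)]  # Exp(1), E[X] = 1
mean = sum(samples) / len(samples)
for t in (2, 5, 10):
    tail = sum(x >= t for x in samples) / len(samples)
    print(f"t={t}: empirical tail {tail:.5f} <= Markov bound {mean / t:.3f}")
```

For Exp(1), the true tail e^{-t} is far below the 1/t bound, illustrating how loose Markov is on well-behaved distributions.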

Why It Matters

Markov's inequality is the foundation that all other concentration inequalities build upon. Chebyshev applies Markov to (X - \mu)^2. Hoeffding and Bernstein apply Markov to e^{\lambda(X - \mu)}. It is the weakest bound but requires the fewest assumptions.

Failure Mode

The bound \Pr[X \geq t] \leq \mu/t decays only as 1/t, polynomially. This is far too slow for most applications. If \mathbb{E}[X] = 1, Markov gives \Pr[X \geq 10] \leq 1/10, independent of any variance or boundedness structure. For well-behaved distributions the true probability is astronomically smaller. Markov uses only \mathbb{E}[X]; it cannot see \sigma. Chebyshev is the first inequality that brings variance information into the picture. You almost always want Hoeffding or Bernstein when those assumptions hold.

Theorem

Chebyshev's Inequality

Statement

For any random variable X with mean \mu and variance \sigma^2, for any t > 0:

\Pr[|X - \mu| \geq t] \leq \frac{\sigma^2}{t^2}

For the sample mean of n i.i.d. variables with variance \sigma^2:

\Pr[|\bar{X}_n - \mu| \geq t] \leq \frac{\sigma^2}{nt^2}

Intuition

Chebyshev uses variance information: if the spread of X around its mean is small (\sigma^2 is small), then large deviations are unlikely. Applied to the sample mean, the variance decreases as 1/n, giving the familiar O(1/n) concentration.

Proof Sketch

Apply Markov's inequality to the non-negative random variable (X - \mu)^2:

\Pr[|X - \mu| \geq t] = \Pr[(X - \mu)^2 \geq t^2] \leq \frac{\mathbb{E}[(X - \mu)^2]}{t^2} = \frac{\sigma^2}{t^2}

Why It Matters

Chebyshev gives the first quantitative form of the law of large numbers: with probability \geq 1 - \delta, |\bar{X}_n - \mu| \leq \sigma/\sqrt{n\delta}. This requires n = O(\sigma^2/(\epsilon^2\delta)) for an \epsilon-accurate estimate. Note the 1/\delta dependence, not \log(1/\delta).
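The 1/\delta versus \log(1/\delta) gap can be made concrete. This sketch (constants chosen for illustration) computes both sample sizes for a [0, 1]-bounded variable with worst-case variance \sigma^2 = 1/4:

```python
import math

# Sketch of the delta-dependence gap for eps-accurate mean estimation:
# Chebyshev needs n >= sigma^2 / (eps^2 * delta), while Hoeffding (stated
# below in this section) needs only n >= log(2/delta) / (2 eps^2).
eps, sigma2 = 0.05, 0.25
for delta in (0.1, 0.01, 0.001):
    n_cheb = math.ceil(sigma2 / (eps ** 2 * delta))
    n_hoef = math.ceil(math.log(2 / delta) / (2 * eps ** 2))
    print(f"delta={delta}: Chebyshev n={n_cheb}, Hoeffding n={n_hoef}")
```

As \delta shrinks by factors of 10, Chebyshev's requirement grows by factors of 10 while Hoeffding's grows only by an additive constant.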

Failure Mode

The 1/t^2 decay is still polynomial. Chebyshev cannot give you the \log(1/\delta) dependence needed for efficient learning. For bounded random variables, Hoeffding gives exponentially better tails. Chebyshev is most useful when you know the variance but nothing about higher moments or boundedness.

Theorem

Hoeffding's Inequality

Statement

Let X_1, \ldots, X_n be independent random variables with a_i \leq X_i \leq b_i almost surely. Then for any t > 0:

\Pr\!\left[\bar{X}_n - \mu \geq t\right] \leq \exp\!\left(-\frac{2n^2t^2}{\sum_{i=1}^n (b_i - a_i)^2}\right)

For the two-sided version:

\Pr\!\left[|\bar{X}_n - \mu| \geq t\right] \leq 2\exp\!\left(-\frac{2n^2t^2}{\sum_{i=1}^n (b_i - a_i)^2}\right)

Special case: if all X_i \in [a, b] (identically bounded):

\Pr[|\bar{X}_n - \mu| \geq t] \leq 2\exp\!\left(-\frac{2nt^2}{(b-a)^2}\right)
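The special case can be sanity-checked against a simulation. This sketch (the helper name hoeffding_two_sided is our own) compares the bound with the empirical two-sided tail for Bernoulli(0.5) samples:

```python
import math
import random

# Sanity check (illustrative sketch): the special-case Hoeffding bound for
# [0, 1]-valued samples versus a Bernoulli(0.5) simulation.
def hoeffding_two_sided(n, t, a=0.0, b=1.0):
    return min(1.0, 2 * math.exp(-2 * n * t ** 2 / (b - a) ** 2))

random.seed(1)
n, t, trials = 100, 0.1, 20_000
hits = sum(
    abs(sum(random.random() < 0.5 for _ in range(n)) / n - 0.5) >= t
    for _ in range(trials)
)
print(hits / trials, "<=", hoeffding_two_sided(n, t))
```

The empirical frequency sits well below the bound, as it must: Hoeffding holds for every distribution on [0, 1], so it cannot be tight for any particular one.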

Intuition

Bounded random variables have sub-Gaussian tails. The width of the bounding interval (b_i - a_i) controls the "effective variance" of each variable. The bound says: the probability of a deviation of size t decays exponentially in nt^2, with the rate determined by how spread out the bounds are.

Proof Sketch

The proof uses the Chernoff/MGF method:

  1. For any \lambda > 0: \Pr[\bar{X}_n - \mu \geq t] = \Pr[e^{\lambda(\bar{X}_n - \mu)} \geq e^{\lambda t}] \leq e^{-\lambda t}\,\mathbb{E}[e^{\lambda(\bar{X}_n - \mu)}]

  2. By independence: \mathbb{E}[e^{\lambda(\bar{X}_n - \mu)}] = \prod_{i=1}^n \mathbb{E}[e^{\lambda(X_i - \mu_i)/n}]

  3. Hoeffding's lemma: if a \leq X \leq b, then \mathbb{E}[e^{\lambda(X - \mathbb{E}[X])}] \leq \exp(\lambda^2(b-a)^2/8)

  4. Combine and optimize over \lambda (set \lambda = 4n^2t/\sum_{i=1}^n(b_i - a_i)^2).

Why It Matters

Hoeffding is the workhorse of learning theory. The ERM generalization bound for finite classes, the uniform convergence bound, and many other results use Hoeffding at their core. The exponential tail gives n = O((b-a)^2 \log(1/\delta)/\epsilon^2). The \log(1/\delta) dependence is what makes high-confidence bounds feasible.

Failure Mode

Hoeffding uses only the range [a_i, b_i], not the actual variance. If the variance is much smaller than (b_i - a_i)^2/4 (e.g., a variable that is usually 0 but can be as large as 1), Hoeffding is loose. Bernstein fixes this by incorporating variance information.

Theorem

Bernstein's Inequality

Statement

Let X_1, \ldots, X_n be independent with \mathbb{E}[X_i] = \mu_i, \text{Var}(X_i) = \sigma_i^2, and |X_i - \mu_i| \leq M almost surely. Then for any t > 0:

\Pr\!\left[\sum_{i=1}^n (X_i - \mu_i) \geq t\right] \leq \exp\!\left(-\frac{t^2/2}{\sum_{i=1}^n \sigma_i^2 + Mt/3}\right)

For the sample mean with i.i.d. variables (\sigma^2 = \text{Var}(X_i)):

\Pr[|\bar{X}_n - \mu| \geq t] \leq 2\exp\!\left(-\frac{nt^2/2}{\sigma^2 + Mt/3}\right)

Intuition

Bernstein interpolates between two regimes. For small deviations (t \ll \sigma^2/M), the \sigma^2 term dominates the denominator and the bound looks like \exp(-nt^2/(2\sigma^2)), a variance-dependent Gaussian tail. For large deviations (t \gg \sigma^2/M), the Mt/3 term dominates and the bound looks like \exp(-3nt/(2M)), a sub-exponential tail. The variance term makes Bernstein much tighter than Hoeffding when the variance is small relative to the range.
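The two regimes can be seen numerically. In this sketch (the helper bernstein is our own, implementing the i.i.d. form stated above), the exponent is compared against the Gaussian-regime and sub-exponential-regime approximations at a small and a large t:

```python
import math

# Illustrative sketch of Bernstein's two regimes, with sigma^2 / M = 0.01.
def bernstein(n, t, sigma2, M):
    return 2 * math.exp(-(n * t ** 2 / 2) / (sigma2 + M * t / 3))

n, sigma2, M = 1000, 0.01, 1.0
for t in (0.001, 0.3):  # t << sigma^2/M, then t >> sigma^2/M
    exact = -(n * t ** 2 / 2) / (sigma2 + M * t / 3)
    gaussian_regime = -n * t ** 2 / (2 * sigma2)   # dominates for small t
    subexp_regime = -3 * n * t / (2 * M)           # dominates for large t
    print(f"t={t}: exponent {exact:.3f} "
          f"(Gaussian {gaussian_regime:.3f}, sub-exponential {subexp_regime:.3f})")
```

At t = 0.001 the exact exponent is within a few percent of the Gaussian approximation; at t = 0.3 it tracks the sub-exponential one instead.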

Proof Sketch

Like Hoeffding, use the Chernoff method. The key improvement is a tighter bound on the MGF:

  1. If |X - \mu| \leq M and \text{Var}(X) = \sigma^2, then \mathbb{E}[e^{\lambda(X-\mu)}] \leq \exp\!\left(\frac{\lambda^2 \sigma^2/2}{1 - \lambda M/3}\right) for 0 < \lambda < 3/M.

  2. This bound uses the variance \sigma^2 instead of the crude range bound (b-a)^2/4 that Hoeffding uses.

  3. Multiply over independent variables and optimize \lambda.

Why It Matters

Bernstein is the right inequality when you have variance information. In learning theory, this matters for sparse or low-noise settings. For example, if the loss is usually close to zero (low empirical risk, low variance), Bernstein gives much tighter bounds than Hoeffding, because Hoeffding only knows the loss is in [0, 1] while Bernstein knows the variance is small.

Failure Mode

Bernstein requires knowing (or bounding) the variance, which is not always available. If you only know the range [a, b], Hoeffding is simpler and sufficient. In practice, you can sometimes use an empirical estimate of the variance and apply Bernstein, but this introduces additional technical complications (you need a concentration bound on the variance estimate itself).

Proof Ideas and Templates Used

All four inequalities follow a common escalation pattern:

  1. Markov's trick (the universal first step): convert a tail probability into an expectation bound via \Pr[X \geq t] \leq \mathbb{E}[g(X)]/g(t) for any non-negative increasing g.

  2. Moment method: apply Markov to (X - \mu)^{2k} to get \Pr[|X - \mu| \geq t] \leq \mathbb{E}[(X-\mu)^{2k}]/t^{2k}. Chebyshev is the k = 1 case.

  3. MGF/Chernoff method: apply Markov to e^{\lambda X} for the optimal \lambda. This is exponentially stronger because e^{\lambda X} grows much faster than any polynomial. Both Hoeffding and Bernstein use this.

The key lemma in each exponential bound is controlling \mathbb{E}[e^{\lambda(X - \mu)}]:

  • Hoeffding's lemma bounds it using only boundedness: \leq e^{\lambda^2(b-a)^2/8}
  • Bernstein's condition bounds it using variance: \leq e^{\lambda^2\sigma^2/2 \cdot (1 - \lambda M/3)^{-1}}
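The optimization over \lambda can be checked numerically. This sketch minimizes the Chernoff objective e^{-\lambda t}\,e^{\lambda^2(b-a)^2/8} for a single variable bounded in [-1, 1] over a grid of \lambda and compares the result with the closed form e^{-2t^2/(b-a)^2} obtained analytically:

```python
import math

# Sketch of the Chernoff recipe for one variable bounded in [a, b]:
# minimize exp(-lambda * t) * exp(lambda^2 (b-a)^2 / 8), the objective built
# from Hoeffding's lemma, and compare with the analytically optimized bound.
a, b, t = -1.0, 1.0, 0.5

def chernoff_objective(lam):
    return math.exp(-lam * t) * math.exp(lam ** 2 * (b - a) ** 2 / 8)

best = min(chernoff_objective(k / 1000) for k in range(1, 5000))
closed_form = math.exp(-2 * t ** 2 / (b - a) ** 2)
print(best, closed_form)
```

The grid minimum lands at \lambda = 4t/(b-a)^2 and matches the closed form, which is exactly how the Hoeffding exponent arises.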

Canonical Examples

Example

Estimating a coin's bias with Hoeffding

Flip a coin n times. Each flip X_i \in \{0, 1\} with \mathbb{E}[X_i] = p. The sample mean \bar{X}_n estimates p. Hoeffding with a = 0, b = 1 gives:

\Pr[|\bar{X}_n - p| \geq \epsilon] \leq 2e^{-2n\epsilon^2}

For \epsilon = 0.01 and \delta = 0.05: n \geq \frac{\log(2/0.05)}{2(0.01)^2} = \frac{3.69}{0.0002} \approx 18{,}445.

If we know p \approx 0.01 (rare event), Bernstein with \sigma^2 = p(1-p) \approx 0.01 (and M = 1) gives roughly n \approx 980, nearly 20 times fewer samples. Bernstein exploits the small variance.
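These sample sizes follow from inverting each bound for n. The sketch below does the arithmetic (taking M = 1 as the crude almost-sure bound on |X_i - p|):

```python
import math

# Reproducing the coin-bias sample sizes (illustrative sketch).
eps, delta = 0.01, 0.05

# Hoeffding: 2 exp(-2 n eps^2) <= delta
n_hoef = math.ceil(math.log(2 / delta) / (2 * eps ** 2))

# Bernstein (i.i.d. form above): 2 exp(-(n eps^2 / 2) / (sigma^2 + M eps / 3)) <= delta
p = 0.01
sigma2 = p * (1 - p)
n_bern = math.ceil(2 * (sigma2 + eps / 3) * math.log(2 / delta) / eps ** 2)

print(n_hoef, n_bern)
```

The ratio is driven entirely by the denominator \sigma^2 + M\epsilon/3 replacing the worst-case (b-a)^2/4.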

Example

Comparing Markov, Chebyshev, and Hoeffding

Let X_1, \ldots, X_{100} be i.i.d. uniform on [0, 1]. We want \Pr[|\bar{X}_{100} - 0.5| \geq 0.1].

Markov (applied to |\bar{X} - 0.5|, crude): \leq \mathbb{E}[|\bar{X} - 0.5|]/0.1. Since \mathbb{E}[|\bar{X} - 0.5|] \leq \sqrt{\text{Var}(\bar{X})} = 1/\sqrt{1200} \approx 0.029, we get \leq 0.29.

Chebyshev: \leq \text{Var}(\bar{X})/0.01 = (1/12 \cdot 1/100)/0.01 \approx 0.083.

Hoeffding: \leq 2e^{-2 \cdot 100 \cdot 0.01} = 2e^{-2} \approx 0.27, which is worse than Chebyshev's 0.083.

Why: the Hoeffding exponent here is 2nt^2 = 2, so the two-sided bound is 2e^{-2} \approx 0.27, while Chebyshev's \sigma^2/(nt^2) = (1/12)/1 \approx 0.083 is tighter; at this (n, t) the exponent is simply not large enough for the exponential to beat the polynomial. Increase n and Hoeffding takes over: for n = 400 at the same t = 0.1, Hoeffding gives 2e^{-8} \approx 6.7 \times 10^{-4}, beating Chebyshev's 1/48 \approx 0.021.

No single inequality is universally best at all sample sizes.
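The crossover above can be reproduced in a few lines. This sketch (helper names are ours; \sigma^2 = 1/12 for Uniform[0, 1]) evaluates both bounds at n = 100 and n = 400:

```python
import math

# Reproducing the Uniform[0, 1] comparison (illustrative sketch).
def chebyshev(n, t, sigma2=1 / 12):
    return sigma2 / (n * t ** 2)

def hoeffding(n, t):
    return 2 * math.exp(-2 * n * t ** 2)

for n in (100, 400):
    print(f"n={n}: Chebyshev {chebyshev(n, 0.1):.4f}, Hoeffding {hoeffding(n, 0.1):.4f}")
# At n = 100 Chebyshev wins; by n = 400 Hoeffding has overtaken it.
```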

Example

Hoeffding in learning theory: the finite-class ERM bound

The ERM generalization bound for |\mathcal{H}| < \infty:

For each fixed h, the loss \ell(h(x_i), y_i) is a bounded random variable in [0, 1]. The empirical risk \hat{R}_n(h) is the sample mean. Hoeffding gives:

\Pr[|R(h) - \hat{R}_n(h)| > \epsilon] \leq 2e^{-2n\epsilon^2}

Union bound over all h \in \mathcal{H}:

\Pr[\sup_h |R(h) - \hat{R}_n(h)| > \epsilon] \leq 2|\mathcal{H}|e^{-2n\epsilon^2}

This is how Hoeffding feeds directly into generalization bounds.
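Inverting the union bound for n gives the finite-class sample complexity. The sketch below (the helper name erm_sample_size is our own) shows the logarithmic dependence on |\mathcal{H}|:

```python
import math

# Sketch: invert 2 |H| exp(-2 n eps^2) <= delta for n, the sample size that
# guarantees sup_h |R(h) - R_hat_n(h)| <= eps with probability >= 1 - delta.
def erm_sample_size(num_hypotheses, eps, delta):
    return math.ceil(math.log(2 * num_hypotheses / delta) / (2 * eps ** 2))

print(erm_sample_size(1000, 0.05, 0.05))
print(erm_sample_size(10 ** 6, 0.05, 0.05))  # |H| grows 1000x, n grows additively
```

Multiplying |\mathcal{H}| by 1000 adds only \log(1000)/(2\epsilon^2) samples, which is why finite-class bounds scale to large hypothesis classes.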

Common Confusions

Watch Out

Hoeffding requires bounded random variables, not just bounded variance

Hoeffding's inequality assumes a_i \leq X_i \leq b_i almost surely: the random variables must be literally bounded. It does not apply to Gaussian random variables (which are unbounded), even though they have finite variance. For Gaussians, you use the exact Gaussian tail \Pr[X - \mu \geq t] = \Phi(-t/\sigma), or the sub-Gaussian framework that generalizes Hoeffding. If someone applies Hoeffding to an unbounded variable, the result is invalid.

Watch Out

Bernstein is better than Hoeffding when variance is small, not always

Bernstein's bound is \exp(-nt^2/(2\sigma^2 + 2Mt/3)), while Hoeffding gives \exp(-2nt^2/(b-a)^2). If \sigma^2 \approx (b-a)^2/4 (i.e., the variance is as large as it can be for the given range), Bernstein offers little or no improvement. Bernstein shines when \sigma^2 \ll (b-a)^2: for example, when the random variable is usually near zero but has occasional large values. In the worst case over all distributions with a given range, Hoeffding and Bernstein are comparable.

Watch Out

The two-sided Hoeffding bound is 2x the one-sided bound, not squared

The correct two-sided bound is \Pr[|\bar{X} - \mu| \geq t] \leq 2e^{-2nt^2/(b-a)^2}. The factor of 2 comes from the union bound over the two tails: \Pr[|\bar{X} - \mu| \geq t] \leq \Pr[\bar{X} - \mu \geq t] + \Pr[\mu - \bar{X} \geq t]. A common error is to square the one-sided probability instead of doubling it.

Summary

  • Markov: \Pr[X \geq t] \leq \mathbb{E}[X]/t; weakest, needs only X \geq 0
  • Chebyshev: \Pr[|X - \mu| \geq t] \leq \sigma^2/t^2; uses variance, 1/t^2 decay
  • Hoeffding: \Pr[|\bar{X}_n - \mu| \geq t] \leq 2\exp(-2nt^2/(b-a)^2); exponential decay, needs boundedness
  • Bernstein: like Hoeffding but uses the variance \sigma^2; tighter when \sigma^2 \ll (b-a)^2
  • Exponential tails give \log(1/\delta) dependence in sample complexity; polynomial tails give 1/\delta
  • The Chernoff/MGF method is the universal technique for deriving exponential bounds
  • In learning theory, Hoeffding is the default; use Bernstein when you have variance information

Exercises

ExerciseCore

Problem

Using Hoeffding's inequality, how many fair coin flips do you need to estimate the probability of heads to within \pm 0.02 with probability at least 0.99?

ExerciseCore

Problem

Suppose X_1, \ldots, X_n are i.i.d. with X_i \in [0, 1], \mathbb{E}[X_i] = 0.01, and \text{Var}(X_i) = 0.01. Compare the sample sizes needed by Hoeffding and Bernstein to guarantee \Pr[|\bar{X}_n - 0.01| \geq 0.01] \leq 0.05.

ExerciseAdvanced

Problem

Prove Hoeffding's lemma: if a \leq X \leq b and \mathbb{E}[X] = 0, then for all \lambda > 0:

\mathbb{E}[e^{\lambda X}] \leq \exp\!\left(\frac{\lambda^2(b-a)^2}{8}\right)


References

Canonical:

  • Boucheron, Lugosi, Massart, Concentration Inequalities: A Nonasymptotic Theory of Independence (2013), Chapters 2-3
  • Shalev-Shwartz & Ben-David, Understanding Machine Learning, Appendix B

Current:

  • Vershynin, High-Dimensional Probability (2018), Chapters 2-3
  • Wainwright, High-Dimensional Statistics (2019), Chapter 2
  • van Handel, Probability in High Dimension (2016), Chapters 1-3


Last reviewed: April 2026
