Concentration Inequalities
Bounds on how far random variables deviate from their expectations: Markov, Chebyshev, Hoeffding, and Bernstein. Used throughout generalization theory, bandits, and sample complexity.
Why This Matters
Concentration inequalities are the backbone of statistical learning theory. Every generalization bound, every sample complexity result, every PAC-learning theorem ultimately rests on a statement of the form: "the empirical average is close to the true expectation with high probability."
When you see a bound like $R(\hat{h}) \le \hat{R}(\hat{h}) + \sqrt{\log(2|\mathcal{H}|/\delta)/(2n)}$ in ERM theory, the proof always invokes a concentration inequality. Without these tools, you cannot prove anything about learning from finite data.
[Figure: tail probability decay — $P(X > t)$ on a log scale]
These are the four inequalities you will use most often, in order of increasing power and increasing assumptions:
- Markov: requires only non-negativity
- Chebyshev: requires finite variance
- Hoeffding: requires bounded random variables
- Bernstein: requires bounded range and uses variance information
Mental Model
Think of concentration inequalities as answering one question: how likely is it that the sample average $\bar{X}_n$ deviates from $\mu$ by more than $\epsilon$?
Each inequality trades assumptions for power:
- Markov/Chebyshev give polynomial tails ($1/t$ or $1/t^2$): slow decay.
- Hoeffding/Bernstein give exponential tails ($e^{-ct^2}$): fast decay.
The exponential tail bounds are what make learning theory work. With polynomial tails, you need $n = O(1/(\delta\epsilon^2))$ samples. With exponential tails, you need $n = O(\log(1/\delta)/\epsilon^2)$: the dependence on the confidence drops inside a logarithm.
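A quick way to feel this difference is to invert both kinds of bound numerically. The sketch below is mine, not from the source: the function names are illustrative, and it assumes the worst-case variance $1/4$ for $[0,1]$-valued variables.

```python
import math

def n_chebyshev(eps, delta, var=0.25):
    # Smallest n with var / (n * eps^2) <= delta  (polynomial tail)
    return math.ceil(var / (delta * eps**2))

def n_hoeffding(eps, delta):
    # Smallest n with 2 * exp(-2 * n * eps^2) <= delta  (exponential tail)
    return math.ceil(math.log(2 / delta) / (2 * eps**2))

for delta in (1e-2, 1e-4, 1e-6):
    print(f"delta={delta:g}: Chebyshev n={n_chebyshev(0.1, delta)}, "
          f"Hoeffding n={n_hoeffding(0.1, delta)}")
```

Tightening $\delta$ from $10^{-2}$ to $10^{-6}$ multiplies the Chebyshev requirement by $10^4$ but adds only a few hundred samples under Hoeffding.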
Formal Setup and Notation
Let $X_1, \dots, X_n$ be independent random variables. We write $\bar{X}_n = \frac{1}{n}\sum_{i=1}^n X_i$ for the sample mean and $\mu = \mathbb{E}[\bar{X}_n]$ for its expectation (which equals $\mathbb{E}[X_1]$ when the $X_i$ are identically distributed).
We seek upper bounds on the tail probability:
$$P(|\bar{X}_n - \mu| \ge \epsilon)$$
or equivalently the one-sided tail $P(\bar{X}_n - \mu \ge \epsilon)$.
Core Definitions
Tail Probability
The tail probability of a random variable $X$ at level $t$ is $P(X \ge t)$ (or $P(|X - \mu| \ge t)$ for deviations around the mean). Concentration inequalities provide upper bounds on tail probabilities, showing that $X$ is unlikely to deviate far from its mean. A bound is useful when it decays rapidly in $t$.
Sub-Gaussian Behavior
A random variable exhibits sub-Gaussian behavior if its tail probability decays like $e^{-t^2/(2\sigma^2)}$ for some constant $\sigma^2$. Bounded random variables are sub-Gaussian (this is Hoeffding). The Gaussian decay rate is the "gold standard" for tail bounds. It means large deviations are exponentially unlikely.
Moment Generating Function
The moment generating function (MGF) of $X$ is $M_X(\lambda) = \mathbb{E}[e^{\lambda X}]$. The MGF method — bounding the MGF and then optimizing over $\lambda$ — is the standard technique for deriving exponential concentration inequalities. It is also called the Chernoff bounding method (Chernoff 1952); Wainwright (2019) uses this terminology. Cramér's theorem is a distinct, asymptotic result about the large-deviations rate function.
Main Theorems
Markov's Inequality
Statement
If $X \ge 0$, then for any $t > 0$:
$$P(X \ge t) \le \frac{\mathbb{E}[X]}{t}$$
Intuition
If the average value of $X$ is small, then $X$ cannot be large too often. If $\mathbb{E}[X] = 1$, then $X \ge 100$ can happen at most 1% of the time; otherwise the average would be too high.
Proof Sketch
$\mathbb{E}[X] \ge \mathbb{E}[X \cdot \mathbf{1}\{X \ge t\}] \ge t \cdot P(X \ge t)$. Divide both sides by $t$.
Why It Matters
Markov's inequality is the foundation that all other concentration inequalities build upon. Chebyshev applies Markov to $(X - \mu)^2$. Hoeffding and Bernstein apply Markov to $e^{\lambda X}$. It is the weakest bound but requires the fewest assumptions.
Failure Mode
The bound decays only as $1/t$, polynomially. This is far too slow for most applications. If $\mathbb{E}[X] = 1$, Markov gives $P(X \ge 100) \le 0.01$, independent of any variance or boundedness structure. For well-behaved distributions the true probability is astronomically smaller. Markov uses only $\mathbb{E}[X]$; it cannot see $\mathrm{Var}(X)$. Chebyshev is the first inequality that brings variance information into the picture. You almost always want Hoeffding or Bernstein when those assumptions hold.
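To see how loose Markov can be, here is a small Monte Carlo sketch of mine (sample size and seed are arbitrary) for the Exp(1) distribution, whose true tail $P(X \ge t) = e^{-t}$ is known exactly:

```python
import math
import random

random.seed(0)
n = 100_000
samples = [random.expovariate(1.0) for _ in range(n)]  # Exp(1): mean 1

t = 10.0
markov_bound = 1.0 / t              # Markov: E[X]/t with E[X] = 1
true_tail = math.exp(-t)            # exact: P(X >= 10) = e^{-10}, about 4.5e-5
empirical = sum(x >= t for x in samples) / n

print(markov_bound, true_tail, empirical)
```

Markov's bound of 0.1 is off by more than three orders of magnitude here, precisely because it ignores everything about the distribution beyond its mean.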
Chebyshev's Inequality
Statement
For any random variable $X$ with mean $\mu$ and variance $\sigma^2$, for any $t > 0$:
$$P(|X - \mu| \ge t) \le \frac{\sigma^2}{t^2}$$
For the sample mean $\bar{X}_n$ of $n$ i.i.d. variables with variance $\sigma^2$:
$$P(|\bar{X}_n - \mu| \ge \epsilon) \le \frac{\sigma^2}{n\epsilon^2}$$
Intuition
Chebyshev uses variance information: if the spread of $X$ around its mean is small ($\sigma^2$ is small), then large deviations are unlikely. Applied to the sample mean, the variance decreases as $\sigma^2/n$, giving the familiar $1/n$ concentration.
Proof Sketch
Apply Markov's inequality to the non-negative random variable $(X - \mu)^2$:
$$P(|X - \mu| \ge t) = P((X - \mu)^2 \ge t^2) \le \frac{\mathbb{E}[(X - \mu)^2]}{t^2} = \frac{\sigma^2}{t^2}$$
Why It Matters
Chebyshev gives the first quantitative form of the law of large numbers: with probability $1 - \delta$, $|\bar{X}_n - \mu| \le \sigma/\sqrt{n\delta}$. This requires $n \ge \sigma^2/(\delta\epsilon^2)$ for an $\epsilon$-accurate estimate. Note the $1/\delta$ dependence, not $\log(1/\delta)$.
Failure Mode
The decay is still polynomial. Chebyshev cannot give you the $\log(1/\delta)$ dependence needed for efficient learning. For bounded random variables, Hoeffding gives exponentially better tails. Chebyshev is most useful when you know the variance but nothing about higher moments or boundedness.
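The $\sigma^2/(n\epsilon^2)$ bound for the sample mean is easy to check empirically. A minimal sketch of mine (function names and the repetition count are arbitrary) for Uniform$[0,1]$ variables, which have $\mu = 0.5$ and $\sigma^2 = 1/12$:

```python
import random

def chebyshev_mean_bound(var, n, eps):
    # Chebyshev bound on P(|mean of n i.i.d. samples - mu| >= eps)
    return min(1.0, var / (n * eps**2))

def deviation_freq(n, eps=0.1, reps=2000):
    # Monte Carlo frequency of |sample mean - 0.5| >= eps for Uniform[0, 1]
    hits = 0
    for _ in range(reps):
        mean = sum(random.random() for _ in range(n)) / n
        hits += abs(mean - 0.5) >= eps
    return hits / reps

random.seed(1)
for n in (10, 100):
    print(n, chebyshev_mean_bound(1 / 12, n, 0.1), deviation_freq(n))
```

The bound shrinks as $1/n$, and the empirical frequency sits well below it: Chebyshev is valid but far from tight for this distribution.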
Hoeffding's Inequality
Statement
Let $X_1, \dots, X_n$ be independent random variables with $a_i \le X_i \le b_i$ almost surely. Then for any $\epsilon > 0$:
$$P\left(\bar{X}_n - \mathbb{E}[\bar{X}_n] \ge \epsilon\right) \le \exp\left(-\frac{2n^2\epsilon^2}{\sum_{i=1}^n (b_i - a_i)^2}\right)$$
For the two-sided version:
$$P\left(\left|\bar{X}_n - \mathbb{E}[\bar{X}_n]\right| \ge \epsilon\right) \le 2\exp\left(-\frac{2n^2\epsilon^2}{\sum_{i=1}^n (b_i - a_i)^2}\right)$$
Special case: if all $X_i \in [a, b]$ (identically bounded):
$$P\left(|\bar{X}_n - \mu| \ge \epsilon\right) \le 2\exp\left(-\frac{2n\epsilon^2}{(b - a)^2}\right)$$
Intuition
Bounded random variables have sub-Gaussian tails. The width $b_i - a_i$ of the bounding interval controls the "effective variance" of each variable. The bound says: the probability of a deviation of size $\epsilon$ decays exponentially in $n\epsilon^2$, with the rate determined by how spread out the bounds are.
Proof Sketch
The proof uses the Chernoff/MGF method:
- For any $\lambda > 0$: $P(\bar{X}_n - \mu \ge \epsilon) \le e^{-\lambda\epsilon}\, \mathbb{E}\left[e^{\lambda(\bar{X}_n - \mu)}\right]$
- By independence: $\mathbb{E}\left[e^{\lambda(\bar{X}_n - \mu)}\right] = \prod_{i=1}^n \mathbb{E}\left[e^{\lambda(X_i - \mathbb{E}[X_i])/n}\right]$
- Hoeffding's lemma: If $a \le X \le b$, then $\mathbb{E}\left[e^{\lambda(X - \mathbb{E}[X])}\right] \le e^{\lambda^2(b - a)^2/8}$
- Combine and optimize over $\lambda$ (in the identically bounded case, set $\lambda = 4n\epsilon/(b - a)^2$).
Why It Matters
Hoeffding is the workhorse of learning theory. The ERM generalization bound for finite classes, the uniform convergence bound, and many other results use Hoeffding at their core. The exponential tail gives $n = O(\log(1/\delta)/\epsilon^2)$ sample complexity. The $\log(1/\delta)$ dependence is what makes high-confidence bounds feasible.
Failure Mode
Hoeffding uses only the range $[a_i, b_i]$, not the actual variance. If the variance is much smaller than $(b - a)^2/4$ (e.g., a variable that is usually 0 but can be as large as 1), Hoeffding is loose. Bernstein fixes this by incorporating variance information.
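The two-sided bound is simple to implement and compare against simulation. A sketch of mine (the fair-coin setting and all parameter values are illustrative):

```python
import math
import random

def hoeffding_two_sided(n, eps, a=0.0, b=1.0):
    # Two-sided Hoeffding bound for the mean of n i.i.d. variables in [a, b]
    return min(1.0, 2 * math.exp(-2 * n * eps**2 / (b - a)**2))

random.seed(2)
n, eps, reps = 100, 0.1, 5000
hits = 0
for _ in range(reps):
    mean = sum(random.random() < 0.5 for _ in range(n)) / n  # 100 fair flips
    hits += abs(mean - 0.5) >= eps
print(hoeffding_two_sided(n, eps), hits / reps)
```

The bound $2e^{-2} \approx 0.27$ is valid but conservative; the true deviation probability for this binomial is about 0.06.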
Bernstein's Inequality
Statement
Let $X_1, \dots, X_n$ be independent with $\mathbb{E}[X_i] = \mu$, $\mathrm{Var}(X_i) = \sigma_i^2$, and $|X_i - \mu| \le M$ almost surely. Then for any $t > 0$:
$$P\left(\sum_{i=1}^n (X_i - \mu) \ge t\right) \le \exp\left(-\frac{t^2/2}{\sum_{i=1}^n \sigma_i^2 + Mt/3}\right)$$
For the sample mean with i.i.d. variables ($\sigma_i^2 = \sigma^2$):
$$P\left(\bar{X}_n - \mu \ge \epsilon\right) \le \exp\left(-\frac{n\epsilon^2/2}{\sigma^2 + M\epsilon/3}\right)$$
Intuition
Bernstein interpolates between two regimes. For small deviations ($\epsilon \ll \sigma^2/M$), the $\sigma^2$ term dominates the denominator and the bound looks like $e^{-n\epsilon^2/(2\sigma^2)}$, a variance-dependent Gaussian tail. For large deviations ($\epsilon \gg \sigma^2/M$), the $M\epsilon/3$ term dominates and the bound looks like $e^{-3n\epsilon/(2M)}$, a sub-exponential tail. The variance term makes Bernstein much tighter than Hoeffding when the variance is small relative to the range.
Proof Sketch
Like Hoeffding, use the Chernoff method. The key improvement is a tighter bound on the MGF:
- If $\mathbb{E}[X] = 0$, $\mathrm{Var}(X) = \sigma^2$, and $|X| \le M$, then $\mathbb{E}[e^{\lambda X}] \le \exp\left(\frac{\lambda^2\sigma^2/2}{1 - \lambda M/3}\right)$ for $0 < \lambda < 3/M$.
- This bound uses the variance $\sigma^2$ instead of the crude range bound $(b - a)^2/8$ that Hoeffding uses.
- Multiply over independent variables and optimize $\lambda$.
Why It Matters
Bernstein is the right inequality when you have variance information. In learning theory, this matters for sparse or low-noise settings. For example, if the loss is usually close to zero (low empirical risk, low variance), Bernstein gives much tighter bounds than Hoeffding, because Hoeffding only knows the loss is in $[0, 1]$ while Bernstein knows the variance is small.
Failure Mode
Bernstein requires knowing (or bounding) the variance, which is not always available. If you only know the range $[a, b]$, Hoeffding is simpler and sufficient. In practice, you can sometimes use an empirical estimate of the variance and apply Bernstein, but this introduces additional technical complications (you need a concentration bound on the variance estimate itself).
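The two one-sided bounds can be compared directly. A sketch of mine (the Bernoulli(0.01) setting and all numbers are illustrative) for a case where the variance is far smaller than the range:

```python
import math

def hoeffding_one_sided(n, eps, range_=1.0):
    # Hoeffding: uses only the range of the variables
    return math.exp(-2 * n * eps**2 / range_**2)

def bernstein_one_sided(n, eps, var, M=1.0):
    # Bernstein: uses the variance plus an almost-sure bound M on |X - mu|
    return math.exp(-(n * eps**2 / 2) / (var + M * eps / 3))

# Bernoulli(0.01): range 1, but variance only 0.01 * 0.99
n, eps, var = 1000, 0.02, 0.01 * 0.99
print(hoeffding_one_sided(n, eps), bernstein_one_sided(n, eps, var))
```

With $n = 1000$ and $\epsilon = 0.02$, Hoeffding's bound is a nearly useless $\approx 0.45$, while Bernstein's is below $10^{-5}$: the variance term does all the work.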
Proof Ideas and Templates Used
All four inequalities follow a common escalation pattern:
- Markov's trick (the universal first step): convert a tail probability into an expectation bound via $P(X \ge t) \le \mathbb{E}[\phi(X)]/\phi(t)$ for any non-negative increasing $\phi$.
- Moment method: apply Markov to $|X - \mu|^k$ to get $P(|X - \mu| \ge t) \le \mathbb{E}[|X - \mu|^k]/t^k$. Chebyshev is the $k = 2$ case.
- MGF/Chernoff method: apply Markov to $e^{\lambda X}$ for optimal $\lambda$. This is exponentially stronger because $e^{\lambda t}$ grows much faster than any polynomial. Both Hoeffding and Bernstein use this.
The key lemma in each exponential bound is controlling the MGF $\mathbb{E}\left[e^{\lambda(X - \mathbb{E}[X])}\right]$:
- Hoeffding's lemma bounds it using only boundedness: $\mathbb{E}\left[e^{\lambda(X - \mathbb{E}[X])}\right] \le e^{\lambda^2(b - a)^2/8}$
- Bernstein's condition bounds it using variance: $\mathbb{E}[e^{\lambda X}] \le \exp\left(\frac{\lambda^2\sigma^2/2}{1 - \lambda M/3}\right)$ for centered $X$ with $|X| \le M$ and $0 < \lambda < 3/M$
Canonical Examples
Estimating a coin's bias with Hoeffding
Flip a coin $n$ times. Each flip $X_i \in \{0, 1\}$ with $P(X_i = 1) = p$. The sample mean $\bar{X}_n$ estimates $p$. Hoeffding with $b - a = 1$ gives:
$$P(|\bar{X}_n - p| \ge \epsilon) \le 2e^{-2n\epsilon^2}$$
For $\epsilon = 0.01$ and $\delta = 0.05$: $n \ge \frac{\log(2/\delta)}{2\epsilon^2} \approx 18{,}445$.
If we know $p \le 0.01$ (rare event), Bernstein with $\sigma^2 = p(1-p) \le 0.0099$ gives roughly $n \ge \frac{2(\sigma^2 + \epsilon/3)\log(2/\delta)}{\epsilon^2} \approx 1{,}000$ — many times fewer samples. Bernstein exploits the small variance.
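These sample sizes can be reproduced numerically. A sketch of mine, assuming tolerance $\epsilon = 0.01$, confidence $\delta = 0.05$, and the two-sided Bernstein form with $|X_i - \mu| \le M = 1$ (function names are illustrative):

```python
import math

def n_hoeffding(eps, delta):
    # Smallest n with 2 * exp(-2 * n * eps^2) <= delta
    return math.ceil(math.log(2 / delta) / (2 * eps**2))

def n_bernstein(eps, delta, var, M=1.0):
    # Smallest n with 2 * exp(-(n * eps^2 / 2) / (var + M * eps / 3)) <= delta
    return math.ceil(2 * (var + M * eps / 3) * math.log(2 / delta) / eps**2)

eps, delta = 0.01, 0.05
print(n_hoeffding(eps, delta))                   # ~18,445 flips, no assumption on p
print(n_bernstein(eps, delta, var=0.01 * 0.99))  # far fewer when p <= 0.01
```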
Comparing Markov, Chebyshev, and Hoeffding
Let $X_1, \dots, X_n$ be i.i.d. uniform on $[0, 1]$, with $n = 100$. We want to bound $P(|\bar{X}_n - 0.5| \ge 0.1)$.
Markov (applied to $\bar{X}_n$ directly, crude): $P(\bar{X}_n \ge 0.6) \le \mathbb{E}[\bar{X}_n]/0.6$. Since $\mathbb{E}[\bar{X}_n] = 0.5$, we get $P(\bar{X}_n \ge 0.6) \le 0.83$.
Chebyshev: $P(|\bar{X}_n - 0.5| \ge 0.1) \le \frac{1/12}{100 \cdot 0.01} = \frac{1}{12} \approx 0.083$.
Hoeffding: $P(|\bar{X}_n - 0.5| \ge 0.1) \le 2e^{-2 \cdot 100 \cdot 0.01} = 2e^{-2}$. Hoeffding gives 0.27, which is worse than Chebyshev's 0.083.
Why: the Hoeffding exponent here is $2n\epsilon^2 = 2$, so $e^{-2} \approx 0.135$ and the two-sided bound is $2e^{-2} \approx 0.27$. Chebyshev's $\frac{\sigma^2}{n\epsilon^2}$ is tighter because the exponent is simply not large enough for $e^{-2n\epsilon^2}$ to beat $\frac{1}{12n\epsilon^2}$ at this $n$. Hoeffding's exponential form dominates once $n\epsilon^2$ is sufficiently larger than 1. For $n = 1000$ at the same $\epsilon$, Hoeffding gives $2e^{-20} \approx 4 \times 10^{-9}$, beating Chebyshev's $\approx 0.0083$.
No single inequality is universally best at all sample sizes.
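This comparison is easy to automate. A sketch of mine (function names are illustrative) evaluating both bounds for the uniform example at $\epsilon = 0.1$:

```python
import math

def chebyshev_bound(n, eps, var):
    # Chebyshev bound on the two-sided deviation of the sample mean
    return min(1.0, var / (n * eps**2))

def hoeffding_bound(n, eps):
    # Two-sided Hoeffding bound, variables in [0, 1]
    return min(1.0, 2 * math.exp(-2 * n * eps**2))

var_uniform = 1 / 12  # variance of Uniform[0, 1]
for n in (100, 1000):
    print(n, chebyshev_bound(n, 0.1, var_uniform), hoeffding_bound(n, 0.1))
```

At $n = 100$ Chebyshev wins ($0.083$ vs $0.27$); at $n = 1000$ Hoeffding's exponential decay takes over ($\sim 4 \times 10^{-9}$ vs $0.0083$).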
Hoeffding in learning theory: the finite-class ERM bound
The ERM generalization bound for a finite hypothesis class $\mathcal{H}$: with probability at least $1 - \delta$,
$$R(h) \le \hat{R}(h) + \sqrt{\frac{\log(2|\mathcal{H}|/\delta)}{2n}} \quad \text{for all } h \in \mathcal{H}$$
For each fixed $h$, the loss on a random example is a bounded random variable in $[0, 1]$. The empirical risk $\hat{R}(h)$ is the sample mean of $n$ such losses. Hoeffding gives:
$$P\left(|\hat{R}(h) - R(h)| \ge \epsilon\right) \le 2e^{-2n\epsilon^2}$$
Union bound over all $h \in \mathcal{H}$:
$$P\left(\exists h \in \mathcal{H}: |\hat{R}(h) - R(h)| \ge \epsilon\right) \le 2|\mathcal{H}|e^{-2n\epsilon^2}$$
Setting the right-hand side to $\delta$ and solving for $\epsilon$ gives the bound above.
This is how Hoeffding feeds directly into generalization bounds.
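Solving $2|\mathcal{H}|e^{-2n\epsilon^2} = \delta$ for $\epsilon$ gives a concrete generalization gap. A sketch of mine (function name and example numbers are illustrative):

```python
import math

def finite_class_gap(num_hypotheses, n, delta):
    # eps such that 2 * |H| * exp(-2 * n * eps^2) = delta
    return math.sqrt(math.log(2 * num_hypotheses / delta) / (2 * n))

# e.g. one million hypotheses, 10,000 samples, 95% confidence
print(finite_class_gap(10**6, 10_000, 0.05))  # ~0.03
```

The gap grows only logarithmically in $|\mathcal{H}|$, which is why finite-class bounds remain usable even for enormous hypothesis classes.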
Common Confusions
Hoeffding requires bounded random variables, not just bounded variance
Hoeffding's inequality assumes $a_i \le X_i \le b_i$ almost surely: the random variables must be literally bounded. It does not apply to Gaussian random variables (which are unbounded), even though they have finite variance. For Gaussians, you use the exact sub-Gaussian tail $P(X - \mu \ge t) \le e^{-t^2/(2\sigma^2)}$, or you can use the sub-Gaussian framework that generalizes Hoeffding. If someone applies Hoeffding to an unbounded variable, the result is invalid.
Bernstein is better than Hoeffding when variance is small, not always
Bernstein's bound is $\exp\left(-\frac{n\epsilon^2/2}{\sigma^2 + M\epsilon/3}\right)$, while Hoeffding gives $\exp\left(-\frac{2n\epsilon^2}{(b - a)^2}\right)$. If $\sigma^2 \approx (b - a)^2/4$ (i.e., the variance is as large as it can be for the given range), Bernstein offers little or no improvement. Bernstein shines when $\sigma^2 \ll (b - a)^2$ — for example, when the random variable is usually near zero but has occasional large values. In the worst case over all distributions with a given range, Hoeffding and Bernstein are comparable.
The two-sided Hoeffding bound is 2x the one-sided bound, not squared
A common error: writing $P(|\bar{X}_n - \mu| \ge \epsilon) \le \left(e^{-2n\epsilon^2}\right)^2$. The factor of 2 comes from the union bound over the two tails: $P(|Z| \ge \epsilon) \le P(Z \ge \epsilon) + P(Z \le -\epsilon)$. Students sometimes confuse this with squaring the one-sided probability.
Summary
- Markov: $P(X \ge t) \le \mathbb{E}[X]/t$; weakest, needs only $X \ge 0$
- Chebyshev: $P(|X - \mu| \ge t) \le \sigma^2/t^2$; uses variance, $1/t^2$ decay
- Hoeffding: $P(|\bar{X}_n - \mu| \ge \epsilon) \le 2e^{-2n\epsilon^2/(b-a)^2}$; exponential decay, needs boundedness
- Bernstein: like Hoeffding but uses variance $\sigma^2$; tighter when $\sigma^2 \ll (b - a)^2$
- Exponential tails give $\log(1/\delta)$ dependence in sample complexity; polynomial tails give $1/\delta$
- The Chernoff/MGF method is the universal technique for deriving exponential bounds
- In learning theory, Hoeffding is the default; use Bernstein when you have variance information
Exercises
Problem
Using Hoeffding's inequality, how many fair coin flips do you need to estimate the probability of heads to within a tolerance $\epsilon$ with probability at least 0.99? Give the answer as a function of $\epsilon$.
Problem
Suppose $X_1, \dots, X_n$ are i.i.d. with mean $\mu$, variance $\sigma^2$, and $|X_i - \mu| \le M$ almost surely. Compare the sample sizes needed by Hoeffding and Bernstein to guarantee $|\bar{X}_n - \mu| \le \epsilon$ with probability at least $1 - \delta$.
Problem
Prove Hoeffding's lemma: if $\mathbb{E}[X] = 0$ and $a \le X \le b$, then for all $\lambda \in \mathbb{R}$:
$$\mathbb{E}[e^{\lambda X}] \le e^{\lambda^2(b - a)^2/8}$$
References
Canonical:
- Boucheron, Lugosi, Massart, Concentration Inequalities: A Nonasymptotic Theory of Independence (2013), Chapters 2-3
- Shalev-Shwartz & Ben-David, Understanding Machine Learning, Appendix B
Current:
- Vershynin, High-Dimensional Probability (2018), Chapters 2-3
- Wainwright, High-Dimensional Statistics (2019), Chapter 2
- van Handel, Probability in High Dimension (2016), Chapters 1-3
Next Topics
Building on these basic inequalities:
- Sub-Gaussian random variables: the general framework unifying Hoeffding-type bounds
- Sub-exponential random variables: handling heavier tails (chi-squared, exponential distributions)
- McDiarmid's inequality: concentration for functions of independent variables, not just sums
- Contraction inequality: how Lipschitz transformations preserve concentration
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
Builds on This
- Algorithmic Stability (Layer 3)
- Chernoff Bounds (Layer 1)
- Empirical Risk Minimization (Layer 2)
- Epsilon-Nets and Covering Numbers (Layer 3)
- Markov Decision Processes (Layer 2)
- Matrix Concentration (Layer 3)
- McDiarmid's Inequality (Layer 3)
- Minimax Lower Bounds (Layer 3)
- PAC Learning Framework (Layer 1)
- Rademacher Complexity (Layer 3)
- Stochastic Gradient Descent Convergence (Layer 2)
- Sub-Exponential Random Variables (Layer 2)
- Sub-Gaussian Random Variables (Layer 2)
- Symmetrization Inequality (Layer 3)
- VC Dimension (Layer 2)