Hoeffding's Lemma
The MGF bound that powers Hoeffding's inequality: a centered random variable on $[a,b]$ has a sub-Gaussian moment generating function with parameter $(b-a)^2/4$.
Why This Matters
Hoeffding's inequality appears in the sample-complexity bound of every finite-class PAC argument on this site, and it is a one-line corollary of a single MGF estimate: Hoeffding's lemma. The lemma is the deeper result. It says that any zero-mean random variable supported on $[a,b]$ has a sub-Gaussian moment generating function with explicit parameter $(b-a)^2/4$, and the proof exposes exactly which two facts do the work: the convexity of $x \mapsto e^{\lambda x}$ and a curvature bound on a specific log-sum-exp function.
Once you internalize the lemma, every bounded-variable concentration statement on the site reduces to:
- write down the MGF bound from Hoeffding's lemma,
- multiply across independent (or martingale) summands,
- run the Chernoff method and optimize.
The $\lambda^2 (b-a)^2/8$ bound in the exponent is not folklore. It is the exact best constant achievable from convexity plus a second-derivative bound of $1/4$, and tracking where that $1/4$ comes from is what makes the proof memorable.
Quick Version
For a centered random variable $X \in [a,b]$ almost surely,

$$\mathbb{E}[e^{\lambda X}] \leq \exp\!\left(\frac{\lambda^2 (b-a)^2}{8}\right) \quad \text{for all } \lambda \in \mathbb{R}.$$
In sub-Gaussian language: $X$ is sub-Gaussian with proxy variance $\sigma^2 = (b-a)^2/4$. The constant is sharp in this convexity-based proof; the $1/4$ in the proxy variance is the maximum of $p(1-p)$ over $p \in [0,1]$, and it appears as the Bernoulli worst case.
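A quick numerical sanity check, as a sketch: sample two centered distributions whose support has width $b - a$ and confirm that their MGFs stay below the bound. The distributions, seed, and grid of $\lambda$ values here are illustrative assumptions, not anything prescribed by the lemma.

```python
import numpy as np

rng = np.random.default_rng(0)
a, b = -1.0, 2.0

def bound(lam):
    # Hoeffding's lemma: E[exp(lam * X)] <= exp(lam^2 (b-a)^2 / 8).
    return np.exp(lam**2 * (b - a) ** 2 / 8)

# Two centered laws whose support has width b - a = 3 (illustrative choices):
# a centered uniform (support [-1.5, 1.5], same width) and a three-point law on [a, b].
uniform = rng.uniform(a, b, 200_000) - (a + b) / 2
three_point = rng.choice([-1.0, 0.0, 2.0], p=[0.4, 0.4, 0.2], size=200_000)

for lam in [-2.0, -0.5, 0.5, 2.0]:
    for name, x in [("uniform", uniform), ("three-point", three_point)]:
        mgf = np.exp(lam * x).mean()            # empirical E[exp(lam * X)]
        print(f"{name:12s} lam={lam:+.1f}  MGF={mgf:7.3f}  bound={bound(lam):7.3f}")
        assert mgf <= bound(lam)
```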
Mental Model
Three observations make the proof natural rather than mysterious.
- Convexity of $e^{\lambda x}$ controls the worst-case MGF. Among all distributions on $[a,b]$ with a fixed mean, the one that maximizes $\mathbb{E}[e^{\lambda X}]$ is supported on the two endpoints $\{a, b\}$. Convexity of $x \mapsto e^{\lambda x}$ implies that mass at an interior point is always dominated by the corresponding two-point mass.
- The two-point worst case is a tilted Bernoulli. After centering, the worst-case distribution puts mass $\frac{b}{b-a}$ at $a$ and mass $\frac{-a}{b-a}$ at $b$. Its log-MGF is the function $\varphi(u) = -pu + \log(1 - p + p e^u)$ with $p = \frac{-a}{b-a}$ and $u = \lambda(b-a)$.
- Curvature is bounded by $1/4$. A short calculation shows $\varphi''(u) = \pi(u)(1 - \pi(u))$ where $\pi(u) = \frac{p e^u}{1 - p + p e^u}$, so $\varphi''(u) \leq 1/4$ uniformly in $u$ and $p$. Taylor's theorem with remainder then gives $\varphi(u) \leq u^2/8$, and substituting $u = \lambda(b-a)$ recovers Hoeffding's lemma exactly; a numerical check of this curvature bound follows the list.
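A minimal sketch of that check, assuming nothing beyond the formulas above: evaluate $\varphi''$ and $\varphi$ on a grid of $(p, u)$ pairs (the grid itself is an arbitrary choice) and confirm $\varphi'' \leq 1/4$ and $\varphi(u) \leq u^2/8$.

```python
import numpy as np

def phi(u, p):
    """Log-MGF of the standardized two-point worst case."""
    return -p * u + np.log(1 - p + p * np.exp(u))

def phi_pp(u, p):
    """Second derivative: pi(u) * (1 - pi(u)) with pi(u) = p e^u / (1 - p + p e^u)."""
    pi = p * np.exp(u) / (1 - p + p * np.exp(u))
    return pi * (1 - pi)

U, P = np.meshgrid(np.linspace(-10, 10, 2001), np.linspace(0.01, 0.99, 99))
assert phi_pp(U, P).max() <= 0.25 + 1e-12      # curvature bound phi'' <= 1/4
assert np.all(phi(U, P) <= U**2 / 8 + 1e-12)   # twice-integrated form phi(u) <= u^2/8
print("max phi'' on grid =", phi_pp(U, P).max())  # ~0.25, attained where pi = 1/2
```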
Recall: convexity of $e^{\lambda x}$. For any convex $f$ and $\theta \in [0,1]$, $f(\theta a + (1-\theta) b) \leq \theta f(a) + (1-\theta) f(b)$. Below we use $\theta = \frac{b-x}{b-a}$ so that $x = \theta a + (1-\theta) b$ and apply this to $f(x) = e^{\lambda x}$.
Formal Setup
Let $X$ be a real-valued random variable with $\mathbb{E}[X] = 0$ and $X \in [a,b]$ almost surely, where $a < b$ (the centered case forces $a < 0 < b$ unless $X$ is degenerate at zero). Define the standardized parameter $u = \lambda(b-a)$ and the convex weight $p = \frac{-a}{b-a}$.
Hoeffding's Lemma
Statement
If $\mathbb{E}[X] = 0$ and $X \in [a,b]$ almost surely, then for every $\lambda \in \mathbb{R}$:

$$\mathbb{E}[e^{\lambda X}] \leq \exp\!\left(\frac{\lambda^2 (b-a)^2}{8}\right)$$
Exact statement
LaTeX source for copy/export
\mathbb{E}[e^{\lambda X}] \leq \exp\!\left(\frac{\lambda^2 (b-a)^2}{8}\right)
Intuition
Bounded support gives a uniform second-derivative bound on the log-MGF; that bound integrates twice to give the displayed factor. The lemma compresses everything specific about the distribution into a single range parameter $b - a$, which is why Hoeffding-type bounds care only about support width, not about the exact shape of the variable.
Proof Sketch
Step 1: convexity bound at the sample level. Since $x \in [a,b]$, write $x = \theta a + (1-\theta) b$ with $\theta = \frac{b-x}{b-a} \in [0,1]$. Convexity of $e^{\lambda x}$ gives, pointwise,

$$e^{\lambda x} \leq \frac{b-x}{b-a}\, e^{\lambda a} + \frac{x-a}{b-a}\, e^{\lambda b}.$$
Step 2: take expectations. Using $\mathbb{E}[X] = 0$,

$$\mathbb{E}[e^{\lambda X}] \leq \frac{b}{b-a}\, e^{\lambda a} + \frac{-a}{b-a}\, e^{\lambda b}.$$
Step 3: rewrite in standardized form. Let $u = \lambda(b-a)$ and $p = \frac{-a}{b-a}$. The right-hand side becomes $e^{\varphi(u)}$ with

$$\varphi(u) = -pu + \log\!\left(1 - p + p e^u\right).$$
Step 4: bound $\varphi(u)$ by $u^2/8$. Differentiating,

$$\varphi'(u) = -p + \frac{p e^u}{1 - p + p e^u}, \qquad \varphi(0) = \varphi'(0) = 0.$$
A second differentiation gives $\varphi''(u) = \pi(u)(1 - \pi(u)) \leq \frac{1}{4}$ with $\pi(u) = \frac{p e^u}{1 - p + p e^u}$, since $t(1-t)$ is maximized at $t = \frac{1}{2}$. Combined with $\varphi(0) = 0$ and $\varphi'(0) = 0$, Taylor's theorem with integral remainder yields

$$\varphi(u) \leq \frac{u^2}{8}.$$
Step 5: substitute back. Replacing $u$ by $\lambda(b-a)$ and exponentiating gives the stated bound.
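To make the chain concrete, here is a numerical walk-through under assumed, illustrative choices (a three-point law on $[-1, 2]$ with mean zero): the exact MGF is dominated by the two-point quantity $e^{\varphi(u)}$ from steps 1-3, which is in turn dominated by $e^{u^2/8}$ from steps 4-5.

```python
import numpy as np

a, b = -1.0, 2.0
# Illustrative centered law on [a, b]: P(X=-1)=0.4, P(X=0)=0.4, P(X=2)=0.2.
xs, ps = np.array([-1.0, 0.0, 2.0]), np.array([0.4, 0.4, 0.2])
assert abs((xs * ps).sum()) < 1e-12             # mean zero

p = -a / (b - a)                                # Bernoulli weight of the worst case
for lam in np.linspace(-2, 2, 9):
    u = lam * (b - a)
    mgf = (ps * np.exp(lam * xs)).sum()                   # exact E[exp(lam X)]
    two_point = np.exp(-p * u) * (1 - p + p * np.exp(u))  # e^{phi(u)}, steps 1-3
    hoeffding = np.exp(u**2 / 8)                          # e^{u^2/8}, steps 4-5
    print(f"lam={lam:+.1f}  MGF={mgf:7.3f}  two-point={two_point:7.3f}  bound={hoeffding:7.3f}")
    assert mgf <= two_point + 1e-12 <= hoeffding + 1e-12
```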
Why It Matters
Every Hoeffding-style concentration statement on this site, including the finite-sum form, the sample-mean form, the Azuma-Hoeffding martingale extension, McDiarmid's bounded-differences inequality, and the finite-class ERM generalization bound, plugs Hoeffding's lemma into the Chernoff method at the per-summand step.
The lemma also says $X$ is sub-Gaussian with proxy variance $(b-a)^2/4$. That places it inside the sub-Gaussian framework, where additivity of proxy variances under independent sums is built in.
Failure Mode
The lemma needs an almost-sure boundedness assumption. For unbounded variables with finite variance the bound is false in general; use a sub-Gaussian or sub-exponential MGF estimate instead. The constant $\frac{1}{8}$ is sharp for this convexity-based proof but is not the best possible for every distribution: a symmetric Bernoulli on the endpoints saturates it, while a uniform on $[a,b]$ has a strictly smaller proxy variance under the variance-aware form ($(b-a)^2/12$ instead of $(b-a)^2/4$).
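A short numerical check of that sharpness claim, again as a sketch with illustrative parameters: for the uniform on $[-c, c]$ the exact MGF $\sinh(\lambda c)/(\lambda c)$ sits below the variance-aware bound $e^{\lambda^2 c^2/6}$, which is strictly below Hoeffding's $e^{\lambda^2 c^2/2}$; the symmetric Bernoulli on $\{\pm c\}$ has MGF $\cosh(\lambda c)$, which matches $e^{\lambda^2 c^2/2}$ to second order at $\lambda = 0$.

```python
import numpy as np

c = 1.0
lams = np.linspace(-3, 3, 121)
lams = lams[lams != 0]                          # sinh(x)/x needs x != 0

uniform_mgf = np.sinh(lams * c) / (lams * c)    # exact MGF of Unif[-c, c]
variance_aware = np.exp(lams**2 * c**2 / 6)     # proxy variance (2c)^2/12 = c^2/3
hoeffding = np.exp(lams**2 * c**2 / 2)          # proxy variance (2c)^2/4  = c^2

assert np.all(uniform_mgf <= variance_aware)    # uniform beats Hoeffding's proxy
assert np.all(variance_aware <= hoeffding)

bernoulli_mgf = np.cosh(lams * c)               # exact MGF of +-c w.p. 1/2 each
assert np.all(bernoulli_mgf <= hoeffding)       # saturated to second order at 0
```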
From the Lemma to Hoeffding's Inequality
The lemma is the only nontrivial step in the proof of Hoeffding's inequality. The rest is the Chernoff method plus independence.
Hoeffding's Inequality from the Lemma
Statement
Let $X_1, \dots, X_n$ be independent with $X_i \in [a_i, b_i]$ almost surely, and let $\bar{X}_n = \frac{1}{n}\sum_{i=1}^n X_i$ with $\mu = \mathbb{E}[\bar{X}_n]$. For every $t > 0$:

$$\Pr\!\left[\bar{X}_n - \mu \geq t\right] \leq \exp\!\left(-\frac{2 n^2 t^2}{\sum_{i=1}^n (b_i - a_i)^2}\right)$$
A union bound over the two tails gives

$$\Pr\!\left[\left|\bar{X}_n - \mu\right| \geq t\right] \leq 2\exp\!\left(-\frac{2 n^2 t^2}{\sum_{i=1}^n (b_i - a_i)^2}\right).$$
For the special case $X_i \in [a,b]$ identically bounded, the bound collapses to $\exp\!\left(-\frac{2 n t^2}{(b-a)^2}\right)$, and for $X_i \in [0,1]$ to $\exp(-2 n t^2)$.
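As a sketch (the distribution, seed, and parameters are illustrative assumptions), a Monte Carlo estimate of the one-sided tail for i.i.d. $[0,1]$ variables stays below $e^{-2nt^2}$:

```python
import numpy as np

rng = np.random.default_rng(1)
n, t, trials = 50, 0.15, 100_000

# i.i.d. Uniform[0,1] samples; mu = 0.5.
means = rng.uniform(0, 1, size=(trials, n)).mean(axis=1)
tail = (means - 0.5 >= t).mean()                # empirical Pr[mean - mu >= t]
bound = np.exp(-2 * n * t**2)                   # Hoeffding with (b - a) = 1

print(f"empirical tail ~ {tail:.5f}, Hoeffding bound = {bound:.5f}")
assert tail <= bound
```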
Exact statement
LaTeX source for copy/export
\Pr\!\left[\bar{X}_n - \mu \geq t\right] \leq \exp\!\left(-\frac{2 n^2 t^2}{\sum_{i=1}^n (b_i - a_i)^2}\right)
Proof Sketch
Step 1: center. Set $Y_i = X_i - \mathbb{E}[X_i]$. Each $Y_i$ has zero mean and lies in $[a_i - \mathbb{E}[X_i],\, b_i - \mathbb{E}[X_i]]$, which is an interval of width $b_i - a_i$.
Step 2: Chernoff for the centered sum. For any $\lambda > 0$,

$$\Pr\!\left[\sum_{i=1}^n Y_i \geq nt\right] \leq e^{-\lambda n t}\, \mathbb{E}\!\left[e^{\lambda \sum_{i=1}^n Y_i}\right] = e^{-\lambda n t} \prod_{i=1}^n \mathbb{E}\!\left[e^{\lambda Y_i}\right],$$
using independence to factor the MGF.
Step 3: apply Hoeffding's lemma to each factor. Each $Y_i$ is centered on an interval of width $b_i - a_i$, so

$$\mathbb{E}\!\left[e^{\lambda Y_i}\right] \leq \exp\!\left(\frac{\lambda^2 (b_i - a_i)^2}{8}\right).$$
Multiplying gives $\prod_{i=1}^n \mathbb{E}[e^{\lambda Y_i}] \leq e^{\lambda^2 S/8}$ with $S = \sum_{i=1}^n (b_i - a_i)^2$.
Step 4: optimize. The exponent is $-\lambda n t + \frac{\lambda^2 S}{8}$. The optimum is $\lambda^* = \frac{4 n t}{S}$, giving exponent $-\frac{2 n^2 t^2}{S}$. The two-sided bound applies the same argument to $-Y_i$ and union-bounds the two tails, costing a factor of two.
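A minimal check of the optimization step, with $n$, $t$, and the widths chosen purely for illustration: minimize the Chernoff exponent on a grid and compare against the closed form $\lambda^* = 4nt/S$ with value $-2n^2t^2/S$.

```python
import numpy as np

n, t = 100, 0.1
widths = np.linspace(0.5, 2.0, n)               # illustrative widths (b_i - a_i)
S = (widths**2).sum()

lam = np.linspace(1e-4, 2.0, 20_001)
exponent = -lam * n * t + lam**2 * S / 8        # Chernoff exponent from steps 2-3

i = exponent.argmin()
print("grid optimum:  lam =", lam[i], " exponent =", exponent[i])
print("closed form:   lam =", 4 * n * t / S, " exponent =", -2 * n**2 * t**2 / S)
```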
Why It Matters
This is the bridge between the abstract MGF estimate and the finite-sample PAC bounds. Setting the two-sided bound equal to $\delta$ and inverting the exponential gives the sample-complexity statement that appears in every introductory treatment of finite-class learning.
Failure Mode
The bound uses $(b_i - a_i)^2$ as a worst-case range. When a variable's variance is much smaller than $(b-a)^2/4$, Hoeffding wastes the gap; Bernstein and Bennett close that gap by using variance information directly.
Bridge: Polynomial vs. Exponential Tails
Hoeffding's lemma is what changes the tail decay from polynomial to exponential. The same sample mean admits both a Chebyshev bound $\Pr[|\bar{X}_n - \mu| \geq t] \leq \frac{\sigma^2}{n t^2}$ and a Hoeffding bound $2 e^{-2 n t^2/(b-a)^2}$, and the second decays infinitely faster in $n$ for any fixed $t$. The practical effect: the sample size needed to certify $\Pr[|\bar{X}_n - \mu| \geq t] \leq \delta$ is $O\!\left(\frac{1}{\delta t^2}\right)$ under Chebyshev and $O\!\left(\frac{\log(1/\delta)}{t^2}\right)$ under Hoeffding. This is the reason every PAC sample-complexity statement carries a $\log(1/\delta)$ inside instead of a $1/\delta$.
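A back-of-the-envelope comparison, with all numbers illustrative: the sample size needed to certify a deviation of $t = 0.05$ with failure probability $\delta$ for a $[0,1]$-valued variable, under each bound.

```python
import numpy as np

t = 0.05
for delta in [1e-1, 1e-3, 1e-6]:
    # Chebyshev: sigma^2 / (n t^2) <= delta, with worst-case sigma^2 = 1/4 on [0,1].
    n_cheb = int(np.ceil(0.25 / (delta * t**2)))
    # Hoeffding (two-sided): 2 exp(-2 n t^2) <= delta.
    n_hoef = int(np.ceil(np.log(2 / delta) / (2 * t**2)))
    print(f"delta={delta:.0e}:  Chebyshev n={n_cheb:>12,}  Hoeffding n={n_hoef:>8,}")
```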
Common Confusions
Hoeffding's lemma needs zero mean
The bound assumes $\mathbb{E}[X] = 0$. For a non-centered $X$, the correct statement is that $X - \mathbb{E}[X]$ has the bounded MGF, so the standard application is to apply the lemma to the centered variable, not to $X$ itself.
The constant is 1/8, not 1/2
The proxy variance of a centered $[a,b]$-bounded variable is $\sigma^2 = \frac{(b-a)^2}{4}$. Inside the MGF the exponent has $\frac{\lambda^2 \sigma^2}{2} = \frac{\lambda^2 (b-a)^2}{8}$, with the extra factor of $\frac{1}{2}$ coming from the Gaussian-style scaling $e^{\lambda^2 \sigma^2/2}$. Equivalently, the curvature bound $\varphi'' \leq \frac{1}{4}$ integrates twice to $\frac{u^2}{8}$. A common error is to write $\frac{\lambda^2 (b-a)^2}{4}$ in the exponent; that would correspond to a sub-Gaussian variance proxy of $\frac{(b-a)^2}{2}$, which is twice the correct value.
Sharper bounds exist when the variance is small
Hoeffding's lemma only sees the range. A variable concentrated near zero with the same support $[a,b]$ has the same Hoeffding bound, even though its true MGF is much smaller. Bennett's lemma replaces $\frac{\lambda^2 (b-a)^2}{8}$ with a variance-driven exponent that is sharper exactly in this regime; see Bennett's inequality.
Exercises
Problem
Verify the curvature bound used in step 4 of the proof. Specifically, show that $\varphi''(u) = \pi(u)(1 - \pi(u))$ with $\pi(u) = \frac{p e^u}{1 - p + p e^u}$, and that $\varphi''(u) \leq \frac{1}{4}$ for every $u \in \mathbb{R}$.
Problem
Apply Hoeffding's lemma to a centered Rademacher variable $X$ with $\Pr[X = 1] = \Pr[X = -1] = \frac{1}{2}$. Compute the resulting MGF bound and compare it to the exact MGF $\mathbb{E}[e^{\lambda X}] = \cosh \lambda$.
Problem
Use Hoeffding's lemma to prove the following one-sided inequality. Let $X_1, \dots, X_n$ be i.i.d. with $X_i \in [0,1]$ and $\mathbb{E}[X_i] = \mu$. Show that for every $t > 0$,

$$\Pr\!\left[\frac{1}{n}\sum_{i=1}^n X_i - \mu \geq t\right] \leq e^{-2 n t^2}.$$
State explicitly which step of the derivation invokes the lemma.
References
Canonical:
- Hoeffding, W. (1963). "Probability Inequalities for Sums of Bounded Random Variables." Journal of the American Statistical Association, 58(301), 13-30. The original lemma is stated mid-paper as the MGF bound on which Hoeffding's inequality rests.
- Boucheron, S., Lugosi, G., & Massart, P. (2013). Concentration Inequalities. Oxford University Press. Lemma 2.2 (Hoeffding's lemma) and Theorem 2.8 (Hoeffding's inequality).
- Shalev-Shwartz, S., & Ben-David, S. (2014). Understanding Machine Learning. Cambridge University Press. Lemma B.7 (Hoeffding's lemma) and Lemma B.6 (Hoeffding's inequality) in Appendix B.
Current:
- Wainwright, M. J. (2019). High-Dimensional Statistics: A Non-Asymptotic Viewpoint. Cambridge University Press. Section 2.1.2 derives the lemma in the sub-Gaussian framework.
- Vershynin, R. (2018). High-Dimensional Probability. Cambridge University Press. Theorem 2.2.6 (Hoeffding's lemma) with discussion of sharpness.
- van Handel, R. (2016). Probability in High Dimension. Lecture notes, Princeton. Chapter 3 develops Hoeffding's lemma alongside the broader sub-Gaussian framework.
Next Topics
- Hoeffding's inequality: the corollary that almost every PAC sample-complexity bound invokes
- Bennett's inequality: variance-aware MGF bound, sharper when the variance is much smaller than the squared range
- Bernstein's inequality: the sample-mean form that follows from Bennett via an elementary lower bound on Bennett's exponent function
- Sub-Gaussian random variables: the abstract framework around the proxy variance
- Azuma-Hoeffding (martingale extension): extension to martingale differences with bounded increments
- McDiarmid's inequality: bounded-differences concentration for functions of independent variables
Last reviewed: May 8, 2026
Canonical graph
Required before and derived from this topic
These links come from prerequisite edges in the curriculum graph. Editorial suggestions are shown here only when the target page also cites this page as a prerequisite.
Required prerequisites
- Expectation, Variance, Covariance, and Moments (layer 0A · tier 1)
- Sets, Functions, and Relations (layer 0A · tier 1)
- Chernoff Bounds (layer 1 · tier 1)
- Concentration Inequalities (layer 1 · tier 1)
- Basic Logic and Proof Techniques (layer 0A · tier 2)
Derived topics
No published topic currently declares this as a prerequisite.