
Hoeffding's Lemma

The MGF bound that powers Hoeffding's inequality: a centered random variable on $[a, b]$ has a sub-Gaussian moment generating function with parameter $(b-a)^2/4$.

Core · Tier 1 · Stable · Supporting · ~30 min

Why This Matters

Hoeffding's inequality appears in the sample-complexity bound of every finite-class PAC argument on this site, and it is a one-line corollary of a single MGF estimate: Hoeffding's lemma. The lemma is the deeper result. It says that any zero-mean random variable supported on $[a, b]$ has a sub-Gaussian moment generating function with explicit parameter $(b-a)^2/4$, and the proof exposes exactly which two facts do the work — the convexity of $e^{\lambda x}$ and the curvature bound on a specific log-sum-exp function.

Once you internalize the lemma, every bounded-variable concentration statement on the site reduces to:

  1. write down the MGF bound from Hoeffding's lemma,
  2. multiply across independent (or martingale) summands,
  3. run the Chernoff method and optimize (sketched in code below).
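
The recipe in code, as a minimal numerical sketch with illustrative values (numpy assumed; nothing here comes from a specific library):

```python
import numpy as np

# Three-step recipe for n i.i.d. zero-mean summands on [a, b],
# bounding Pr[sum Y_i >= n t] via the Chernoff method.
a, b, n, t = 0.0, 1.0, 1000, 0.05

lam = np.linspace(1e-3, 1.0, 10_000)     # candidate Chernoff parameters
per_summand = lam**2 * (b - a)**2 / 8    # step 1: Hoeffding's lemma, log-MGF bound
total = n * per_summand                  # step 2: independence => log-bounds add
exponent = -lam * n * t + total          # step 3: Chernoff exponent in lambda
print(np.exp(exponent.min()))            # ~= exp(-2 n t^2) = exp(-5) ~ 0.0067
```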

The bound $(b-a)^2/8$ in the exponent is not folklore. It is the exact best constant achievable from convexity plus a second-derivative bound of $1/4$, and tracking where that $1/4$ comes from is what makes the proof memorable.

Quick Version

For a centered random variable $X \in [a, b]$ almost surely,

$$\mathbb{E}[e^{\lambda X}] \leq \exp\!\left(\frac{\lambda^2 (b-a)^2}{8}\right) \quad \text{for all } \lambda \in \mathbb{R}.$$

In sub-Gaussian language: $X - \mathbb{E}[X]$ is sub-Gaussian with proxy variance $(b-a)^2/4$. The constant $1/8$ is sharp in this convexity-based proof; the $1/4$ in the proxy variance is the maximum of $p(1-p)$ over $p \in [0, 1]$, and it appears as the Bernoulli$(1/2)$ worst case.
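
A quick numeric check of that worst case, as a sketch (numpy assumed): the centered Bernoulli$(1/2)$ takes values $\pm 1/2$, so $b - a = 1$, its exact MGF is $\cosh(\lambda/2)$, and the lemma promises $e^{\lambda^2/8}$.

```python
import numpy as np

lam = np.linspace(-10, 10, 2001)
exact = np.cosh(lam / 2)          # MGF of the centered Bernoulli(1/2)
bound = np.exp(lam**2 / 8)        # Hoeffding's lemma with b - a = 1
print(np.all(exact <= bound))     # True: the bound holds on the whole grid
print(np.max(exact / bound))      # 1.0, attained at lam = 0: tight to second order
```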

Mental Model

Three observations make the proof natural rather than mysterious.

  1. Convexity of $e^{\lambda x}$ controls the worst-case MGF. Among all distributions on $[a, b]$ with a fixed mean, the one that maximizes $\mathbb{E}[e^{\lambda X}]$ is supported on the two endpoints $\{a, b\}$. Convexity of $e^{\lambda x}$ implies that mass at an interior point is always dominated by the corresponding two-point mass.
  2. The two-point worst case is a tilted Bernoulli. After centering, the worst-case distribution puts mass $p = -a/(b-a)$ at $b$ and mass $1 - p$ at $a$. Its log-MGF is the function $\psi(u) = -pu + \log(1 - p + pe^u)$ with $u = \lambda(b-a)$.
  3. Curvature is bounded by $1/4$. A short calculation shows $\psi''(u) = q(u)(1 - q(u))$ where $q(u) \in [0, 1]$, so $\psi''(u) \leq 1/4$ uniformly in $u$ and $p$. Taylor's theorem with remainder then gives $\psi(u) \leq u^2/8$, and substituting $u = \lambda(b-a)$ recovers Hoeffding's lemma exactly; both facts are checked symbolically in the sketch below.
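
A minimal symbolic sketch of the two claims in item 3, assuming sympy is available:

```python
import sympy as sp

u, p = sp.symbols('u p', positive=True)
psi = -p*u + sp.log(1 - p + p*sp.exp(u))
q = p*sp.exp(u) / (1 - p + p*sp.exp(u))

# The identity psi''(u) = q(u) * (1 - q(u)):
print(sp.simplify(sp.diff(psi, u, 2) - q*(1 - q)))      # 0

# The curvature bound: q(1 - q) is maximized at q = 1/2.
qv = sp.symbols('qv')
print(sp.maximum(qv*(1 - qv), qv, sp.Interval(0, 1)))   # 1/4
```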

Recall: convexity of $e^{\lambda x}$. For any convex $f$ and $\alpha \in [0, 1]$, $f(\alpha a + (1-\alpha)b) \leq \alpha f(a) + (1-\alpha) f(b)$. Below we use $\alpha = (b - x)/(b - a)$ so that $x = \alpha a + (1-\alpha)b$, and apply this to $f(x) = e^{\lambda x}$.

Formal Setup

Let $X$ be a real-valued random variable with $\mathbb{E}[X] = 0$ and $a \leq X \leq b$ almost surely, where $a < 0 < b$ (the centered case forces this unless $X$ is degenerate at zero). Define the standardized parameter $u = \lambda(b - a)$ and the convex weight $p = -a/(b - a) \in [0, 1]$.

Lemma

Hoeffding's Lemma

Statement

If $\mathbb{E}[X] = 0$ and $a \leq X \leq b$ almost surely, then for every $\lambda \in \mathbb{R}$:

$$\mathbb{E}[e^{\lambda X}] \leq \exp\!\left(\frac{\lambda^2 (b-a)^2}{8}\right).$$


Intuition

Bounded support gives a uniform second-derivative bound on the log-MGF; that bound integrates twice to give the displayed $\lambda^2(b-a)^2/8$ factor. The lemma compresses everything specific about the distribution into a single range parameter $b - a$, which is why Hoeffding-type bounds care only about support width, not about the exact shape of the variable.
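
A brute-force sanity check of the lemma, as an illustrative sketch (numpy assumed): draw an arbitrary discrete distribution on $[a, b]$, center it, and compare its exact MGF against the bound on a grid of $\lambda$.

```python
import numpy as np

rng = np.random.default_rng(0)
a, b = -1.0, 2.0

atoms = rng.uniform(a, b, size=8)    # arbitrary support points in [a, b]
probs = rng.dirichlet(np.ones(8))    # arbitrary probability weights
atoms -= probs @ atoms               # center so E[X] = 0; the support still
                                     # sits in an interval of width b - a

lam = np.linspace(-5, 5, 1001)
mgf = np.exp(np.outer(lam, atoms)) @ probs    # exact E[e^{lam X}] per lam
bound = np.exp(lam**2 * (b - a)**2 / 8)
print(np.all(mgf <= bound))                   # True
```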

Proof Sketch

Step 1: convexity bound at the sample level. Since $X \in [a, b]$, write $X = \alpha a + (1-\alpha)b$ with $\alpha = (b - X)/(b - a) \in [0, 1]$. Convexity of $e^{\lambda x}$ gives, pointwise,

$$e^{\lambda X} \leq \frac{b - X}{b - a}\, e^{\lambda a} + \frac{X - a}{b - a}\, e^{\lambda b}.$$

Step 2: take expectations. Using $\mathbb{E}[X] = 0$,

$$\mathbb{E}[e^{\lambda X}] \leq \frac{b}{b - a}\, e^{\lambda a} + \frac{-a}{b - a}\, e^{\lambda b}.$$

Step 3: rewrite in standardized form. Let $p = -a/(b - a)$ and $u = \lambda(b - a)$. The right-hand side becomes $(1 - p)e^{-pu} + pe^{(1-p)u} = e^{\psi(u)}$ with

$$\psi(u) = -pu + \log\!\bigl(1 - p + pe^u\bigr).$$

Step 4: bound $\psi$ by $u^2/8$. Differentiating,

$$\psi'(u) = -p + \frac{pe^u}{1 - p + pe^u} = q(u) - p, \qquad q(u) := \frac{pe^u}{1 - p + pe^u} \in [0, 1].$$

A second differentiation gives $\psi''(u) = q(u)(1 - q(u)) \leq 1/4$, since $q(1 - q)$ is maximized at $q = 1/2$. Combined with $\psi(0) = 0$ and $\psi'(0) = 0$, Taylor's theorem with integral remainder yields

$$\psi(u) = \int_0^u (u - s)\, \psi''(s)\, ds \leq \int_0^u (u - s) \cdot \tfrac{1}{4}\, ds = \frac{u^2}{8}.$$

Step 5: substitute back. Replacing $u$ by $\lambda(b - a)$ and exponentiating gives the stated bound.

Why It Matters

Every Hoeffding-style concentration statement on this site, including the finite-sum form, the sample-mean form, the Azuma-Hoeffding martingale extension, McDiarmid's bounded-differences inequality, and the finite-class ERM generalization bound, plugs Hoeffding's lemma into the Chernoff method at the per-summand step.

The lemma also says $X - \mathbb{E}[X]$ is sub-Gaussian with proxy variance $(b-a)^2/4$. That places it inside the sub-Gaussian framework, where additivity of proxy variances under independent sums is built in.

Failure Mode

The lemma needs an almost-sure boundedness assumption. For unbounded variables with finite variance the bound is false in general; use a sub-Gaussian or sub-exponential MGF estimate instead. The constant $1/8$ is sharp for this convexity-based proof but is not the best possible for every distribution: a Bernoulli$(1/2)$ saturates it, while a uniform on $[-1, 1]$ has a strictly smaller proxy variance under the variance-aware form ($(b-a)^2/12$ instead of $(b-a)^2/4$).

From the Lemma to Hoeffding's Inequality

The lemma is the only nontrivial step in the proof of Hoeffding's inequality. The rest is the Chernoff method plus independence.

Corollary

Hoeffding's Inequality from the Lemma

Statement

Let $X_1, \ldots, X_n$ be independent with $a_i \leq X_i \leq b_i$ almost surely, and let $\mu = \mathbb{E}[\bar{X}_n]$. For every $t > 0$:

$$\Pr\!\left[\bar{X}_n - \mu \geq t\right] \leq \exp\!\left(-\frac{2n^2t^2}{\sum_{i=1}^n (b_i - a_i)^2}\right).$$

A union bound over the two tails gives

$$\Pr\!\left[|\bar{X}_n - \mu| \geq t\right] \leq 2\exp\!\left(-\frac{2n^2t^2}{\sum_{i=1}^n (b_i - a_i)^2}\right).$$

For the special case of identically bounded $X_i \in [a, b]$, the bound collapses to $2\exp\!\bigl(-2nt^2/(b-a)^2\bigr)$, and for $X_i \in [0, 1]$ to $2\exp\!\bigl(-2nt^2\bigr)$.
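
A Monte Carlo sanity check of the two-sided $[0, 1]$ special case, as a sketch with illustrative numbers (numpy assumed):

```python
import numpy as np

rng = np.random.default_rng(1)
n, t, trials = 200, 0.1, 20_000

# i.i.d. Uniform[0, 1]: mu = 1/2 and range width b - a = 1.
means = rng.uniform(0, 1, size=(trials, n)).mean(axis=1)
empirical = np.mean(np.abs(means - 0.5) >= t)
hoeffding = 2 * np.exp(-2 * n * t**2)

print(empirical, hoeffding)   # empirical tail frequency sits below ~0.0366
```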


Proof Sketch

Step 1: center. Set $Y_i = X_i - \mathbb{E}[X_i]$. Each $Y_i$ has zero mean and lies in $[a_i - \mathbb{E}[X_i],\, b_i - \mathbb{E}[X_i]]$, an interval of width $b_i - a_i$.

Step 2: Chernoff for the centered sum. For any $\lambda > 0$,

$$\Pr\!\left[\sum_i Y_i \geq nt\right] \leq e^{-\lambda nt}\, \mathbb{E}\!\left[e^{\lambda \sum_i Y_i}\right] = e^{-\lambda nt} \prod_{i=1}^n \mathbb{E}[e^{\lambda Y_i}],$$

using independence to factor the MGF.

Step 3: apply Hoeffding's lemma to each factor. Each $Y_i$ is centered on an interval of width $b_i - a_i$, so

$$\mathbb{E}[e^{\lambda Y_i}] \leq \exp\!\left(\frac{\lambda^2 (b_i - a_i)^2}{8}\right).$$

Multiplying gives $\mathbb{E}[e^{\lambda \sum_i Y_i}] \leq \exp\!\bigl(\lambda^2 v/8\bigr)$ with $v = \sum_i (b_i - a_i)^2$.

Step 4: optimize. The exponent is $-\lambda nt + \lambda^2 v/8$. The optimum is $\lambda^* = 4nt/v$, giving exponent $-2n^2t^2/v$. The two-sided bound applies the same argument to $-Y_i$ and union-bounds the two tails, costing a factor of two.
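
The optimization in step 4 is one line of calculus; a symbolic sketch (sympy assumed):

```python
import sympy as sp

lam, n, t, v = sp.symbols('lambda n t v', positive=True)
exponent = -lam*n*t + lam**2*v/8

lam_star = sp.solve(sp.diff(exponent, lam), lam)[0]
print(lam_star)                                    # 4*n*t/v
print(sp.simplify(exponent.subs(lam, lam_star)))   # -2*n**2*t**2/v
```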

Why It Matters

This is the bridge between the abstract MGF estimate and the finite-sample PAC bounds. Setting $t = \epsilon$ and inverting the exponential gives the sample-complexity statement $n \geq (b-a)^2 \log(2/\delta)/(2\epsilon^2)$ that appears in every introductory treatment of finite-class learning.
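
Inverting the bound numerically, with illustrative accuracy and confidence targets:

```python
import math

# Smallest n with 2 * exp(-2 n eps^2 / width^2) <= delta, for [0, 1] variables.
eps, delta, width = 0.05, 0.05, 1.0
n = math.ceil(width**2 * math.log(2 / delta) / (2 * eps**2))
print(n)   # 738: certifies |mean - mu| < 0.05 with probability >= 0.95
```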

Failure Mode

The bound uses $b_i - a_i$ as a worst-case range. When a variable's variance is much smaller than $(b_i - a_i)^2/4$, Hoeffding wastes the gap; Bernstein and Bennett close that gap by using variance information directly.

Bridge: Polynomial vs. Exponential Tails

Hoeffding's lemma is what changes the tail decay from polynomial to exponential. The same sample mean $\bar{X}_n$ admits both a Chebyshev bound $\sigma^2/(nt^2)$ and a Hoeffding bound $\exp\bigl(-2nt^2/(b-a)^2\bigr)$, and for any fixed $t$ the second decays exponentially in $n$ while the first decays only polynomially. The practical effect: the sample size needed to certify $\Pr[|\bar{X}_n - \mu| \geq \epsilon] \leq \delta$ is $O(1/(\delta\epsilon^2))$ under Chebyshev and $O(\log(1/\delta)/\epsilon^2)$ under Hoeffding. This is the reason every PAC sample-complexity statement carries a $\log(1/\delta)$ inside instead of a $1/\delta$.
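
The gap is easy to see numerically; a sketch comparing the two sample-size formulas for $[0, 1]$ variables, using the worst-case variance $\sigma^2 = 1/4$:

```python
import math

eps = 0.05
for delta in (1e-2, 1e-4, 1e-6):
    n_cheb = math.ceil(0.25 / (delta * eps**2))             # sigma^2 / (delta eps^2)
    n_hoef = math.ceil(math.log(2 / delta) / (2 * eps**2))  # log(2/delta) / (2 eps^2)
    print(delta, n_cheb, n_hoef)
# At delta = 1e-6: Chebyshev asks for 1e8 samples, Hoeffding for about 2.9e3.
```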

Common Confusions

Watch Out

Hoeffding's lemma needs zero mean

The bound $\mathbb{E}[e^{\lambda X}] \leq e^{\lambda^2(b-a)^2/8}$ assumes $\mathbb{E}[X] = 0$. For a non-centered $X \in [a, b]$, the correct statement is that $X - \mathbb{E}[X]$ has the bounded MGF, so the standard move is to apply the lemma to the centered variable, not to $X$ itself.

Watch Out

The constant is 1/8, not 1/2

The proxy variance of a centered $[a, b]$-bounded variable is $(b-a)^2/4$. Inside the MGF the exponent has $\lambda^2(b-a)^2/8$, with the extra factor $1/2$ coming from the Gaussian-style $\lambda^2\sigma^2/2$ scaling. Equivalently, the curvature bound $\psi''(u) \leq 1/4$ integrates to $u^2/8$. A common error is to write $\lambda^2(b-a)^2/4$ in the exponent; that would correspond to a sub-Gaussian variance proxy of $(b-a)^2/2$, which is twice the correct value.

Watch Out

Sharper bounds exist when the variance is small

Hoeffding's lemma only sees the range. A variable concentrated near zero with the same support $[a, b]$ has the same Hoeffding bound, even though its true MGF is much smaller. Bennett's lemma replaces $(b-a)^2/8$ with a variance-driven exponent that is sharper exactly in this regime; see Bennett's inequality.

Exercises

Exercise (Core)

Problem

Verify the curvature bound $\psi''(u) \leq 1/4$ used in step 4 of the proof. Specifically, show that $\psi''(u) = q(u)(1 - q(u))$ with $q(u) = pe^u/(1 - p + pe^u)$, and that $q(1 - q) \leq 1/4$ for every $q \in [0, 1]$.

Exercise (Core)

Problem

Apply Hoeffding's lemma to a centered Rademacher variable $\sigma$ with $\Pr[\sigma = 1] = \Pr[\sigma = -1] = 1/2$. Compute the resulting MGF bound and compare it to the exact MGF $\mathbb{E}[e^{\lambda\sigma}] = \cosh(\lambda)$.

Exercise (Advanced)

Problem

Use Hoeffding's lemma to prove the following one-sided inequality. Let $X_1, \ldots, X_n$ be i.i.d. with $X_i \in [0, 1]$ and $\mathbb{E}[X_i] = \mu$. Show that for every $t \geq 0$,

$$\Pr[\bar{X}_n \geq \mu + t] \leq \exp(-2nt^2).$$

State explicitly which step of the derivation invokes the lemma.

References

Canonical:

  • Hoeffding, W. (1963). "Probability Inequalities for Sums of Bounded Random Variables." Journal of the American Statistical Association, 58(301), 13-30. The original lemma is stated mid-paper as the MGF bound on which Hoeffding's inequality rests.
  • Boucheron, S., Lugosi, G., & Massart, P. (2013). Concentration Inequalities. Oxford University Press. Lemma 2.2 (Hoeffding's lemma) and Theorem 2.8 (Hoeffding's inequality).
  • Shalev-Shwartz, S., & Ben-David, S. (2014). Understanding Machine Learning. Cambridge University Press. Lemma B.7 (Hoeffding's lemma) and Lemma B.6 (Hoeffding's inequality) in Appendix B.

Current:

  • Wainwright, M. J. (2019). High-Dimensional Statistics: A Non-Asymptotic Viewpoint. Cambridge University Press. Section 2.1.2 derives the lemma in the sub-Gaussian framework.
  • Vershynin, R. (2018). High-Dimensional Probability. Cambridge University Press. Theorem 2.2.6 (Hoeffding's lemma) with discussion of sharpness.
  • van Handel, R. (2016). Probability in High Dimension. Lecture notes, Princeton. Chapter 3 develops Hoeffding's lemma alongside the broader sub-Gaussian framework.
