
Hoeffding's Lemma

The MGF bound that powers Hoeffding's inequality: a centered random variable on $[a, b]$ has a sub-Gaussian moment generating function with parameter $(b-a)^2/4$.

Core · Tier 1 · Stable · Supporting · ~30 min

Why This Matters

Hoeffding's inequality appears in the sample-complexity bound of every finite-class PAC argument on this site, and it is a one-line corollary of a single MGF estimate: Hoeffding's lemma. The lemma is the deeper result. It says that any zero-mean random variable supported on $[a, b]$ has a sub-Gaussian moment generating function with explicit parameter $(b-a)^2/4$, and the proof exposes exactly which two facts do the work — the convexity of $e^{\lambda x}$ and the curvature bound on a specific log-sum-exp function.

Once you internalize the lemma, every bounded-variable concentration statement on the site reduces to:

  1. write down the MGF bound from Hoeffding's lemma,
  2. multiply across independent (or martingale) summands,
  3. run the Chernoff method and optimize (sketched in code below).
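
The recipe in code, as a minimal numerical sketch with illustrative values (numpy assumed; nothing here comes from a specific library):

```python
import numpy as np

# Three-step recipe for n i.i.d. zero-mean summands on [a, b],
# bounding Pr[sum Y_i >= n t] via the Chernoff method.
a, b, n, t = 0.0, 1.0, 1000, 0.05

lam = np.linspace(1e-3, 1.0, 10_000)     # candidate Chernoff parameters
per_summand = lam**2 * (b - a)**2 / 8    # step 1: Hoeffding's lemma, log-MGF bound
total = n * per_summand                  # step 2: independence => log-bounds add
exponent = -lam * n * t + total          # step 3: Chernoff exponent in lambda
print(np.exp(exponent.min()))            # ~= exp(-2 n t^2) = exp(-5) ~ 0.0067
```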

The bound $(b-a)^2/8$ in the exponent is not folklore. It is the exact best constant achievable from convexity plus a second-derivative bound of $1/4$, and tracking where that $1/4$ comes from is what makes the proof memorable.

Quick Version

For a centered random variable $X \in [a, b]$ almost surely,

$$\mathbb{E}[e^{\lambda X}] \leq \exp\!\left(\frac{\lambda^2 (b-a)^2}{8}\right) \quad \text{for all } \lambda \in \mathbb{R}.$$

In sub-Gaussian language: $X - \mathbb{E}[X]$ is sub-Gaussian with proxy variance $(b-a)^2/4$. The constant $1/8$ is sharp in this convexity-based proof; the $1/4$ in the proxy variance is the maximum of $p(1-p)$ over $p \in [0, 1]$, and it appears as the Bernoulli$(1/2)$ worst case.
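
A quick numeric check of that worst case, as a sketch (numpy assumed): the centered Bernoulli$(1/2)$ takes values $\pm 1/2$, so $b - a = 1$, its exact MGF is $\cosh(\lambda/2)$, and the lemma promises $e^{\lambda^2/8}$.

```python
import numpy as np

lam = np.linspace(-10, 10, 2001)
exact = np.cosh(lam / 2)          # MGF of the centered Bernoulli(1/2)
bound = np.exp(lam**2 / 8)        # Hoeffding's lemma with b - a = 1
print(np.all(exact <= bound))     # True: the bound holds on the whole grid
print(np.max(exact / bound))      # 1.0, attained at lam = 0: tight to second order
```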

Mental Model

Three observations make the proof natural rather than mysterious.

  1. Convexity of $e^{\lambda x}$ controls the worst-case MGF. Among all distributions on $[a, b]$ with a fixed mean, the one that maximizes $\mathbb{E}[e^{\lambda X}]$ is supported on the two endpoints $\{a, b\}$. Convexity of $e^{\lambda x}$ implies that mass at an interior point is always dominated by the corresponding two-point mass.
  2. The two-point worst case is a tilted Bernoulli. After centering, the worst-case distribution puts mass $p = -a/(b-a)$ at $b$ and mass $1 - p$ at $a$. Its log-MGF is the function $\psi(u) = -pu + \log(1 - p + pe^u)$ with $u = \lambda(b-a)$.
  3. Curvature is bounded by $1/4$. A short calculation shows $\psi''(u) = q(u)(1 - q(u))$ where $q(u) \in [0, 1]$, so $\psi''(u) \leq 1/4$ uniformly in $u$ and $p$. Taylor's theorem with remainder then gives $\psi(u) \leq u^2/8$, and substituting $u = \lambda(b-a)$ recovers Hoeffding's lemma exactly; both facts are checked symbolically in the sketch below.
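
A minimal symbolic sketch of the two claims in item 3, assuming sympy is available:

```python
import sympy as sp

u, p = sp.symbols('u p', positive=True)
psi = -p*u + sp.log(1 - p + p*sp.exp(u))
q = p*sp.exp(u) / (1 - p + p*sp.exp(u))

# The identity psi''(u) = q(u) * (1 - q(u)):
print(sp.simplify(sp.diff(psi, u, 2) - q*(1 - q)))      # 0

# The curvature bound: q(1 - q) is maximized at q = 1/2.
qv = sp.symbols('qv')
print(sp.maximum(qv*(1 - qv), qv, sp.Interval(0, 1)))   # 1/4
```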

Recall: convexity of $e^{\lambda x}$. For any convex $f$ and $\alpha \in [0, 1]$, $f(\alpha a + (1-\alpha)b) \leq \alpha f(a) + (1-\alpha) f(b)$. Below we use $\alpha = (b - x)/(b - a)$ so that $x = \alpha a + (1-\alpha)b$, and apply this to $f(x) = e^{\lambda x}$.

Formal Setup

Let $X$ be a real-valued random variable with $\mathbb{E}[X] = 0$ and $a \leq X \leq b$ almost surely, where $a < 0 < b$ (the centered case forces this unless $X$ is degenerate at zero). Define the standardized parameter $u = \lambda(b - a)$ and the convex weight $p = -a/(b - a) \in [0, 1]$.

Lemma

Hoeffding's Lemma

Statement

If $\mathbb{E}[X] = 0$ and $a \leq X \leq b$ almost surely, then for every $\lambda \in \mathbb{R}$:

$$\mathbb{E}[e^{\lambda X}] \leq \exp\!\left(\frac{\lambda^2 (b-a)^2}{8}\right).$$


Intuition

Bounded support gives a uniform second-derivative bound on the log-MGF; that bound integrates twice to give the displayed $\lambda^2(b-a)^2/8$ factor. The lemma compresses everything specific about the distribution into a single range parameter $b - a$, which is why Hoeffding-type bounds care only about support width, not about the exact shape of the variable.
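
A brute-force sanity check of the lemma, as an illustrative sketch (numpy assumed): draw an arbitrary discrete distribution on $[a, b]$, center it, and compare its exact MGF against the bound on a grid of $\lambda$.

```python
import numpy as np

rng = np.random.default_rng(0)
a, b = -1.0, 2.0

atoms = rng.uniform(a, b, size=8)    # arbitrary support points in [a, b]
probs = rng.dirichlet(np.ones(8))    # arbitrary probability weights
atoms -= probs @ atoms               # center so E[X] = 0; the support still
                                     # sits in an interval of width b - a

lam = np.linspace(-5, 5, 1001)
mgf = np.exp(np.outer(lam, atoms)) @ probs    # exact E[e^{lam X}] per lam
bound = np.exp(lam**2 * (b - a)**2 / 8)
print(np.all(mgf <= bound))                   # True
```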

Proof Sketch

Step 1: convexity bound at the sample level. Since $X \in [a, b]$, write $X = \alpha a + (1-\alpha)b$ with $\alpha = (b - X)/(b - a) \in [0, 1]$. Convexity of $e^{\lambda x}$ gives, pointwise,

$$e^{\lambda X} \leq \frac{b - X}{b - a}\, e^{\lambda a} + \frac{X - a}{b - a}\, e^{\lambda b}.$$

Step 2: take expectations. Using $\mathbb{E}[X] = 0$,

$$\mathbb{E}[e^{\lambda X}] \leq \frac{b}{b - a}\, e^{\lambda a} + \frac{-a}{b - a}\, e^{\lambda b}.$$

Step 3: rewrite in standardized form. Let $p = -a/(b - a)$ and $u = \lambda(b - a)$. The right-hand side becomes $(1 - p)e^{-pu} + pe^{(1-p)u} = e^{\psi(u)}$ with

$$\psi(u) = -pu + \log\!\bigl(1 - p + pe^u\bigr).$$

Step 4: bound $\psi$ by $u^2/8$. Differentiating,

$$\psi'(u) = -p + \frac{pe^u}{1 - p + pe^u} = q(u) - p, \qquad q(u) := \frac{pe^u}{1 - p + pe^u} \in [0, 1].$$

A second differentiation gives $\psi''(u) = q(u)(1 - q(u)) \leq 1/4$, since $q(1 - q)$ is maximized at $q = 1/2$. Combined with $\psi(0) = 0$ and $\psi'(0) = 0$, Taylor's theorem with integral remainder yields

$$\psi(u) = \int_0^u (u - s)\, \psi''(s)\, ds \leq \int_0^u (u - s) \cdot \tfrac{1}{4}\, ds = \frac{u^2}{8}.$$

Step 5: substitute back. Replacing $u$ by $\lambda(b - a)$ and exponentiating gives the stated bound.

Why It Matters

Every Hoeffding-style concentration statement on this site, including the finite-sum form, the sample-mean form, the Azuma-Hoeffding martingale extension, McDiarmid's bounded-differences inequality, and the finite-class ERM generalization bound, plugs Hoeffding's lemma into the Chernoff method at the per-summand step.

The lemma also says $X - \mathbb{E}[X]$ is sub-Gaussian with proxy variance $(b-a)^2/4$. That places it inside the sub-Gaussian framework, where additivity of proxy variances under independent sums is built in.

Failure Mode

The lemma needs an almost-sure boundedness assumption. For unbounded variables with finite variance the bound is false in general; use a sub-Gaussian or sub-exponential MGF estimate instead. The constant $1/8$ is sharp for this convexity-based proof but is not the best possible for every distribution: a Bernoulli$(1/2)$ saturates it, while a uniform on $[-1, 1]$ has a strictly smaller proxy variance under the variance-aware form ($(b-a)^2/12$ instead of $(b-a)^2/4$).

From the Lemma to Hoeffding's Inequality

The lemma is the only nontrivial step in the proof of Hoeffding's inequality. The rest is the Chernoff method plus independence.

Corollary

Hoeffding's Inequality from the Lemma

Statement

Let $X_1, \ldots, X_n$ be independent with $a_i \leq X_i \leq b_i$ almost surely, and let $\mu = \mathbb{E}[\bar{X}_n]$. For every $t > 0$:

$$\Pr\!\left[\bar{X}_n - \mu \geq t\right] \leq \exp\!\left(-\frac{2n^2t^2}{\sum_{i=1}^n (b_i - a_i)^2}\right).$$

A union bound over the two tails gives

$$\Pr\!\left[|\bar{X}_n - \mu| \geq t\right] \leq 2\exp\!\left(-\frac{2n^2t^2}{\sum_{i=1}^n (b_i - a_i)^2}\right).$$

For the special case of identically bounded $X_i \in [a, b]$, the bound collapses to $2\exp\!\bigl(-2nt^2/(b-a)^2\bigr)$, and for $X_i \in [0, 1]$ to $2\exp\!\bigl(-2nt^2\bigr)$.
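
A Monte Carlo sanity check of the two-sided $[0, 1]$ special case, as a sketch with illustrative numbers (numpy assumed):

```python
import numpy as np

rng = np.random.default_rng(1)
n, t, trials = 200, 0.1, 20_000

# i.i.d. Uniform[0, 1]: mu = 1/2 and range width b - a = 1.
means = rng.uniform(0, 1, size=(trials, n)).mean(axis=1)
empirical = np.mean(np.abs(means - 0.5) >= t)
hoeffding = 2 * np.exp(-2 * n * t**2)

print(empirical, hoeffding)   # empirical tail frequency sits below ~0.0366
```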


Proof Sketch

Step 1: center. Set $Y_i = X_i - \mathbb{E}[X_i]$. Each $Y_i$ has zero mean and lies in $[a_i - \mathbb{E}[X_i],\, b_i - \mathbb{E}[X_i]]$, an interval of width $b_i - a_i$.

Step 2: Chernoff for the centered sum. For any $\lambda > 0$,

$$\Pr\!\left[\sum_i Y_i \geq nt\right] \leq e^{-\lambda nt}\, \mathbb{E}\!\left[e^{\lambda \sum_i Y_i}\right] = e^{-\lambda nt} \prod_{i=1}^n \mathbb{E}[e^{\lambda Y_i}],$$

using independence to factor the MGF.

Step 3: apply Hoeffding's lemma to each factor. Each $Y_i$ is centered on an interval of width $b_i - a_i$, so

$$\mathbb{E}[e^{\lambda Y_i}] \leq \exp\!\left(\frac{\lambda^2 (b_i - a_i)^2}{8}\right).$$

Multiplying gives $\mathbb{E}[e^{\lambda \sum_i Y_i}] \leq \exp\!\bigl(\lambda^2 v/8\bigr)$ with $v = \sum_i (b_i - a_i)^2$.

Step 4: optimize. The exponent is $-\lambda nt + \lambda^2 v/8$. The optimum is $\lambda^* = 4nt/v$, giving exponent $-2n^2t^2/v$. The two-sided bound applies the same argument to $-Y_i$ and union-bounds the two tails, costing a factor of two.
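
The optimization in step 4 is one line of calculus; a symbolic sketch (sympy assumed):

```python
import sympy as sp

lam, n, t, v = sp.symbols('lambda n t v', positive=True)
exponent = -lam*n*t + lam**2*v/8

lam_star = sp.solve(sp.diff(exponent, lam), lam)[0]
print(lam_star)                                    # 4*n*t/v
print(sp.simplify(exponent.subs(lam, lam_star)))   # -2*n**2*t**2/v
```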

Why It Matters

This is the bridge between the abstract MGF estimate and the finite-sample PAC bounds. Setting $t = \epsilon$ and inverting the exponential gives the sample-complexity statement $n \geq (b-a)^2 \log(2/\delta)/(2\epsilon^2)$ that appears in every introductory treatment of finite-class learning.
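
Inverting the bound numerically, with illustrative accuracy and confidence targets:

```python
import math

# Smallest n with 2 * exp(-2 n eps^2 / width^2) <= delta, for [0, 1] variables.
eps, delta, width = 0.05, 0.05, 1.0
n = math.ceil(width**2 * math.log(2 / delta) / (2 * eps**2))
print(n)   # 738: certifies |mean - mu| < 0.05 with probability >= 0.95
```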

Failure Mode

The bound uses $b_i - a_i$ as a worst-case range. When a variable's variance is much smaller than $(b_i - a_i)^2/4$, Hoeffding wastes the gap; Bernstein and Bennett close that gap by using variance information directly.

Bridge: Polynomial vs. Exponential Tails

Hoeffding's lemma is what changes the tail decay from polynomial to exponential. The same sample mean $\bar{X}_n$ admits both a Chebyshev bound $\sigma^2/(nt^2)$ and a Hoeffding bound $\exp\bigl(-2nt^2/(b-a)^2\bigr)$, and for any fixed $t$ the second decays exponentially in $n$ while the first decays only polynomially. The practical effect: the sample size needed to certify $\Pr[|\bar{X}_n - \mu| \geq \epsilon] \leq \delta$ is $O(1/(\delta\epsilon^2))$ under Chebyshev and $O(\log(1/\delta)/\epsilon^2)$ under Hoeffding. This is the reason every PAC sample-complexity statement carries a $\log(1/\delta)$ inside instead of a $1/\delta$.
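
The gap is easy to see numerically; a sketch comparing the two sample-size formulas for $[0, 1]$ variables, using the worst-case variance $\sigma^2 = 1/4$:

```python
import math

eps = 0.05
for delta in (1e-2, 1e-4, 1e-6):
    n_cheb = math.ceil(0.25 / (delta * eps**2))             # sigma^2 / (delta eps^2)
    n_hoef = math.ceil(math.log(2 / delta) / (2 * eps**2))  # log(2/delta) / (2 eps^2)
    print(delta, n_cheb, n_hoef)
# At delta = 1e-6: Chebyshev asks for 1e8 samples, Hoeffding for about 2.9e3.
```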

Common Confusions

Watch Out

Hoeffding's lemma needs zero mean

The bound $\mathbb{E}[e^{\lambda X}] \leq e^{\lambda^2(b-a)^2/8}$ assumes $\mathbb{E}[X] = 0$. For a non-centered $X \in [a, b]$, the correct statement is that $X - \mathbb{E}[X]$ has the bounded MGF, so the standard move is to apply the lemma to the centered variable, not to $X$ itself.

Watch Out

The constant is 1/8, not 1/2

The proxy variance of a centered $[a, b]$-bounded variable is $(b-a)^2/4$. Inside the MGF the exponent has $\lambda^2(b-a)^2/8$, with the extra factor $1/2$ coming from the Gaussian-style $\lambda^2\sigma^2/2$ scaling. Equivalently, the curvature bound $\psi''(u) \leq 1/4$ integrates to $u^2/8$. A common error is to write $\lambda^2(b-a)^2/4$ in the exponent; that would correspond to a sub-Gaussian variance proxy of $(b-a)^2/2$, which is twice the correct value.

Watch Out

Sharper bounds exist when the variance is small

Hoeffding's lemma only sees the range. A variable concentrated near zero with the same support $[a, b]$ has the same Hoeffding bound, even though its true MGF is much smaller. Bennett's lemma replaces $(b-a)^2/8$ with a variance-driven exponent that is sharper exactly in this regime; see Bennett's inequality.

Exercises

Exercise (Core)

Problem

Verify the curvature bound $\psi''(u) \leq 1/4$ used in step 4 of the proof. Specifically, show that $\psi''(u) = q(u)(1 - q(u))$ with $q(u) = pe^u/(1 - p + pe^u)$, and that $q(1 - q) \leq 1/4$ for every $q \in [0, 1]$.

Exercise (Core)

Problem

Apply Hoeffding's lemma to a centered Rademacher variable $\sigma$ with $\Pr[\sigma = 1] = \Pr[\sigma = -1] = 1/2$. Compute the resulting MGF bound and compare it to the exact MGF $\mathbb{E}[e^{\lambda\sigma}] = \cosh(\lambda)$.

Exercise (Advanced)

Problem

Use Hoeffding's lemma to prove the following one-sided inequality. Let $X_1, \ldots, X_n$ be i.i.d. with $X_i \in [0, 1]$ and $\mathbb{E}[X_i] = \mu$. Show that for every $t \geq 0$,

$$\Pr[\bar{X}_n \geq \mu + t] \leq \exp(-2nt^2).$$

State explicitly which step of the derivation invokes the lemma.

References

Canonical:

  • Hoeffding, W. (1963). "Probability Inequalities for Sums of Bounded Random Variables." Journal of the American Statistical Association, 58(301), 13-30. The original lemma is stated mid-paper as the MGF bound on which Hoeffding's inequality rests.
  • Boucheron, S., Lugosi, G., & Massart, P. (2013). Concentration Inequalities. Oxford University Press. Lemma 2.2 (Hoeffding's lemma) and Theorem 2.8 (Hoeffding's inequality).
  • Shalev-Shwartz, S., & Ben-David, S. (2014). Understanding Machine Learning. Cambridge University Press. Lemma B.7 (Hoeffding's lemma) and Lemma B.6 (Hoeffding's inequality) in Appendix B.

Current:

  • Wainwright, M. J. (2019). High-Dimensional Statistics: A Non-Asymptotic Viewpoint. Cambridge University Press. Section 2.1.2 derives the lemma in the sub-Gaussian framework.
  • Vershynin, R. (2018). High-Dimensional Probability. Cambridge University Press. Theorem 2.2.6 (Hoeffding's lemma) with discussion of sharpness.
  • van Handel, R. (2016). Probability in High Dimension. Lecture notes, Princeton. Chapter 3 develops Hoeffding's lemma alongside the broader sub-Gaussian framework.
