

Bennett's Inequality

A variance-aware concentration inequality for independent bounded random variables. The exponent uses the function h(a) = (1+a)log(1+a) - a, the same h that controls the multiplicative Chernoff bound for Bernoullis.


Why This Matters

Hoeffding throws away the variance. Bernstein keeps the variance in the exponent, at the cost of some looseness for very rare deviations. Bennett's inequality sits between them: it uses the same variance information as Bernstein but expresses the tail through the rate function

$$h(a) = (1 + a) \log(1 + a) - a, \qquad a \geq 0,$$

which is the sharp large-deviations rate of a centered Poisson. Bennett is the sharpest of the three for sums of independent bounded variables when the variance is small relative to the squared range, and the entire family of exponential bounds (multiplicative Chernoff, Bennett, Bernstein) lives inside the same identity, separated only by which inequality you use to simplify $h$.

The pedagogical point is the unification: once you have the Bennett MGF lemma, you can read off the multiplicative Chernoff bound for Bernoullis and the Bernstein bound for general bounded summands as two corollaries that simplify $h$ in different regimes.

Mental Model

Bennett's exponent factors as $n \sigma^2 / M^2$ times $h$ of a relative deviation. Two anchoring facts:

  1. Poisson saturates $h$. A centered Poisson$(\nu)$ random variable has log-MGF $\nu(e^\lambda - 1) - \nu \lambda = \nu (e^\lambda - 1 - \lambda)$. Optimizing over $\lambda$ in the Chernoff method yields $\nu \cdot h(t/\nu)$ for the upper-tail rate. Bennett says every centered bounded summand with variance proxy $\sigma^2$ is at least as concentrated as a centered Poisson with mean $\sigma^2$; the inequality is sharp at the Poisson.
  2. $h(a) \geq a^2 / (2 + 2a/3)$. This single inequality converts every Bennett bound into a Bernstein bound: substituting the relative-deviation argument $a = M t / v$ turns the denominator on the right into the Bernstein denominator $2v + (2/3)Mt$ (see the numerical check after this list).
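
A quick numerical sanity check of that inequality (a sketch, not a proof; the grid endpoints and tolerance are arbitrary choices):

```python
import numpy as np

def h(a):
    """Poisson rate function h(a) = (1 + a) log(1 + a) - a."""
    return (1 + a) * np.log1p(a) - a

def bernstein_rate(a):
    """Lower bound a^2 / (2 + 2a/3) used to pass from Bennett to Bernstein."""
    return a**2 / (2 + 2 * a / 3)

# h(a) >= a^2 / (2 + 2a/3) should hold everywhere on the grid.
a = np.linspace(0, 50, 100_001)
assert np.all(h(a) >= bernstein_rate(a) - 1e-12)
```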

Formal Setup

Let $X_1, \ldots, X_n$ be independent random variables with $\mathbb{E}[X_i] = 0$, $X_i \leq M$ almost surely (an upper bound on deviations, not on the absolute value), and finite variances. Define

$$v = \sum_{i=1}^n \mathrm{Var}(X_i) = \sum_{i=1}^n \mathbb{E}[X_i^2], \qquad S_n = \sum_{i=1}^n X_i.$$

The standardized argument of the rate function is $a = M t / v$: the size of a single allowed jump $M$ times the deviation $t$, normalized by the total variance $v$.

Theorem

Bennett's Inequality

Statement

For every $t \geq 0$,

$$\Pr\!\left[\sum_{i=1}^n X_i \geq t\right] \leq \exp\!\left(-\frac{v}{M^2}\, h\!\left(\frac{M t}{v}\right)\right),$$

where $h(a) = (1 + a) \log(1 + a) - a$ is the Poisson rate function.

Equivalently, with $\sigma^2 = v / n$ for i.i.d. summands and the sample-mean form $\bar{X}_n = S_n / n$:

$$\Pr\!\left[\bar{X}_n \geq t\right] \leq \exp\!\left(-\frac{n \sigma^2}{M^2}\, h\!\left(\frac{M t}{\sigma^2}\right)\right).$$
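
A direct transcription of the theorem into code (a minimal sketch; the function name and the example numbers are illustrative, not from the source):

```python
import math

def h(a: float) -> float:
    """Poisson rate function h(a) = (1 + a) log(1 + a) - a."""
    return (1 + a) * math.log1p(a) - a

def bennett_bound(t: float, v: float, M: float) -> float:
    """Bennett upper bound on Pr[S_n >= t] for independent centered
    summands with X_i <= M a.s. and total variance v = sum_i Var(X_i)."""
    if t <= 0:
        return 1.0  # any probability is trivially at most 1
    return math.exp(-(v / M**2) * h(M * t / v))

# Example: 1000 summands, each bounded by M = 1 with variance 0.05.
print(bennett_bound(t=50.0, v=1000 * 0.05, M=1.0))  # ~4e-9
```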


Intuition

The factor $v / M^2$ is the effective number of "Poisson units": each unit has variance roughly $M^2$, so $v / M^2$ counts how many independent Poisson-scale fluctuations contribute. The argument $M t / v$ is the relative deviation in those units. The rate function $h$ is the sharp Poisson large-deviations rate; Bennett says no centered bounded variable deviates faster than a centered Poisson with matched variance.

Proof Sketch

Step 1: per-summand MGF lemma. For any centered random variable $X$ with $X \leq M$ a.s. and $\mathbb{E}[X^2] = s^2$,

$$\mathbb{E}[e^{\lambda X}] \leq \exp\!\left(\frac{s^2}{M^2}\bigl(e^{\lambda M} - 1 - \lambda M\bigr)\right) \quad \text{for all } \lambda \geq 0.$$

This is the Bennett MGF lemma. It is proved by writing $e^{\lambda x} - 1 - \lambda x = \lambda^2 x^2 \phi(\lambda x)$ with $\phi(u) = (e^u - 1 - u)/u^2$ increasing in its argument, bounding $\phi(\lambda x) \leq \phi(\lambda M)$ on $\{x \leq M\}$, and taking expectations: since $\mathbb{E}[X] = 0$ and $\mathbb{E}[X^2] = s^2$, this gives $\mathbb{E}[e^{\lambda X}] \leq 1 + \lambda^2 s^2 \phi(\lambda M) \leq \exp\bigl(\lambda^2 s^2 \phi(\lambda M)\bigr)$, which is the stated bound, using $1 + u \leq e^u$ in the last step.
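
A numerical spot check of the lemma on a centered Bernoulli, with $\Pr[X = 1 - p] = p$ and $\Pr[X = -p] = 1 - p$, so that $X \leq M = 1 - p$ and $\mathbb{E}[X^2] = p(1 - p)$ (an illustration of the lemma, not part of the proof):

```python
import numpy as np

p = 0.1
M = 1 - p          # almost-sure upper bound on X
s2 = p * (1 - p)   # E[X^2] for the centered Bernoulli

lam = np.linspace(0.01, 5.0, 500)
mgf = p * np.exp(lam * (1 - p)) + (1 - p) * np.exp(-lam * p)   # E[e^{lam X}]
lemma = np.exp((s2 / M**2) * (np.exp(lam * M) - 1 - lam * M))  # Bennett bound

assert np.all(mgf <= lemma + 1e-12)
```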

Step 2: independence multiplies the lemma. With $X_i$ independent and $v = \sum_i \mathrm{Var}(X_i)$,

$$\mathbb{E}[e^{\lambda S_n}] = \prod_i \mathbb{E}[e^{\lambda X_i}] \leq \exp\!\left(\frac{v}{M^2}\bigl(e^{\lambda M} - 1 - \lambda M\bigr)\right).$$

Step 3: Chernoff method and optimize. $\Pr[S_n \geq t] \leq e^{-\lambda t}\, \mathbb{E}[e^{\lambda S_n}]$. Setting $\lambda^* = \frac{1}{M} \log\bigl(1 + M t / v\bigr)$ minimizes the exponent, which in terms of the standardized variable $a = M t / v$ evaluates to

$$\inf_{\lambda > 0}\Bigl[-\lambda t + \frac{v}{M^2}\bigl(e^{\lambda M} - 1 - \lambda M\bigr)\Bigr] = -\frac{v}{M^2}\, h(a).$$

Exponentiating gives the displayed bound.
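
A grid search confirms both the optimizer and the value of the infimum (a sketch; the numbers $v$, $M$, $t$ are arbitrary examples):

```python
import numpy as np

v, M, t = 50.0, 1.0, 20.0
a = M * t / v

def exponent(lam):
    """Chernoff exponent: -lam * t + (v / M^2)(e^{lam M} - 1 - lam M)."""
    return -lam * t + (v / M**2) * (np.exp(lam * M) - 1 - lam * M)

lam_star = np.log1p(M * t / v) / M          # claimed minimizer
h_a = (1 + a) * np.log1p(a) - a             # h(a)

grid = np.linspace(1e-6, 5 * lam_star, 1_000_001)
assert abs(grid[np.argmin(exponent(grid))] - lam_star) < 1e-3
assert abs(exponent(lam_star) + (v / M**2) * h_a) < 1e-9
```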

Why It Matters

Bennett is the cleanest variance-sensitive bound to derive from the Chernoff method, and every other variance-sensitive scalar inequality on this site is a corollary obtained by simplifying $h$.

Failure Mode

The summands must be bounded above ($X_i \leq M$) almost surely. Without that, the Bennett MGF lemma fails. For sub-exponential variables without an almost-sure upper bound, use the sub-exponential framework and its variance-Orlicz analogue. The bound is also one-sided; a two-sided version costs a factor of two via the union bound, applying the same inequality to $-X_i$ for the lower tail, which requires the $X_i$ to be bounded below as well.

Bernstein from Bennett

The single algebraic inequality

$$h(a) \geq \frac{a^2}{2 + 2a/3} \qquad \text{for } a \geq 0$$

converts Bennett into Bernstein. Substituting $a = M t / v$ on the right-hand side and simplifying gives the familiar Bernstein denominator $2v + (2/3)Mt$.

Corollary

Bernstein's Inequality from Bennett

Statement

Under the Bennett assumptions plus the symmetric bound $|X_i| \leq M$,

$$\Pr[S_n \geq t] \leq \exp\!\left(-\frac{t^2/2}{v + M t/3}\right) \qquad \text{for every } t \geq 0.$$

The two-sided version, obtained by applying the same bound to $-S_n$, is

$$\Pr[|S_n| \geq t] \leq 2 \exp\!\left(-\frac{t^2/2}{v + M t/3}\right).$$


Proof Sketch

Apply the inequality $h(a) \geq a^2 / (2 + 2a/3)$ inside the Bennett bound:

$$\frac{v}{M^2}\, h\!\left(\frac{M t}{v}\right) \geq \frac{v}{M^2} \cdot \frac{(M t / v)^2}{2 + (2 M t)/(3 v)} = \frac{t^2/2}{v + M t/3}.$$

Since this quantity enters the bound with a negative sign, exponentiating gives the Bernstein form.
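
The resulting domination of exponents can be checked numerically over random admissible triples $(v, M, t)$ (a sanity check; the sampling ranges are arbitrary):

```python
import math
import random

def h(a):
    return (1 + a) * math.log1p(a) - a

random.seed(0)
for _ in range(10_000):
    v = random.uniform(0.1, 100.0)
    M = random.uniform(0.1, 10.0)
    t = random.uniform(0.0, 100.0)
    bennett_exp = (v / M**2) * h(M * t / v)
    bernstein_exp = (t**2 / 2) / (v + M * t / 3)
    assert bennett_exp >= bernstein_exp - 1e-9  # Bennett exponent dominates
```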

Why It Matters

Bernstein is the bound most often quoted in learning theory because the two-regime denominator $v + M t / 3$ is easy to invert: small $t$ gives the variance-driven Gaussian regime, and large $t$ gives the bounded-jump sub-exponential regime. Bennett is sharper but harder to invert in closed form.
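
Concretely, inverting Bernstein at a target failure probability $\delta$ means solving a quadratic in $t$ (a sketch; the function name is mine, and the closed form follows from the quadratic formula):

```python
import math

def bernstein_deviation(v: float, M: float, delta: float) -> float:
    """Smallest t with exp(-(t^2/2) / (v + M t / 3)) <= delta, obtained by
    solving the quadratic t^2 / 2 = L (v + M t / 3) with L = log(1/delta)."""
    L = math.log(1 / delta)
    return L * M / 3 + math.sqrt((L * M / 3)**2 + 2 * L * v)

# Sanity check: plugging t back into the Bernstein bound recovers delta.
v, M, delta = 50.0, 1.0, 1e-6
t = bernstein_deviation(v, M, delta)
assert abs(math.exp(-(t**2 / 2) / (v + M * t / 3)) - delta) < 1e-12
```

The closed form makes the two regimes explicit: for small $\log(1/\delta)$ the $\sqrt{2 v \log(1/\delta)}$ term dominates (the Gaussian regime), while for large $\log(1/\delta)$ the term linear in $M$ does (the sub-exponential regime).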

Failure Mode

The simplification $h(a) \geq a^2 / (2 + 2a/3)$ loses tightness for very large $a$. For deviations $t$ that are large relative to $v/M$ (equivalently, $a = M t / v \gg 1$), Bennett gives a noticeably tighter exponent than Bernstein, and using Bernstein there leaves performance on the table.
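
The size of the loss is easy to tabulate (an illustration; for large $a$ the ratio of exponents grows roughly like $(2/3)\log a$):

```python
import math

def h(a):
    return (1 + a) * math.log1p(a) - a

for a in [1, 10, 100, 1000]:
    bernstein = a**2 / (2 + 2 * a / 3)
    # Ratio of Bennett's exponent to Bernstein's at the same relative deviation.
    print(f"a = {a:5d}   ratio = {h(a) / bernstein:.2f}")
# Prints ratios of about 1.03, 1.42, 2.51, 3.96.
```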

Comparison Table

The four scalar bounds form a hierarchy by what they use about the summands.

| Inequality | Variables | Variance information used | Bound form |
| --- | --- | --- | --- |
| Chebyshev | any with finite variance | $\sigma^2$ | $\sigma^2 / (n t^2)$ (polynomial) |
| Hoeffding | bounded $[a, b]$ | none (range only) | $\exp\bigl(-2 n t^2 / (b - a)^2\bigr)$ |
| Bennett | bounded above, zero mean | per-summand variance | $\exp\bigl(-(v/M^2)\, h(M t / v)\bigr)$ |
| Bernstein | bounded, zero mean | aggregate variance $v$ | $\exp\bigl(-t^2 / (2 v + 2 M t / 3)\bigr)$ |

For Bernoulli summands with small mean $\mu$ and many trials, Bennett with $M = 1$ and $v = \mu(1 - \mu) n$ recovers the multiplicative Chernoff bound $\exp(-\mu n\, h(\delta))$ up to factors of $1 - \mu$; the same rate function $h$ appears in both results.
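
That near-recovery can be checked numerically for a small mean (a sketch; the particular $n$, $\mu$, $\delta$ are arbitrary choices):

```python
import math

def h(a):
    return (1 + a) * math.log1p(a) - a

n, mu, delta = 10_000, 0.001, 0.5   # tail Pr[S >= (1 + delta) mu n]
v, M = mu * (1 - mu) * n, 1.0       # M = 1 is a valid a.s. upper bound
t = delta * mu * n

bennett_exp = (v / M**2) * h(M * t / v)  # = mu (1 - mu) n h(delta / (1 - mu))
chernoff_exp = mu * n * h(delta)         # multiplicative Chernoff rate
print(bennett_exp, chernoff_exp)         # agree to about 0.1% at mu = 0.001
```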

Common Confusions

Watch Out

Bennett needs only an upper bound, not a two-sided bound

The one-sided version of Bennett uses $X_i \leq M$, with no lower-bound constraint. Bernstein typically assumes $|X_i| \leq M$, so the conversion from Bennett to Bernstein silently strengthens the assumption to a two-sided bound. For one-sided tails of nonnegative summands, Bennett by itself is the right statement.

Watch Out

The h function is the multiplicative Chernoff rate

The same $h(a) = (1 + a) \log(1 + a) - a$ that appears in Bennett is the exponent of the multiplicative Chernoff bound for $\Pr[S \geq (1 + \delta) \mu]$ on Bernoulli sums, with $a$ replaced by $\delta$. The two results are not separate theorems; they are the same Chernoff identity, specialized to two different MGF inputs.

Watch Out

The variance must be a real upper bound

Bennett's denominator uses the per-summand $\mathbb{E}[X_i^2]$ as a variance proxy. Plugging in a loose upper bound on $\mathrm{Var}(X_i)$ weakens the inequality and can erase the gap between Bennett and Hoeffding. For empirical-variance settings, see the empirical Bernstein refinement on the Bernstein inequality page.

Exercises

Exercise (Core)

Problem

Verify the inequality $h(a) \geq a^2 / (2 + 2a/3)$ used to derive Bernstein from Bennett. Specifically, define $g(a) = h(a) - a^2 / (2 + 2a/3)$ and show $g(a) \geq 0$ for all $a \geq 0$.

Exercise (Advanced)

Problem

Suppose $X_1, \ldots, X_n$ are i.i.d. centered Bernoulli with $\Pr[X_i = 1 - p] = p$ and $\Pr[X_i = -p] = 1 - p$, so each $X_i \in [-p, 1 - p]$ and $\mathrm{Var}(X_i) = p(1 - p)$. Take $M = 1 - p$ as the upper bound and write Bennett's bound for $\Pr[\sum_i X_i \geq t]$. Show that, for $p$ near zero, the Bennett bound is much tighter than Hoeffding's $\exp(-2 t^2 / n)$.

References

Canonical:

  • Bennett, G. (1962). "Probability Inequalities for the Sum of Independent Random Variables." Journal of the American Statistical Association, 57(297), 33-45. The original paper.
  • Boucheron, S., Lugosi, G., & Massart, P. (2013). Concentration Inequalities. Oxford University Press. Theorem 2.9 (Bennett) and Theorem 2.10 (Bernstein), with the explicit reduction $h(a) \geq a^2 / (2 + 2a/3)$ in Section 2.7.
  • Shalev-Shwartz, S., & Ben-David, S. (2014). Understanding Machine Learning. Cambridge University Press. Lemma B.8 (Bennett) and Lemma B.9 (Bernstein) in Appendix B.

Current:

  • Wainwright, M. J. (2019). High-Dimensional Statistics: A Non-Asymptotic Viewpoint. Cambridge University Press. Section 2.1.3 develops Bennett alongside the sub-exponential framework.
  • Vershynin, R. (2018). High-Dimensional Probability. Cambridge University Press. Theorem 2.8.4 (Bernstein) with discussion of the Bennett refinement.
  • Mohri, M., Rostamizadeh, A., & Talwalkar, A. (2018). Foundations of Machine Learning (2nd ed.). MIT Press. Appendix D states Bennett and Bernstein side by side and uses them in PAC-style sample-complexity bounds.
