

Bennett's Inequality

A variance-aware concentration inequality for independent bounded random variables. The exponent uses the function h(a) = (1+a)log(1+a) - a, the same h that controls the multiplicative Chernoff bound for Bernoullis.


Why This Matters

Hoeffding throws away the variance. Bernstein keeps the variance in the exponent, at the cost of some looseness for very rare deviations. Bennett's inequality sits between them: it uses the same variance information as Bernstein but expresses the tail through the rate function

$$h(a) = (1 + a) \log(1 + a) - a, \qquad a \geq 0,$$

which is the sharp large-deviations rate of a centered Poisson. Bennett is the sharpest of the three for sums of independent bounded variables when the variance is small relative to the squared range, and the entire family of exponential bounds (multiplicative Chernoff, Bennett, Bernstein) lives inside the same identity, separated only by which inequality you use to simplify $h$.

The pedagogical point is the unification: once you have the Bennett MGF lemma, you can read off the multiplicative Chernoff bound for Bernoullis and the Bernstein bound for general bounded summands as two corollaries that simplify $h$ in different regimes.

Mental Model

Bennett's exponent factors as $n \sigma^2 / M^2$ times $h$ of a relative deviation. Two anchoring facts:

  1. Poisson saturates $h$. A centered Poisson$(\nu)$ random variable has log-MGF $\nu(e^\lambda - 1) - \nu \lambda = \nu (e^\lambda - 1 - \lambda)$. Optimizing over $\lambda$ in the Chernoff method yields $\nu \cdot h(t/\nu)$ for the upper-tail rate. Bennett says every centered bounded summand with variance proxy $\sigma^2$ is at least as concentrated as a centered Poisson with mean $\sigma^2$; the inequality is sharp at the Poisson.
  2. $h(a) \geq a^2 / (2 + 2a/3)$. This single inequality converts every Bennett bound into a Bernstein bound: substituting the relative-deviation argument $a = M t / v$ turns the denominator on the right into the Bernstein denominator $2v + (2/3)Mt$ (see the numerical check after this list).
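
A quick numerical sanity check of that inequality (a sketch, not a proof; the grid endpoints and tolerance are arbitrary choices):

```python
import numpy as np

def h(a):
    """Poisson rate function h(a) = (1 + a) log(1 + a) - a."""
    return (1 + a) * np.log1p(a) - a

def bernstein_rate(a):
    """Lower bound a^2 / (2 + 2a/3) used to pass from Bennett to Bernstein."""
    return a**2 / (2 + 2 * a / 3)

# h(a) >= a^2 / (2 + 2a/3) should hold everywhere on the grid.
a = np.linspace(0, 50, 100_001)
assert np.all(h(a) >= bernstein_rate(a) - 1e-12)
```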

Formal Setup

Let $X_1, \ldots, X_n$ be independent random variables with $\mathbb{E}[X_i] = 0$, $X_i \leq M$ almost surely (an upper bound on deviations, not on the absolute value), and finite variances. Define

$$v = \sum_{i=1}^n \mathrm{Var}(X_i) = \sum_{i=1}^n \mathbb{E}[X_i^2], \qquad S_n = \sum_{i=1}^n X_i.$$

The standardized argument of the rate function is $a = M t / v$: the size of a single allowed jump $M$ times the deviation $t$, normalized by the total variance $v$.

Theorem

Bennett's Inequality

Statement

For every $t \geq 0$,

$$\Pr\!\left[\sum_{i=1}^n X_i \geq t\right] \leq \exp\!\left(-\frac{v}{M^2}\, h\!\left(\frac{M t}{v}\right)\right),$$

where $h(a) = (1 + a) \log(1 + a) - a$ is the Poisson rate function.

Equivalently, with $\sigma^2 = v / n$ for i.i.d. summands and the sample-mean form $\bar{X}_n = S_n / n$:

$$\Pr\!\left[\bar{X}_n \geq t\right] \leq \exp\!\left(-\frac{n \sigma^2}{M^2}\, h\!\left(\frac{M t}{\sigma^2}\right)\right).$$
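
A direct transcription of the theorem into code (a minimal sketch; the function name and the example numbers are illustrative, not from the source):

```python
import math

def h(a: float) -> float:
    """Poisson rate function h(a) = (1 + a) log(1 + a) - a."""
    return (1 + a) * math.log1p(a) - a

def bennett_bound(t: float, v: float, M: float) -> float:
    """Bennett upper bound on Pr[S_n >= t] for independent centered
    summands with X_i <= M a.s. and total variance v = sum_i Var(X_i)."""
    if t <= 0:
        return 1.0  # any probability is trivially at most 1
    return math.exp(-(v / M**2) * h(M * t / v))

# Example: 1000 summands, each bounded by M = 1 with variance 0.05.
print(bennett_bound(t=50.0, v=1000 * 0.05, M=1.0))  # ~4e-9
```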


Intuition

The factor $v / M^2$ is the effective number of "Poisson units": each unit has variance roughly $M^2$, so $v / M^2$ counts how many independent Poisson-scale fluctuations contribute. The argument $M t / v$ is the relative deviation in those units. The rate function $h$ is the sharp Poisson large-deviations rate; Bennett says no centered bounded variable deviates faster than a centered Poisson with matched variance.

Proof Sketch

Step 1: per-summand MGF lemma. For any centered random variable $X$ with $X \leq M$ a.s. and $\mathbb{E}[X^2] = s^2$,

$$\mathbb{E}[e^{\lambda X}] \leq \exp\!\left(\frac{s^2}{M^2}\bigl(e^{\lambda M} - 1 - \lambda M\bigr)\right) \quad \text{for all } \lambda \geq 0.$$

This is the Bennett MGF lemma. It is proved by writing $e^{\lambda x} - 1 - \lambda x = \lambda^2 x^2 \phi(\lambda x)$ with $\phi(u) = (e^u - 1 - u)/u^2$ increasing in its argument, bounding $\phi(\lambda x) \leq \phi(\lambda M)$ on $\{x \leq M\}$, and taking expectations: since $\mathbb{E}[X] = 0$ and $\mathbb{E}[X^2] = s^2$, this gives $\mathbb{E}[e^{\lambda X}] \leq 1 + \lambda^2 s^2 \phi(\lambda M) \leq \exp\bigl(\lambda^2 s^2 \phi(\lambda M)\bigr)$, which is the stated bound, using $1 + u \leq e^u$ in the last step.
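
A numerical spot check of the lemma on a centered Bernoulli, with $\Pr[X = 1 - p] = p$ and $\Pr[X = -p] = 1 - p$, so that $X \leq M = 1 - p$ and $\mathbb{E}[X^2] = p(1 - p)$ (an illustration of the lemma, not part of the proof):

```python
import numpy as np

p = 0.1
M = 1 - p          # almost-sure upper bound on X
s2 = p * (1 - p)   # E[X^2] for the centered Bernoulli

lam = np.linspace(0.01, 5.0, 500)
mgf = p * np.exp(lam * (1 - p)) + (1 - p) * np.exp(-lam * p)   # E[e^{lam X}]
lemma = np.exp((s2 / M**2) * (np.exp(lam * M) - 1 - lam * M))  # Bennett bound

assert np.all(mgf <= lemma + 1e-12)
```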

Step 2: independence multiplies the lemma. With $X_i$ independent and $v = \sum_i \mathrm{Var}(X_i)$,

$$\mathbb{E}[e^{\lambda S_n}] = \prod_i \mathbb{E}[e^{\lambda X_i}] \leq \exp\!\left(\frac{v}{M^2}\bigl(e^{\lambda M} - 1 - \lambda M\bigr)\right).$$

Step 3: Chernoff method and optimize. $\Pr[S_n \geq t] \leq e^{-\lambda t}\, \mathbb{E}[e^{\lambda S_n}]$. Setting $\lambda^* = \frac{1}{M} \log\bigl(1 + M t / v\bigr)$ minimizes the exponent, which in terms of the standardized variable $a = M t / v$ evaluates to

$$\inf_{\lambda > 0}\Bigl[-\lambda t + \frac{v}{M^2}\bigl(e^{\lambda M} - 1 - \lambda M\bigr)\Bigr] = -\frac{v}{M^2}\, h(a).$$

Exponentiating gives the displayed bound.
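
A grid search confirms both the optimizer and the value of the infimum (a sketch; the numbers $v$, $M$, $t$ are arbitrary examples):

```python
import numpy as np

v, M, t = 50.0, 1.0, 20.0
a = M * t / v

def exponent(lam):
    """Chernoff exponent: -lam * t + (v / M^2)(e^{lam M} - 1 - lam M)."""
    return -lam * t + (v / M**2) * (np.exp(lam * M) - 1 - lam * M)

lam_star = np.log1p(M * t / v) / M          # claimed minimizer
h_a = (1 + a) * np.log1p(a) - a             # h(a)

grid = np.linspace(1e-6, 5 * lam_star, 1_000_001)
assert abs(grid[np.argmin(exponent(grid))] - lam_star) < 1e-3
assert abs(exponent(lam_star) + (v / M**2) * h_a) < 1e-9
```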

Why It Matters

Bennett is the cleanest variance-sensitive bound to derive from the Chernoff method, and every other variance-sensitive scalar inequality on this site is a corollary obtained by simplifying $h$.

Failure Mode

The summands must be bounded above ($X_i \leq M$) almost surely. Without that, the Bennett MGF lemma fails. For sub-exponential variables without an almost-sure upper bound, use the sub-exponential framework and its variance-Orlicz analogue. The bound is also one-sided; a two-sided version costs a factor of two via the union bound, applying the same inequality to $-X_i$ for the lower tail, which requires the $X_i$ to be bounded below as well.

Bernstein from Bennett

The single algebraic inequality

$$h(a) \geq \frac{a^2}{2 + 2a/3} \qquad \text{for } a \geq 0$$

converts Bennett into Bernstein. Substituting $a = M t / v$ on the right-hand side and simplifying gives the familiar Bernstein denominator $2v + (2/3)Mt$.

Corollary

Bernstein's Inequality from Bennett

Statement

Under the Bennett assumptions plus the symmetric bound $|X_i| \leq M$,

$$\Pr[S_n \geq t] \leq \exp\!\left(-\frac{t^2/2}{v + M t/3}\right) \qquad \text{for every } t \geq 0.$$

The two-sided version, obtained by applying the same bound to $-S_n$, is

$$\Pr[|S_n| \geq t] \leq 2 \exp\!\left(-\frac{t^2/2}{v + M t/3}\right).$$


Proof Sketch

Apply the inequality $h(a) \geq a^2 / (2 + 2a/3)$ inside the Bennett bound:

$$\frac{v}{M^2}\, h\!\left(\frac{M t}{v}\right) \geq \frac{v}{M^2} \cdot \frac{(M t / v)^2}{2 + (2 M t)/(3 v)} = \frac{t^2/2}{v + M t/3}.$$

Since this quantity enters the bound with a negative sign, exponentiating gives the Bernstein form.
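
The resulting domination of exponents can be checked numerically over random admissible triples $(v, M, t)$ (a sanity check; the sampling ranges are arbitrary):

```python
import math
import random

def h(a):
    return (1 + a) * math.log1p(a) - a

random.seed(0)
for _ in range(10_000):
    v = random.uniform(0.1, 100.0)
    M = random.uniform(0.1, 10.0)
    t = random.uniform(0.0, 100.0)
    bennett_exp = (v / M**2) * h(M * t / v)
    bernstein_exp = (t**2 / 2) / (v + M * t / 3)
    assert bennett_exp >= bernstein_exp - 1e-9  # Bennett exponent dominates
```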

Why It Matters

Bernstein is the bound most often quoted in learning theory because the two-regime denominator $v + M t / 3$ is easy to invert: small $t$ gives the variance-driven Gaussian regime, and large $t$ gives the bounded-jump sub-exponential regime. Bennett is sharper but harder to invert in closed form.
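
Concretely, inverting Bernstein at a target failure probability $\delta$ means solving a quadratic in $t$ (a sketch; the function name is mine, and the closed form follows from the quadratic formula):

```python
import math

def bernstein_deviation(v: float, M: float, delta: float) -> float:
    """Smallest t with exp(-(t^2/2) / (v + M t / 3)) <= delta, obtained by
    solving the quadratic t^2 / 2 = L (v + M t / 3) with L = log(1/delta)."""
    L = math.log(1 / delta)
    return L * M / 3 + math.sqrt((L * M / 3)**2 + 2 * L * v)

# Sanity check: plugging t back into the Bernstein bound recovers delta.
v, M, delta = 50.0, 1.0, 1e-6
t = bernstein_deviation(v, M, delta)
assert abs(math.exp(-(t**2 / 2) / (v + M * t / 3)) - delta) < 1e-12
```

The closed form makes the two regimes explicit: for small $\log(1/\delta)$ the $\sqrt{2 v \log(1/\delta)}$ term dominates (the Gaussian regime), while for large $\log(1/\delta)$ the term linear in $M$ does (the sub-exponential regime).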

Failure Mode

The simplification $h(a) \geq a^2 / (2 + 2a/3)$ loses tightness for very large $a$. For deviations $t$ that are large relative to $v/M$ (equivalently, $a = M t / v \gg 1$), Bennett gives a noticeably tighter exponent than Bernstein, and using Bernstein there leaves performance on the table.
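
The size of the loss is easy to tabulate (an illustration; for large $a$ the ratio of exponents grows roughly like $(2/3)\log a$):

```python
import math

def h(a):
    return (1 + a) * math.log1p(a) - a

for a in [1, 10, 100, 1000]:
    bernstein = a**2 / (2 + 2 * a / 3)
    # Ratio of Bennett's exponent to Bernstein's at the same relative deviation.
    print(f"a = {a:5d}   ratio = {h(a) / bernstein:.2f}")
# Prints ratios of about 1.03, 1.42, 2.51, 3.96.
```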

Comparison Table

The four scalar bounds form a hierarchy by what they use about the summands.

| Inequality | Variables | Variance information used | Bound form |
| --- | --- | --- | --- |
| Chebyshev | any with finite variance | $\sigma^2$ | $\sigma^2 / (n t^2)$ (polynomial) |
| Hoeffding | bounded $[a, b]$ | none (range only) | $\exp\bigl(-2 n t^2 / (b - a)^2\bigr)$ |
| Bennett | bounded above, zero mean | per-summand variance | $\exp\bigl(-(v/M^2)\, h(M t / v)\bigr)$ |
| Bernstein | bounded, zero mean | aggregate variance $v$ | $\exp\bigl(-t^2 / (2 v + 2 M t / 3)\bigr)$ |

For Bernoulli summands with small mean $\mu$ and many trials, Bennett with $M = 1$ and $v = \mu(1 - \mu) n$ recovers the multiplicative Chernoff bound $\exp(-\mu n\, h(\delta))$ up to factors of $1 - \mu$; the same rate function $h$ appears in both results.
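
That near-recovery can be checked numerically for a small mean (a sketch; the particular $n$, $\mu$, $\delta$ are arbitrary choices):

```python
import math

def h(a):
    return (1 + a) * math.log1p(a) - a

n, mu, delta = 10_000, 0.001, 0.5   # tail Pr[S >= (1 + delta) mu n]
v, M = mu * (1 - mu) * n, 1.0       # M = 1 is a valid a.s. upper bound
t = delta * mu * n

bennett_exp = (v / M**2) * h(M * t / v)  # = mu (1 - mu) n h(delta / (1 - mu))
chernoff_exp = mu * n * h(delta)         # multiplicative Chernoff rate
print(bennett_exp, chernoff_exp)         # agree to about 0.1% at mu = 0.001
```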

Common Confusions

Watch Out

Bennett needs only an upper bound, not a two-sided bound

The one-sided version of Bennett uses $X_i \leq M$, with no lower-bound constraint. Bernstein typically assumes $|X_i| \leq M$, so the conversion from Bennett to Bernstein silently strengthens the assumption to a two-sided bound. For one-sided tails of nonnegative summands, Bennett by itself is the right statement.

Watch Out

The h function is the multiplicative Chernoff rate

The same $h(a) = (1 + a) \log(1 + a) - a$ that appears in Bennett is the exponent of the multiplicative Chernoff bound for $\Pr[S \geq (1 + \delta) \mu]$ on Bernoulli sums, with $a$ replaced by $\delta$. The two results are not separate theorems; they are the same Chernoff identity, specialized to two different MGF inputs.

Watch Out

The variance must be a real upper bound

Bennett's denominator uses the per-summand $\mathbb{E}[X_i^2]$ as a variance proxy. Plugging in a loose upper bound on $\mathrm{Var}(X_i)$ weakens the inequality and can erase the gap between Bennett and Hoeffding. For empirical-variance settings, see the empirical Bernstein refinement on the Bernstein inequality page.

Exercises

Exercise (Core)

Problem

Verify the inequality $h(a) \geq a^2 / (2 + 2a/3)$ used to derive Bernstein from Bennett. Specifically, define $g(a) = h(a) - a^2 / (2 + 2a/3)$ and show $g(a) \geq 0$ for all $a \geq 0$.

Exercise (Advanced)

Problem

Suppose $X_1, \ldots, X_n$ are i.i.d. centered Bernoulli with $\Pr[X_i = 1 - p] = p$ and $\Pr[X_i = -p] = 1 - p$, so each $X_i \in [-p, 1 - p]$ and $\mathrm{Var}(X_i) = p(1 - p)$. Take $M = 1 - p$ as the upper bound and write Bennett's bound for $\Pr[\sum_i X_i \geq t]$. Show that, for $p$ near zero, the Bennett bound is much tighter than Hoeffding's $\exp(-2 t^2 / n)$.

References

Canonical:

  • Bennett, G. (1962). "Probability Inequalities for the Sum of Independent Random Variables." Journal of the American Statistical Association, 57(297), 33-45. The original paper.
  • Boucheron, S., Lugosi, G., & Massart, P. (2013). Concentration Inequalities. Oxford University Press. Theorem 2.9 (Bennett) and Theorem 2.10 (Bernstein), with the explicit reduction $h(a) \geq a^2 / (2 + 2a/3)$ in Section 2.7.
  • Shalev-Shwartz, S., & Ben-David, S. (2014). Understanding Machine Learning. Cambridge University Press. Lemma B.8 (Bennett) and Lemma B.9 (Bernstein) in Appendix B.

Current:

  • Wainwright, M. J. (2019). High-Dimensional Statistics: A Non-Asymptotic Viewpoint. Cambridge University Press. Section 2.1.3 develops Bennett alongside the sub-exponential framework.
  • Vershynin, R. (2018). High-Dimensional Probability. Cambridge University Press. Theorem 2.8.4 (Bernstein) with discussion of the Bennett refinement.
  • Mohri, M., Rostamizadeh, A., & Talwalkar, A. (2018). Foundations of Machine Learning (2nd ed.). MIT Press. Appendix D states Bennett and Bernstein side by side and uses them in PAC-style sample-complexity bounds.
