Bennett's Inequality
A variance-aware concentration inequality for independent bounded random variables. The exponent uses the function $h(a) = (1+a)\log(1+a) - a$, the same $h$ that controls the multiplicative Chernoff bound for Bernoullis.
Why This Matters
Hoeffding throws away the variance. Bernstein keeps a quadratic-in-variance exponent at the cost of some looseness for very rare deviations. Bennett's inequality sits between them: it uses the same variance information as Bernstein but expresses the tail through the rate function
$$h(a) = (1+a)\log(1+a) - a,$$
which is the exact log-MGF rate of a centered Poisson. Bennett is the sharpest of the three for sums of independent bounded variables when the variance is small relative to the squared range, and the entire family of exponential bounds (multiplicative Chernoff, Bennett, Bernstein) lives inside the same identity, separated only by which inequality you use to simplify $h$.
The pedagogical point is the unification: once you have the Bennett MGF lemma, you can read off the multiplicative Chernoff bound for Bernoullis and the Bernstein bound for general bounded summands as two corollaries that simplify in different regimes.
Mental Model
Bennett's exponent factors as $v/M^2$ times $h$ of a relative deviation $Mt/v$. Two anchoring facts, both checked numerically in the sketch after this list:
- Poisson saturates $h$. A centered Poisson random variable $X - \lambda$, with $X \sim \mathrm{Poisson}(\lambda)$, has log-MGF $\lambda(e^s - 1 - s)$. Optimizing over $s$ in the Chernoff method yields $\lambda\, h(t/\lambda)$ for the upper-tail rate. Bennett says every centered bounded summand with variance proxy $\sigma^2$ is at least as concentrated as $M$ times a centered Poisson with mean $\sigma^2/M^2$; the inequality is sharp at the Poisson.
- $h(a) \ge \dfrac{a^2}{2(1 + a/3)}$. This single inequality converts every Bennett bound into a Bernstein bound. The right-hand side is the same denominator that appears in Bernstein, with the factor of $M$ folded into the relative-deviation argument $a = Mt/v$.
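Both facts are easy to verify numerically. A minimal Python sketch (all names are ours; the crude grid search over $s$ stands in for the exact Chernoff optimization):

```python
import math

def h(a):
    # Poisson rate function: h(a) = (1 + a) log(1 + a) - a
    return (1.0 + a) * math.log1p(a) - a

def bernstein_quadratic(a):
    # Right-hand side of the relaxation h(a) >= a^2 / (2 (1 + a/3))
    return a * a / (2.0 * (1.0 + a / 3.0))

# Fact 2: the relaxation holds everywhere, with equality only at a = 0.
for a in [0.0, 0.1, 1.0, 3.0, 10.0, 100.0]:
    assert h(a) >= bernstein_quadratic(a) - 1e-12
    print(f"a={a:7.1f}  h(a)={h(a):10.4f}  quadratic={bernstein_quadratic(a):10.4f}")

# Fact 1: for X ~ Poisson(lam), sup_s [s*t - lam*(e^s - 1 - s)] = lam * h(t/lam).
lam, t = 5.0, 4.0
grid = [i * 1e-4 for i in range(1, 200_000)]  # crude stand-in for exact optimization
rate = max(s * t - lam * (math.exp(s) - 1.0 - s) for s in grid)
print(rate, lam * h(t / lam))  # the two numbers agree to high precision
```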
Formal Setup
Let $X_1, \dots, X_n$ be independent random variables with $\mathbb{E}[X_i] = 0$, $X_i \le M$ almost surely (an upper bound on deviations, not on the absolute value), and finite variances. Define
$$S_n = \sum_{i=1}^n X_i, \qquad v = \sum_{i=1}^n \mathbb{E}[X_i^2].$$
The standardized argument of the rate function is $a = Mt/v$: the size of a single allowed jump $M$ times the deviation $t$, normalized by the total variance $v$.
Bennett's Inequality
Statement
For every $t > 0$,
$$\Pr[S_n \ge t] \le \exp\!\left(-\frac{v}{M^2}\, h\!\left(\frac{Mt}{v}\right)\right),$$
where $h(a) = (1+a)\log(1+a) - a$ is the Poisson rate function.
Equivalently, with $v = n\sigma^2$ for i.i.d. summands and the sample mean form $\bar{X}_n = S_n/n$:
$$\Pr[\bar{X}_n \ge \varepsilon] \le \exp\!\left(-\frac{n\sigma^2}{M^2}\, h\!\left(\frac{M\varepsilon}{\sigma^2}\right)\right).$$
Exact statement
LaTeX source for copy/export
\Pr\!\left[S_n \geq t\right] \leq \exp\!\left(-\frac{v}{M^2}\, h\!\left(\frac{M t}{v}\right)\right)

Intuition
The factor $v/M^2$ is the effective number of "Poisson units": each unit has variance roughly $M^2$, so $v/M^2$ counts how many independent Poisson-scale fluctuations contribute. The argument $Mt/v$ is the relative deviation in those units. The rate function $h$ is the sharp Poisson large-deviations rate; Bennett says no centered bounded variable deviates faster than a centered Poisson with matched variance.
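The statement translates directly into a few lines of code. A minimal sketch (the function names and example numbers are ours, not from any library):

```python
import math

def bennett_tail_bound(v, M, t):
    """Bennett's upper bound on P[S_n >= t] for independent centered
    summands with X_i <= M a.s. and total variance v = sum_i E[X_i^2]."""
    a = M * t / v                      # standardized deviation a = Mt/v
    h = (1.0 + a) * math.log1p(a) - a  # Poisson rate function h(a)
    return math.exp(-(v / (M * M)) * h)

def bennett_mean_bound(n, sigma2, M, eps):
    """Sample-mean form: P[mean >= eps] with v = n*sigma2 and t = n*eps."""
    return bennett_tail_bound(n * sigma2, M, n * eps)

# Example: 1000 i.i.d. summands with variance 0.01 and jumps bounded by 1.
print(bennett_mean_bound(n=1000, sigma2=0.01, M=1.0, eps=0.05))  # ~1e-25
```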
Proof Sketch
Step 1: per-summand MGF lemma. For any centered random variable $X$ with $X \le M$ a.s. and $\mathbb{E}[X^2] = \sigma^2$,
$$\mathbb{E}\!\left[e^{sX}\right] \le \exp\!\left(\frac{\sigma^2}{M^2}\left(e^{sM} - 1 - sM\right)\right), \qquad s > 0.$$
This is the Bennett MGF lemma. It is proved by writing $e^{sx} = 1 + sx + x^2\,\phi(sx)$ with $\phi(u) = (e^u - 1 - u)/u^2$ increasing in its argument, then bounding $\phi(sx) \le \phi(sM)$ on $\{x \le M\}$ and taking expectations using $\mathbb{E}[X] = 0$ and $1 + y \le e^y$.
Step 2: independence multiplies the lemma. With $X_1, \dots, X_n$ independent and $v = \sum_{i=1}^n \mathbb{E}[X_i^2]$,
$$\mathbb{E}\!\left[e^{sS_n}\right] = \prod_{i=1}^n \mathbb{E}\!\left[e^{sX_i}\right] \le \exp\!\left(\frac{v}{M^2}\left(e^{sM} - 1 - sM\right)\right).$$
Step 3: Chernoff method and optimize. For every $s > 0$, $\log \Pr[S_n \ge t] \le \frac{v}{M^2}(e^{sM} - 1 - sM) - st$. Setting $s = \frac{1}{M}\log(1 + Mt/v)$ minimizes the exponent and substitutes the standardized variable $a = Mt/v$:
$$\frac{v}{M^2}\left(e^{sM} - 1 - sM\right) - st \;=\; -\frac{v}{M^2}\, h\!\left(\frac{Mt}{v}\right).$$
Exponentiating gives the displayed bound.
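Step 1 is the only analytic input, so it is the natural thing to sanity-check. A Monte Carlo sketch, assuming a centered Bernoulli test variable (our choice; any centered variable bounded above works):

```python
import numpy as np

rng = np.random.default_rng(0)

# Centered, bounded-above test variable: X = B - p with B ~ Bernoulli(p),
# so X <= M with M = 1 - p, E[X] = 0, and E[X^2] = p * (1 - p).
p, n_samples = 0.1, 2_000_000
X = (rng.random(n_samples) < p).astype(float) - p
M, sigma2 = 1.0 - p, p * (1.0 - p)

for s in [0.5, 1.0, 2.0]:
    empirical_mgf = np.exp(s * X).mean()
    lemma_bound = np.exp(sigma2 / M**2 * (np.exp(s * M) - 1.0 - s * M))
    print(f"s={s}: E[exp(sX)] ~ {empirical_mgf:.6f} <= {lemma_bound:.6f}")
```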
Why It Matters
Bennett is the cleanest variance-sensitive bound to derive from the Chernoff method, and every other variance-sensitive scalar inequality on this site is a corollary obtained by simplifying $h$.
Failure Mode
The summands must be bounded above ($X_i \le M$) almost surely. Without that, the Bennett MGF lemma fails. For sub-exponential variables without an almost-sure upper bound, use the sub-exponential framework and its variance-Orlicz analogue. The bound is also one-sided; a two-sided version costs a factor of two by union bound, applied to $-X_i$ on the mirror tail when those variables are also bounded below.
Bernstein from Bennett
The single algebraic inequality
$$h(a) \;\ge\; \frac{a^2}{2(1 + a/3)}, \qquad a \ge 0,$$
converts Bennett into Bernstein. Substituting $a = Mt/v$ on the right-hand side and simplifying gives the familiar Bernstein denominator $v + Mt/3$.
Bernstein's Inequality from Bennett
Statement
Under the Bennett assumptions plus the symmetric bound $|X_i| \le M$,
$$\Pr[S_n \ge t] \le \exp\!\left(-\frac{t^2/2}{v + Mt/3}\right).$$
The two-sided version, applying the same bound to $-X_i$:
$$\Pr[|S_n| \ge t] \le 2\exp\!\left(-\frac{t^2/2}{v + Mt/3}\right).$$
Exact statement
LaTeX source for copy/export
\Pr[S_n \geq t] \leq \exp\!\left(-\frac{t^2 / 2}{v + M t / 3}\right)

Proof Sketch
Apply the inequality $h(a) \ge \frac{a^2}{2(1 + a/3)}$ with $a = Mt/v$ inside the Bennett bound:
$$\frac{v}{M^2}\, h\!\left(\frac{Mt}{v}\right) \;\ge\; \frac{v}{M^2} \cdot \frac{(Mt/v)^2}{2\left(1 + Mt/(3v)\right)} \;=\; \frac{t^2/2}{v + Mt/3}.$$
Exponentiating with a sign flip gives the Bernstein form.
Why It Matters
Bernstein is the bound most often quoted in learning theory because the two-regime denominator $v + Mt/3$ is easy to invert: small $t$ gives the variance-driven Gaussian regime, and large $t$ gives the bounded-jump sub-exponential regime. Bennett is sharper but harder to invert in closed form.
Failure Mode
The simplification loses tightness for very large $a = Mt/v$. For deviations $t$ that are many standard deviations out, i.e. large relative to $v/M$, Bennett gives a noticeably tighter exponent than Bernstein, and using Bernstein there leaves performance on the table.
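To see how much is left on the table, compare the two exponents directly. This sketch fixes $v = M = 1$ for illustration, so $a = t$:

```python
import math

def h(a):
    return (1.0 + a) * math.log1p(a) - a

v, M = 1.0, 1.0
for t in [1.0, 5.0, 20.0, 100.0]:
    a = M * t / v
    bennett_exp = (v / M**2) * h(a)                    # Bennett: (v/M^2) h(Mt/v)
    bernstein_exp = (t * t / 2.0) / (v + M * t / 3.0)  # Bernstein: (t^2/2)/(v + Mt/3)
    print(f"t={t:6.1f}  Bennett={bennett_exp:9.2f}  Bernstein={bernstein_exp:9.2f}")
```

At $t = 100$ the Bennett exponent is roughly two and a half times the Bernstein exponent, which translates into a dramatically smaller tail bound.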
Comparison Table
The four scalar bounds form a hierarchy by what they use about the summands.
| Inequality | Variables | Variance information used | Bound form |
|---|---|---|---|
| Chebyshev | any with finite variance | aggregate variance $v$ | $v/t^2$ (polynomial) |
| Hoeffding | bounded | none (range only) | $\exp\!\left(-2t^2 / \sum_i (b_i - a_i)^2\right)$ |
| Bennett | bounded above, zero mean | per-summand variances $\mathbb{E}[X_i^2]$ | $\exp\!\left(-\frac{v}{M^2}\, h\!\left(\frac{Mt}{v}\right)\right)$ |
| Bernstein | bounded, zero mean | aggregate variance $v$ | $\exp\!\left(-\frac{t^2/2}{v + Mt/3}\right)$ |
For Bernoulli summands with small mean $p$ and many trials, Bennett with $M = 1$ and $v = np(1-p)$ recovers the multiplicative Chernoff bound $\Pr[\mathrm{Bin}(n,p) \ge (1+\delta)np] \le e^{-np\, h(\delta)}$; this is the shared rate function $h$ appearing in both results.
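The hierarchy is easiest to feel with numbers. A sketch on an assumed Bernoulli example ($n = 10{,}000$, $p = 0.01$, $t = 50$; the parameters are our choice):

```python
import math

def h(a):
    return (1.0 + a) * math.log1p(a) - a

n, p, t = 10_000, 0.01, 50.0
v, M = n * p * (1.0 - p), 1.0           # total variance and per-summand bound

chebyshev = v / t**2                     # Var(S_n) / t^2
hoeffding = math.exp(-2.0 * t**2 / n)    # range b_i - a_i = 1 per summand
bennett   = math.exp(-(v / M**2) * h(M * t / v))
bernstein = math.exp(-(t**2 / 2.0) / (v + M * t / 3.0))

print(f"Chebyshev {chebyshev:.2e}  Hoeffding {hoeffding:.2e}  "
      f"Bennett {bennett:.2e}  Bernstein {bernstein:.2e}")
# Hoeffding ~ 0.61 is nearly vacuous here; Bennett and Bernstein are ~1e-5.
```

This is the small-variance regime from "Why This Matters": $v = 99$ is tiny relative to the squared range $n \cdot 1^2$, so the variance-blind Hoeffding bound barely says anything.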
Common Confusions
Bennett needs only an upper bound, not a two-sided bound
The one-sided version of Bennett uses $X_i \le M$, with no lower-bound constraint. Bernstein typically assumes $|X_i| \le M$, so the conversion from Bennett to Bernstein silently strengthens the assumption to a two-sided bound. For one-sided tails of nonnegative summands, Bennett by itself is the right statement.
The h function is the multiplicative Chernoff rate
The same $h$ that appears in Bennett is the exponent of the multiplicative Chernoff bound $\Pr[\mathrm{Bin}(n,p) \ge (1+\delta)\mu] \le e^{-\mu\, h(\delta)}$ on Bernoulli sums, with $v/M^2$ replaced by the mean $\mu = np$. The two results are not separate theorems; they are the same Chernoff identity, specialized to two different MGF inputs.
The variance must be a real upper bound
Bennett's denominator uses the per-summand variances $\mathbb{E}[X_i^2]$ as a variance proxy. Plugging in a loose upper bound on $v$ weakens the inequality and can erase the gap between Bennett and Hoeffding. For empirical-variance settings, see the empirical Bernstein refinement on the Bernstein inequality page.
Exercises
Problem
Verify the inequality used to derive Bernstein from Bennett. Specifically, define $g(a) = h(a) - \frac{a^2}{2(1 + a/3)}$ and show $g(a) \ge 0$ for all $a \ge 0$.
Problem
Suppose $X_1, \dots, X_n$ are i.i.d. centered Bernoulli with $\Pr[X_i = 1 - p] = p$ and $\Pr[X_i = -p] = 1 - p$, so each $\mathbb{E}[X_i] = 0$ and $\mathbb{E}[X_i^2] = p(1-p)$. Take $M = 1$ as the upper bound and write Bennett's bound for $\Pr[S_n \ge t]$. Show that, for $p$ near zero, the Bennett bound is much tighter than Hoeffding's $\exp(-2t^2/n)$.
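A numerical comparison (no substitute for the requested derivation; the values of $n$ and $t$ are arbitrary choices) suggests what to expect as $p \to 0$:

```python
import math

def h(a):
    return (1.0 + a) * math.log1p(a) - a

n, t = 1000, 30.0
for p in [0.1, 0.01, 0.001]:
    v = n * p * (1.0 - p)                  # total variance, with M = 1
    bennett = math.exp(-v * h(t / v))
    hoeffding = math.exp(-2.0 * t**2 / n)  # variance-blind
    print(f"p={p:6.3f}  Bennett {bennett:.2e}  Hoeffding {hoeffding:.2e}")
```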
References
Canonical:
- Bennett, G. (1962). "Probability Inequalities for the Sum of Independent Random Variables." Journal of the American Statistical Association, 57(297), 33-45. The original paper.
- Boucheron, S., Lugosi, G., & Massart, P. (2013). Concentration Inequalities. Oxford University Press. Theorem 2.9 (Bennett) and Theorem 2.10 (Bernstein), with the explicit reduction in Section 2.7.
- Shalev-Shwartz, S., & Ben-David, S. (2014). Understanding Machine Learning. Cambridge University Press. Lemma B.8 (Bennett) and Lemma B.9 (Bernstein) in Appendix B.
Current:
- Wainwright, M. J. (2019). High-Dimensional Statistics: A Non-Asymptotic Viewpoint. Cambridge University Press. Section 2.1.3 develops Bennett alongside the sub-exponential framework.
- Vershynin, R. (2018). High-Dimensional Probability. Cambridge University Press. Theorem 2.8.4 (Bernstein) with discussion of the Bennett refinement.
- Mohri, M., Rostamizadeh, A., & Talwalkar, A. (2018). Foundations of Machine Learning (2nd ed.). MIT Press. Appendix D states Bennett and Bernstein side by side and uses them in PAC-style sample-complexity bounds.
Next Topics
- Bernstein's inequality: the standard sample-mean form derived from the relaxation $h(a) \ge \frac{a^2}{2(1 + a/3)}$
- Sub-exponential random variables: the distributional class behind Bennett-Bernstein-type tails without an almost-sure bound
- Matrix concentration: matrix Bennett and matrix Bernstein for sums of independent random matrices
- Hoeffding's lemma: the variance-blind cousin that uses range only
Last reviewed: May 8, 2026
Canonical graph
Required before and derived from this topic
These links come from prerequisite edges in the curriculum graph. Editorial suggestions are shown here only when the target page also cites this page as a prerequisite.
Required prerequisites
- Expectation, Variance, Covariance, and Moments — layer 0A · tier 1
- Chernoff Bounds — layer 1 · tier 1
- Concentration Inequalities — layer 1 · tier 1
- Moment Generating Functions — layer 0A · tier 2
Derived topics
No published topic currently declares this as a prerequisite.