What Each Measures
Both Hoeffding and Bernstein bound the tail probability for the sample mean of independent random variables. They answer the same question: how far can the empirical average stray from the true mean? But they use different information and give different answers.
Hoeffding uses only the bounded range $[a_i, b_i]$ of each variable.
Bernstein uses both the bounded range and the variance $\sigma^2$.
Side-by-Side Statement
Hoeffding's Inequality
If $X_1, \ldots, X_n$ are independent with $X_i \in [a_i, b_i]$, then:

$$\Pr\left( \left| \sum_{i=1}^n (X_i - \mathbb{E}[X_i]) \right| \ge t \right) \le 2 \exp\left( -\frac{2t^2}{\sum_{i=1}^n (b_i - a_i)^2} \right)$$

For i.i.d. variables in $[a, b]$ with sample mean $\bar{X}_n$ and mean $\mu$:

$$\Pr\left( |\bar{X}_n - \mu| \ge t \right) \le 2 \exp\left( -\frac{2nt^2}{(b - a)^2} \right)$$
Bernstein's Inequality
If $X_1, \ldots, X_n$ are independent with $|X_i - \mathbb{E}[X_i]| \le b$ and variance $\mathrm{Var}(X_i) = \sigma_i^2$, then:

$$\Pr\left( \left| \sum_{i=1}^n (X_i - \mathbb{E}[X_i]) \right| \ge t \right) \le 2 \exp\left( -\frac{t^2/2}{\sum_{i=1}^n \sigma_i^2 + bt/3} \right)$$

For the i.i.d. sample mean with variance $\sigma^2$:

$$\Pr\left( |\bar{X}_n - \mu| \ge t \right) \le 2 \exp\left( -\frac{nt^2/2}{\sigma^2 + bt/3} \right)$$
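As a quick numerical sanity check, the two i.i.d. tail bounds above can be evaluated directly. A minimal sketch (the helper names are ours, not standard):

```python
import math

def hoeffding_bound(n, t, a=0.0, b=1.0):
    """Two-sided Hoeffding tail bound for the mean of n i.i.d. variables in [a, b]."""
    return 2.0 * math.exp(-2.0 * n * t**2 / (b - a)**2)

def bernstein_bound(n, t, sigma2, b=1.0):
    """Two-sided Bernstein tail bound; b bounds |X_i - mu|, sigma2 is the variance."""
    return 2.0 * math.exp(-(n * t**2 / 2.0) / (sigma2 + b * t / 3.0))

# With a small variance, Bernstein is far tighter at the same deviation.
print(hoeffding_bound(n=1000, t=0.05))              # uses only the range
print(bernstein_bound(n=1000, t=0.05, sigma2=0.01))  # exploits the variance
```

Both functions are monotone in the obvious directions: larger $n$ or larger $t$ shrinks the bound, while a larger range or variance loosens it.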
Where Each Is Stronger
Hoeffding wins on simplicity
Hoeffding requires only one piece of information: the range $[a, b]$. There is no variance to estimate, no extra parameter to tune. If you have bounded random variables and no variance information, Hoeffding is the right tool. The bound is clean, easy to invert, and the proof is short.
Bernstein wins on tightness when variance is small
Consider a random variable in $[0, 1]$ that equals zero with probability $0.99$. The variance is at most $0.01$, but the range is $[0, 1]$, so Hoeffding's exponent uses $(b - a)^2 = 1$. Hoeffding treats the variable as if it could take any value in $[0, 1]$ with equal ease. Bernstein knows the variable is usually near zero and penalizes accordingly.

Concretely, for $n$ i.i.d. copies with $b = 1$ and $\sigma^2 = 0.01$, the exponents in the two tail bounds are:
| Deviation $t$ | Hoeffding exponent | Bernstein exponent |
|---|---|---|
| 0.01 | $2 \times 10^{-4} \, n$ | $\approx 3.8 \times 10^{-3} \, n$ |
| 0.1 | $2 \times 10^{-2} \, n$ | $\approx 1.2 \times 10^{-1} \, n$ |
Bernstein gives roughly 6x to 19x tighter exponents here. The improvement grows as the deviation $t$ shrinks and as the variance shrinks relative to the range.
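The exponents in the table can be reproduced directly. A sketch under the same assumptions ($X_i \in [0, 1]$, $b = 1$, $\sigma^2 = 0.01$):

```python
def exponent_ratio(t, sigma2=0.01, b=1.0):
    """Ratio of Bernstein's per-sample exponent to Hoeffding's for X in [0, 1]."""
    hoeffding = 2.0 * t**2                          # 2 t^2 / (b - a)^2 with b - a = 1
    bernstein = (t**2 / 2.0) / (sigma2 + b * t / 3.0)
    return bernstein / hoeffding

print(round(exponent_ratio(0.01), 1))  # advantage at small deviations
print(round(exponent_ratio(0.1), 1))   # advantage at larger deviations
```

Since the factor $2\exp(\cdot)$ is common to both bounds, comparing exponents per sample is enough; the $n$ cancels.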
Where Each Fails
Hoeffding fails when the range is loose
If $X \in [-1, 1]$ but $\sigma^2 = 10^{-4}$, Hoeffding treats the variable as if it had variance $(b - a)^2 / 4 = 1$. The exponent is 10,000 times looser than necessary. Bernstein would use $\sigma^2 = 10^{-4}$ and give a bound that is orders of magnitude tighter.
Bernstein fails when variance is unknown
Bernstein requires knowing (or upper-bounding) the variance. In many settings you have bounded data but no reliable variance estimate. You can estimate variance from data, but then you need a concentration bound on the variance estimate itself, creating a circular dependency that adds technical complexity.
Both fail for unbounded random variables
Neither Hoeffding nor Bernstein applies to unbounded random variables. For Gaussians and other unbounded distributions, you need the sub-Gaussian or sub-exponential framework, which generalizes both inequalities.
Key Assumptions That Differ
| | Hoeffding | Bernstein |
|---|---|---|
| Input | Range $b - a$ | Range and variance $\sigma^2$ |
| Independence | Required | Required |
| Boundedness | Required ($X_i \in [a_i, b_i]$) | Required ($|X_i - \mathbb{E}[X_i]| \le b$) |
| Variance | Not used | Used explicitly |
| Proof technique | MGF + Hoeffding's lemma | MGF + Bernstein's condition |
The Interpolation: Bernstein Bridges Hoeffding and CLT
Bernstein's Two Regimes
Statement
For i.i.d. variables with $|X_i - \mu| \le b$ and $\mathrm{Var}(X_i) = \sigma^2$, the Bernstein bound for the sample mean behaves as:

$$\Pr\left( |\bar{X}_n - \mu| \ge t \right) \le 2 \exp\left( -\frac{nt^2/2}{\sigma^2 + bt/3} \right)$$

Small deviations ($t \ll \sigma^2 / b$): the denominator $\sigma^2 + bt/3 \approx \sigma^2$, and:

$$\Pr\left( |\bar{X}_n - \mu| \ge t \right) \lesssim 2 \exp\left( -\frac{nt^2}{2\sigma^2} \right)$$

This matches the CLT-type Gaussian tail with the true variance.

Large deviations ($t \gg \sigma^2 / b$): the denominator $\sigma^2 + bt/3 \approx bt/3$, and:

$$\Pr\left( |\bar{X}_n - \mu| \ge t \right) \lesssim 2 \exp\left( -\frac{3nt}{2b} \right)$$

This is a sub-exponential (Poisson-like) tail, linear in $t$ rather than quadratic.
Intuition
Bernstein continuously interpolates between two behaviors. Near the mean (small ), the bound is Gaussian with the correct variance. This matches what the CLT gives you, but with an explicit finite-sample bound. Far from the mean (large ), the bound transitions to a sub-exponential tail, slower than Gaussian, but still exponential.
Hoeffding, by contrast, gives a single Gaussian-like exponent everywhere, but with the worst-case variance $(b - a)^2 / 4$ instead of the true variance $\sigma^2$.
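The two regimes can be seen numerically: well below the crossover scale $t \approx \sigma^2 / b$, doubling $t$ roughly quadruples the Bernstein exponent (Gaussian behavior); well above it, doubling $t$ roughly doubles the exponent (sub-exponential behavior). A sketch with $\sigma^2 = 0.01$ and $b = 1$:

```python
sigma2, b = 0.01, 1.0

def bernstein_exponent(t):
    """Per-sample Bernstein exponent: (t^2 / 2) / (sigma^2 + b t / 3)."""
    return (t**2 / 2.0) / (sigma2 + b * t / 3.0)

# Gaussian regime: t far below sigma^2 / b = 0.01.
small = bernstein_exponent(0.002) / bernstein_exponent(0.001)
# Sub-exponential regime: t far above sigma^2 / b.
large = bernstein_exponent(2.0) / bernstein_exponent(1.0)
print(small)  # close to 4 (quadratic growth in t)
print(large)  # close to 2 (linear growth in t)
```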
What to Memorize
For a qualifying exam or rapid derivation, commit these to memory:
- Hoeffding (i.i.d., $X_i \in [a, b]$):

  $$\Pr\left( |\bar{X}_n - \mu| \ge t \right) \le 2 \exp\left( -\frac{2nt^2}{(b - a)^2} \right)$$

- Bernstein (i.i.d., $|X_i - \mu| \le b$, $\mathrm{Var}(X_i) = \sigma^2$):

  $$\Pr\left( |\bar{X}_n - \mu| \ge t \right) \le 2 \exp\left( -\frac{nt^2/2}{\sigma^2 + bt/3} \right)$$

- The decision rule: if you know the variance and it is much smaller than the range squared, use Bernstein. Otherwise, Hoeffding is simpler and gives the same order.

- The punchline: Bernstein is never worse than Hoeffding (up to constants). In the worst case ($\sigma^2 \approx (b - a)^2 / 4$), they give the same rate.
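The decision rule above can be collapsed into one helper that returns whichever bound is tighter given the available information. A sketch (the function name and parameters are ours; `sigma2=None` falls back to Hoeffding):

```python
import math

def mean_tail_bound(n, t, b_minus_a, sigma2=None, b_center=None):
    """Tighter of Hoeffding/Bernstein for P(|mean - mu| >= t), given what is known."""
    hoeffding = 2.0 * math.exp(-2.0 * n * t**2 / b_minus_a**2)
    if sigma2 is None:
        return hoeffding  # no variance information: Hoeffding is all we can use
    # |X - mu| <= b - a always holds, so b_minus_a is a safe default for b.
    b = b_center if b_center is not None else b_minus_a
    bernstein = 2.0 * math.exp(-(n * t**2 / 2.0) / (sigma2 + b * t / 3.0))
    return min(hoeffding, bernstein)

print(mean_tail_bound(100, 0.1, 1.0))                # range only
print(mean_tail_bound(100, 0.1, 1.0, sigma2=0.01))   # variance known and small
```

Taking the minimum is legitimate because each inequality holds on its own, so the smaller of the two valid bounds is also valid.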
When a Researcher Would Use Each
Learning theory: ERM bound for finite classes
The standard ERM generalization bound uses Hoeffding because the loss is bounded in $[0, 1]$ and you do not typically have a variance bound for the loss of each hypothesis. Hoeffding is clean and sufficient.
Sparse estimation: most entries are zero
If your data is mostly zeros (e.g., rare event counting, sparse features), the variance is much smaller than the range. Use Bernstein to exploit the small variance. The same situation arises in importance sampling, where weights can have a large range but low variance under a good proposal distribution.
Empirical Bernstein bounds
In practice, you often do not know $\sigma^2$. The empirical Bernstein bound replaces $\sigma^2$ with the sample variance $\hat{\sigma}_n^2$ and adds a correction term. This requires Bernstein applied to the variance estimate itself, plus a union bound. It is more complex but adaptive.
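One commonly cited form is the Maurer-Pontil empirical Bernstein inequality for variables in $[0, 1]$: with probability at least $1 - \delta$, the mean deviates from the sample mean by at most $\sqrt{2 \hat{\sigma}_n^2 \ln(2/\delta) / n} + 7 \ln(2/\delta) / (3(n-1))$. A sketch assuming that form (the function name is ours):

```python
import math
from statistics import variance

def empirical_bernstein_radius(xs, delta):
    """Confidence radius for the mean of variables in [0, 1], via the
    Maurer-Pontil empirical Bernstein inequality (assumed form)."""
    n = len(xs)
    v = variance(xs)  # unbiased sample variance
    log_term = math.log(2.0 / delta)
    return math.sqrt(2.0 * v * log_term / n) + 7.0 * log_term / (3.0 * (n - 1))

# Sparse sample: mostly zeros, so the sample variance (hence the radius) is small.
xs = [0.0] * 95 + [1.0] * 5
print(empirical_bernstein_radius(xs, delta=0.05))
```

The second term decays at rate $1/n$, so for large $n$ the radius is driven by the observed variance, not the worst-case range.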
Common Confusions
Bernstein is not always dramatically better
The dramatic improvements (10x or more) happen only when $\sigma^2 \ll (b - a)^2$. If the variance is close to its maximum possible value for the given range (e.g., a Bernoulli with $p = 1/2$, where $\sigma^2 = 1/4 = (b - a)^2 / 4$), Bernstein and Hoeffding give comparable bounds. Do not reach for Bernstein unless you have reason to believe the variance is small.
The constants matter at small sample sizes
Both bounds have constants (the $2$ in Hoeffding's exponent, the $1/2$ and the $bt/3$ term in Bernstein's). At small $n$ or small $t$, these constants can make one bound tighter than the other even in regimes where the other is asymptotically superior. Always check the actual numerical bound, not just the asymptotic rate.