
Comparison

Hoeffding vs. Bernstein Inequality

When to use range-only bounds vs. variance-aware bounds: Bernstein is tighter when variance is small, Hoeffding is simpler and sufficient when it is not.

What Each Measures

Both Hoeffding and Bernstein bound the tail probability $\Pr[|\bar{X}_n - \mu| \geq t]$ for the sample mean of independent random variables. They answer the same question: how far can the empirical average stray from the true mean? But they use different information and give different answers.

Hoeffding uses only the bounded range $[a_i, b_i]$ of each variable.

Bernstein uses both the bounded range and the variance $\sigma_i^2$.

Side-by-Side Statement

Definition

Hoeffding's Inequality

If $a_i \leq X_i \leq b_i$ are independent, then:

$$\Pr[|\bar{X}_n - \mu| \geq t] \leq 2\exp\!\left(-\frac{2n^2 t^2}{\sum_{i=1}^n (b_i - a_i)^2}\right)$$

For i.i.d. variables in $[0, 1]$:

$$\Pr[|\bar{X}_n - \mu| \geq t] \leq 2e^{-2nt^2}$$

Definition

Bernstein's Inequality

If the $X_i$ are independent with $|X_i - \mu_i| \leq M$ and variance $\sigma_i^2$, then:

$$\Pr\!\left[\left|\sum_{i=1}^n (X_i - \mu_i)\right| \geq t\right] \leq 2\exp\!\left(-\frac{t^2/2}{\sum_i \sigma_i^2 + Mt/3}\right)$$

For the i.i.d. sample mean with variance $\sigma^2$:

$$\Pr[|\bar{X}_n - \mu| \geq t] \leq 2\exp\!\left(-\frac{nt^2/2}{\sigma^2 + Mt/3}\right)$$
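Both closed forms above are easy to evaluate directly. A minimal sketch in Python (the sample values $n = 1000$, $t = 0.05$, $\sigma^2 = 0.01$, $M = 1$ are illustrative assumptions, not from the text):

```python
import math

def hoeffding_bound(n, t, a=0.0, b=1.0):
    """Hoeffding for the i.i.d. sample mean: 2 * exp(-2 n t^2 / (b - a)^2)."""
    return 2 * math.exp(-2 * n * t**2 / (b - a) ** 2)

def bernstein_bound(n, t, sigma2, M):
    """Bernstein for the i.i.d. sample mean: 2 * exp(-(n t^2 / 2) / (sigma2 + M t / 3))."""
    return 2 * math.exp(-(n * t**2 / 2) / (sigma2 + M * t / 3))

# Low-variance setting: sigma^2 = 0.01 is far below the range-based worst case 1/4.
n, t = 1000, 0.05
print(hoeffding_bound(n, t))           # range-only bound
print(bernstein_bound(n, t, 0.01, 1))  # variance-aware bound, far smaller here
```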

Where Each Is Stronger

Hoeffding wins on simplicity

Hoeffding requires only one piece of information: the range $[a, b]$. There is no variance to estimate, no extra parameter to tune. If you have bounded random variables and no variance information, Hoeffding is the right tool. The bound is clean, easy to invert, and the proof is short.

Bernstein wins on tightness when variance is small

Consider a random variable $X \in [0, 1]$ with $p = \mathbb{E}[X] = 0.01$. The variance is at most $p(1-p) \approx 0.01$, but the range is $[0, 1]$, so $(b - a)^2 = 1$. Hoeffding treats the variable as if it could take any value in $[0, 1]$ with equal ease. Bernstein knows the variable is usually near zero and penalizes accordingly.

Concretely, for $n$ i.i.d. copies with $\sigma^2 = 0.01$ and $M = 1$:

| Deviation $t$ | Hoeffding exponent | Bernstein exponent |
|---|---|---|
| $0.01$ | $-2n(0.01)^2 = -0.0002n$ | $-\dfrac{n(0.01)^2/2}{0.01 + 0.01/3} \approx -0.0038n$ |
| $0.1$ | $-2n(0.1)^2 = -0.02n$ | $-\dfrac{n(0.1)^2/2}{0.01 + 0.1/3} \approx -0.12n$ |

Bernstein gives substantially tighter exponents here, roughly an order of magnitude. The improvement grows as the variance shrinks relative to the range.
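These exponents follow directly from the two closed forms. A quick sketch that recomputes the coefficient of $n$ in each exponent (the values $\sigma^2 = 0.01$, $M = 1$ are the illustrative setup from above):

```python
def hoeffding_coeff(t, a=0.0, b=1.0):
    # Coefficient of n in Hoeffding's exponent: -2 t^2 / (b - a)^2
    return -2 * t**2 / (b - a) ** 2

def bernstein_coeff(t, sigma2=0.01, M=1.0):
    # Coefficient of n in Bernstein's exponent: -(t^2 / 2) / (sigma2 + M t / 3)
    return -(t**2 / 2) / (sigma2 + M * t / 3)

for t in (0.01, 0.1):
    h, b = hoeffding_coeff(t), bernstein_coeff(t)
    print(f"t={t}: Hoeffding {h:.5f}*n, Bernstein {b:.5f}*n, ratio {b / h:.1f}x")
```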

Where Each Fails

Hoeffding fails when the range is loose

If $X \in [-100, 100]$ but $\text{Var}(X) = 1$, Hoeffding treats the variable as if it had variance $(200)^2/4 = 10{,}000$. The bound is 10,000 times looser than necessary. Bernstein would use $\sigma^2 = 1$ and give a bound that is orders of magnitude tighter.

Bernstein fails when variance is unknown

Bernstein requires knowing (or upper-bounding) the variance. In many settings you have bounded data but no reliable variance estimate. You can estimate variance from data, but then you need a concentration bound on the variance estimate itself, creating a circular dependency that adds technical complexity.

Both fail for unbounded random variables

Neither Hoeffding nor Bernstein applies to unbounded random variables. For Gaussians and other unbounded distributions, you need the sub-Gaussian or sub-exponential framework, which generalizes both inequalities.

Key Assumptions That Differ

|  | Hoeffding | Bernstein |
|---|---|---|
| Input | Range $[a_i, b_i]$ | Range $M$ and variance $\sigma_i^2$ |
| Independence | Required | Required |
| Boundedness | Required ($a_i \leq X_i \leq b_i$) | Required ($\lvert X_i - \mu_i \rvert \leq M$) |
| Variance | Not used | Used explicitly |
| Proof technique | MGF + Hoeffding's lemma | MGF + Bernstein's condition |

The Interpolation: Bernstein Bridges Hoeffding and CLT

Theorem

Bernstein's Two Regimes

Statement

For i.i.d. variables with $|X_i - \mu| \leq M$ and $\text{Var}(X_i) = \sigma^2$, the Bernstein bound for the sample mean behaves as:

Small deviations ($t \ll \sigma^2/M$): The denominator $\sigma^2 + Mt/3 \approx \sigma^2$, and:

$$\Pr[|\bar{X}_n - \mu| \geq t] \lesssim 2\exp\!\left(-\frac{nt^2}{2\sigma^2}\right)$$

This matches the CLT-type Gaussian tail with the true variance.

Large deviations ($t \gg \sigma^2/M$): The denominator $\sigma^2 + Mt/3 \approx Mt/3$, and:

$$\Pr[|\bar{X}_n - \mu| \geq t] \lesssim 2\exp\!\left(-\frac{3nt}{2M}\right)$$

This is a sub-exponential (Poisson-like) tail, linear in $t$.

Intuition

Bernstein continuously interpolates between two behaviors. Near the mean (small $t$), the bound is Gaussian with the correct variance. This matches what the CLT gives you, but with an explicit finite-sample bound. Far from the mean (large $t$), the bound transitions to a sub-exponential tail, slower than Gaussian, but still exponential.

Hoeffding, by contrast, gives a single Gaussian-like exponent everywhere, but with the worst-case variance $(b-a)^2/4$ instead of the true variance.
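The crossover between the two regimes sits near $t = 3\sigma^2/M$, where the two denominator terms are equal. A small sketch comparing the exact per-sample exponent with its two approximations (the values $\sigma^2 = 0.01$, $M = 1$ are illustrative):

```python
sigma2, M = 0.01, 1.0  # illustrative; crossover at t = 3 * sigma2 / M = 0.03

def exact(t):
    # Per-sample Bernstein exponent: -(t^2 / 2) / (sigma2 + M t / 3)
    return -(t**2 / 2) / (sigma2 + M * t / 3)

def gaussian_regime(t):
    # Small-t approximation: denominator ~ sigma2
    return -(t**2) / (2 * sigma2)

def subexp_regime(t):
    # Large-t approximation: denominator ~ M t / 3
    return -3 * t / (2 * M)

for t in (0.001, 0.03, 1.0):  # well below, at, and well above the crossover
    print(f"t={t}: exact {exact(t):.5f}, "
          f"gaussian {gaussian_regime(t):.5f}, subexp {subexp_regime(t):.5f}")
```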

What to Memorize

For a qualifying exam or rapid derivation, commit these to memory:

  1. Hoeffding (i.i.d., $[0,1]$): $\Pr[|\bar{X}_n - \mu| \geq t] \leq 2e^{-2nt^2}$

  2. Bernstein (i.i.d., $|X - \mu| \leq M$): $\Pr[|\bar{X}_n - \mu| \geq t] \leq 2\exp\!\left(-\frac{nt^2/2}{\sigma^2 + Mt/3}\right)$

  3. The decision rule: If you know the variance and it is much smaller than the range squared, use Bernstein. Otherwise, Hoeffding is simpler and gives the same order.

  4. The punchline: Bernstein is never worse than Hoeffding (up to constants). In the worst case ($\sigma^2 = (b-a)^2/4$), they give the same rate.
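Inverting the two memorized bounds for $n$ turns them into sample-size requirements; a sketch (the target accuracy $t = 0.01$ and confidence $\delta = 0.05$ are illustrative assumptions):

```python
import math

def n_hoeffding(t, delta):
    # Solve 2 * exp(-2 n t^2) <= delta for n:  n >= ln(2/delta) / (2 t^2)
    return math.ceil(math.log(2 / delta) / (2 * t**2))

def n_bernstein(t, delta, sigma2, M):
    # Solve 2 * exp(-(n t^2 / 2) / (sigma2 + M t / 3)) <= delta for n:
    #   n >= 2 (sigma2 + M t / 3) ln(2/delta) / t^2
    return math.ceil(2 * (sigma2 + M * t / 3) * math.log(2 / delta) / t**2)

# Decision rule in action: with small variance, Bernstein needs far fewer samples.
print(n_hoeffding(0.01, 0.05))
print(n_bernstein(0.01, 0.05, 0.01, 1.0))
```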

When a Researcher Would Use Each

Example

Learning theory: ERM bound for finite classes

The standard ERM generalization bound uses Hoeffding because the loss is bounded in $[0, 1]$ and you do not typically have a variance bound for the loss of each hypothesis. Hoeffding is clean and sufficient.

Example

Sparse estimation: most entries are zero

If your data is mostly zeros (e.g., rare event counting, sparse features), the variance $\sigma^2$ is much smaller than the worst case $(b-a)^2/4$ allowed by the range. Use Bernstein to exploit the small variance. This arises in importance sampling, where weights can have large range but low variance under a good proposal distribution.

Example

Empirical Bernstein bounds

In practice, you often do not know $\sigma^2$. The empirical Bernstein bound replaces $\sigma^2$ with the sample variance $\hat{\sigma}^2$ and adds a correction term. This requires Bernstein applied to the variance estimate itself, plus a union bound. It is more complex but adaptive.
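As a sketch of one common form: the Maurer–Pontil empirical Bernstein bound for variables in $[0, 1]$ gives, with probability at least $1 - \delta$, the one-sided deviation $\mu - \bar{X}_n \leq \sqrt{2\hat{\sigma}^2 \ln(2/\delta)/n} + 7\ln(2/\delta)/(3(n-1))$. The constants are from that result (a two-sided version follows from a union bound); the sparse data below is a made-up illustration:

```python
import math
import random

def empirical_bernstein_radius(xs, delta):
    # One-sided Maurer-Pontil radius for data in [0, 1]:
    #   sqrt(2 * V * ln(2/delta) / n) + 7 * ln(2/delta) / (3 * (n - 1))
    # where V is the unbiased sample variance.
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / (n - 1)
    log_term = math.log(2 / delta)
    return math.sqrt(2 * var * log_term / n) + 7 * log_term / (3 * (n - 1))

def hoeffding_radius(n, delta):
    # Inverting 2 * exp(-2 n t^2) <= delta for data in [0, 1]
    return math.sqrt(math.log(2 / delta) / (2 * n))

# Sparse data: mostly zeros, so the empirical variance is tiny and the
# adaptive radius beats the range-only Hoeffding radius.
random.seed(0)
xs = [1.0 if random.random() < 0.01 else 0.0 for _ in range(10_000)]
print(empirical_bernstein_radius(xs, 0.05))
print(hoeffding_radius(len(xs), 0.05))
```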

Common Confusions

Watch Out

Bernstein is not always dramatically better

The dramatic improvements (10x or more) happen only when $\sigma^2 \ll (b - a)^2/4$. If the variance is close to its maximum possible value for the given range (e.g., a Bernoulli with $p = 1/2$, where $\sigma^2 = 1/4 = (b-a)^2/4$), Bernstein and Hoeffding give comparable bounds. Do not reach for Bernstein unless you have reason to believe the variance is small.
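In the Bernoulli($1/2$) case, Hoeffding is in fact the (slightly) tighter of the two at every $t$, because Bernstein still pays the $Mt/3$ term in its denominator. A quick sketch (the choice $n = 100$ is illustrative):

```python
import math

def hoeffding(n, t):
    # 2 * exp(-2 n t^2) for variables in [0, 1]
    return 2 * math.exp(-2 * n * t**2)

def bernstein(n, t, sigma2=0.25, M=1.0):
    # 2 * exp(-(n t^2 / 2) / (sigma2 + M t / 3)); sigma2 = 1/4 is the maximum
    return 2 * math.exp(-(n * t**2 / 2) / (sigma2 + M * t / 3))

n = 100
for t in (0.05, 0.1, 0.2):
    print(f"t={t}: Hoeffding {hoeffding(n, t):.5f}, Bernstein {bernstein(n, t):.5f}")
```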

Watch Out

The constants matter at small sample sizes

Both bounds have constants (the 2 in Hoeffding's exponent, the $1/2$ and $1/3$ in Bernstein's). At small $n$ or small $t$, these constants can make one bound tighter than the other even in regimes where the other is asymptotically superior. Always check the actual numerical bound, not just the asymptotic rate.