
Comparison

Hoeffding vs. Bernstein Inequality

When to use range-only bounds vs. variance-aware bounds: Bernstein is tighter when variance is small, Hoeffding is simpler and sufficient when it is not.

What Each Measures

Both Hoeffding and Bernstein bound the tail probability $\Pr[|\bar{X}_n - \mu| \geq t]$ for the sample mean of independent random variables. They answer the same question: how far can the empirical average stray from the true mean? But they use different information and give different answers.

Hoeffding uses only the bounded range $[a_i, b_i]$ of each variable.

Bernstein uses both the bounded range and the variance $\sigma_i^2$.

Side-by-Side Statement

Definition

Hoeffding's Inequality

If $a_i \leq X_i \leq b_i$ are independent, then:

$$\Pr[|\bar{X}_n - \mu| \geq t] \leq 2\exp\!\left(-\frac{2n^2 t^2}{\sum_{i=1}^n (b_i - a_i)^2}\right)$$

For i.i.d. variables in $[0, 1]$:

$$\Pr[|\bar{X}_n - \mu| \geq t] \leq 2e^{-2nt^2}$$

Definition

Bernstein's Inequality

If the $X_i$ are independent with $|X_i - \mu_i| \leq M$ and variance $\sigma_i^2$, then:

$$\Pr\!\left[\left|\sum_{i=1}^n (X_i - \mu_i)\right| \geq t\right] \leq 2\exp\!\left(-\frac{t^2/2}{\sum_i \sigma_i^2 + Mt/3}\right)$$

For the i.i.d. sample mean with variance $\sigma^2$:

$$\Pr[|\bar{X}_n - \mu| \geq t] \leq 2\exp\!\left(-\frac{nt^2/2}{\sigma^2 + Mt/3}\right)$$
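Both closed forms above are easy to evaluate directly. A minimal sketch in Python (the sample values $n = 1000$, $t = 0.05$, $\sigma^2 = 0.01$, $M = 1$ are illustrative assumptions, not from the text):

```python
import math

def hoeffding_bound(n, t, a=0.0, b=1.0):
    """Hoeffding for the i.i.d. sample mean: 2 * exp(-2 n t^2 / (b - a)^2)."""
    return 2 * math.exp(-2 * n * t**2 / (b - a) ** 2)

def bernstein_bound(n, t, sigma2, M):
    """Bernstein for the i.i.d. sample mean: 2 * exp(-(n t^2 / 2) / (sigma2 + M t / 3))."""
    return 2 * math.exp(-(n * t**2 / 2) / (sigma2 + M * t / 3))

# Low-variance setting: sigma^2 = 0.01 is far below the range-based worst case 1/4.
n, t = 1000, 0.05
print(hoeffding_bound(n, t))           # range-only bound
print(bernstein_bound(n, t, 0.01, 1))  # variance-aware bound, far smaller here
```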

Where Each Is Stronger

Hoeffding wins on simplicity

Hoeffding requires only one piece of information: the range $[a, b]$. There is no variance to estimate, no extra parameter to tune. If you have bounded random variables and no variance information, Hoeffding is the right tool. The bound is clean, easy to invert, and the proof is short.

Bernstein wins on tightness when variance is small

Consider a random variable $X \in [0, 1]$ with $p = \mathbb{E}[X] = 0.01$. The variance is at most $p(1-p) \approx 0.01$, but the range is $[0, 1]$, so $(b - a)^2 = 1$. Hoeffding treats the variable as if it could take any value in $[0, 1]$ with equal ease. Bernstein knows the variable is usually near zero and penalizes accordingly.

Concretely, for $n$ i.i.d. copies with $\sigma^2 = 0.01$ and $M = 1$:

| Deviation $t$ | Hoeffding exponent | Bernstein exponent |
|---|---|---|
| $0.01$ | $-2n(0.01)^2 = -0.0002n$ | $-\dfrac{n(0.01)^2/2}{0.01 + 0.01/3} \approx -0.0038n$ |
| $0.1$ | $-2n(0.1)^2 = -0.02n$ | $-\dfrac{n(0.1)^2/2}{0.01 + 0.1/3} \approx -0.12n$ |

Bernstein gives substantially tighter exponents here, roughly an order of magnitude. The improvement grows as the variance shrinks relative to the range.
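These exponents follow directly from the two closed forms. A quick sketch that recomputes the coefficient of $n$ in each exponent (the values $\sigma^2 = 0.01$, $M = 1$ are the illustrative setup from above):

```python
def hoeffding_coeff(t, a=0.0, b=1.0):
    # Coefficient of n in Hoeffding's exponent: -2 t^2 / (b - a)^2
    return -2 * t**2 / (b - a) ** 2

def bernstein_coeff(t, sigma2=0.01, M=1.0):
    # Coefficient of n in Bernstein's exponent: -(t^2 / 2) / (sigma2 + M t / 3)
    return -(t**2 / 2) / (sigma2 + M * t / 3)

for t in (0.01, 0.1):
    h, b = hoeffding_coeff(t), bernstein_coeff(t)
    print(f"t={t}: Hoeffding {h:.5f}*n, Bernstein {b:.5f}*n, ratio {b / h:.1f}x")
```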

Where Each Fails

Hoeffding fails when the range is loose

If $X \in [-100, 100]$ but $\text{Var}(X) = 1$, Hoeffding treats the variable as if it had variance $(200)^2/4 = 10{,}000$. The bound is 10,000 times looser than necessary. Bernstein would use $\sigma^2 = 1$ and give a bound that is orders of magnitude tighter.

Bernstein fails when variance is unknown

Bernstein requires knowing (or upper-bounding) the variance. In many settings you have bounded data but no reliable variance estimate. You can estimate variance from data, but then you need a concentration bound on the variance estimate itself, creating a circular dependency that adds technical complexity.

Both fail for unbounded random variables

Neither Hoeffding nor Bernstein applies to unbounded random variables. For Gaussians and other unbounded distributions, you need the sub-Gaussian or sub-exponential framework, which generalizes both inequalities.

Key Assumptions That Differ

|  | Hoeffding | Bernstein |
|---|---|---|
| Input | Range $[a_i, b_i]$ | Range $M$ and variance $\sigma_i^2$ |
| Independence | Required | Required |
| Boundedness | Required ($a_i \leq X_i \leq b_i$) | Required ($\lvert X_i - \mu_i \rvert \leq M$) |
| Variance | Not used | Used explicitly |
| Proof technique | MGF + Hoeffding's lemma | MGF + Bernstein's condition |

The Interpolation: Bernstein Bridges Hoeffding and CLT

Theorem

Bernstein's Two Regimes

Statement

For i.i.d. variables with $|X_i - \mu| \leq M$ and $\text{Var}(X_i) = \sigma^2$, the Bernstein bound for the sample mean behaves as:

Small deviations ($t \ll \sigma^2/M$): The denominator $\sigma^2 + Mt/3 \approx \sigma^2$, and:

$$\Pr[|\bar{X}_n - \mu| \geq t] \lesssim 2\exp\!\left(-\frac{nt^2}{2\sigma^2}\right)$$

This matches the CLT-type Gaussian tail with the true variance.

Large deviations ($t \gg \sigma^2/M$): The denominator $\sigma^2 + Mt/3 \approx Mt/3$, and:

$$\Pr[|\bar{X}_n - \mu| \geq t] \lesssim 2\exp\!\left(-\frac{3nt}{2M}\right)$$

This is a sub-exponential (Poisson-like) tail, linear in $t$.

Intuition

Bernstein continuously interpolates between two behaviors. Near the mean (small $t$), the bound is Gaussian with the correct variance. This matches what the CLT gives you, but with an explicit finite-sample bound. Far from the mean (large $t$), the bound transitions to a sub-exponential tail, slower than Gaussian, but still exponential.

Hoeffding, by contrast, gives a single Gaussian-like exponent everywhere, but with the worst-case variance $(b-a)^2/4$ instead of the true variance.
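The crossover between the two regimes sits near $t = 3\sigma^2/M$, where the two denominator terms are equal. A small sketch comparing the exact per-sample exponent with its two approximations (the values $\sigma^2 = 0.01$, $M = 1$ are illustrative):

```python
sigma2, M = 0.01, 1.0  # illustrative; crossover at t = 3 * sigma2 / M = 0.03

def exact(t):
    # Per-sample Bernstein exponent: -(t^2 / 2) / (sigma2 + M t / 3)
    return -(t**2 / 2) / (sigma2 + M * t / 3)

def gaussian_regime(t):
    # Small-t approximation: denominator ~ sigma2
    return -(t**2) / (2 * sigma2)

def subexp_regime(t):
    # Large-t approximation: denominator ~ M t / 3
    return -3 * t / (2 * M)

for t in (0.001, 0.03, 1.0):  # well below, at, and well above the crossover
    print(f"t={t}: exact {exact(t):.5f}, "
          f"gaussian {gaussian_regime(t):.5f}, subexp {subexp_regime(t):.5f}")
```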

What to Memorize

For a qualifying exam or rapid derivation, commit these to memory:

  1. Hoeffding (i.i.d., $[0,1]$): $\Pr[|\bar{X}_n - \mu| \geq t] \leq 2e^{-2nt^2}$

  2. Bernstein (i.i.d., $|X - \mu| \leq M$): $\Pr[|\bar{X}_n - \mu| \geq t] \leq 2\exp\!\left(-\frac{nt^2/2}{\sigma^2 + Mt/3}\right)$

  3. The decision rule: If you know the variance and it is much smaller than the range squared, use Bernstein. Otherwise, Hoeffding is simpler and gives the same order.

  4. The punchline: Bernstein is never worse than Hoeffding (up to constants). In the worst case ($\sigma^2 = (b-a)^2/4$), they give the same rate.
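Inverting the two memorized bounds for $n$ turns them into sample-size requirements; a sketch (the target accuracy $t = 0.01$ and confidence $\delta = 0.05$ are illustrative assumptions):

```python
import math

def n_hoeffding(t, delta):
    # Solve 2 * exp(-2 n t^2) <= delta for n:  n >= ln(2/delta) / (2 t^2)
    return math.ceil(math.log(2 / delta) / (2 * t**2))

def n_bernstein(t, delta, sigma2, M):
    # Solve 2 * exp(-(n t^2 / 2) / (sigma2 + M t / 3)) <= delta for n:
    #   n >= 2 (sigma2 + M t / 3) ln(2/delta) / t^2
    return math.ceil(2 * (sigma2 + M * t / 3) * math.log(2 / delta) / t**2)

# Decision rule in action: with small variance, Bernstein needs far fewer samples.
print(n_hoeffding(0.01, 0.05))
print(n_bernstein(0.01, 0.05, 0.01, 1.0))
```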

When a Researcher Would Use Each

Example

Learning theory: ERM bound for finite classes

The standard ERM generalization bound uses Hoeffding because the loss is bounded in $[0, 1]$ and you do not typically have a variance bound for the loss of each hypothesis. Hoeffding is clean and sufficient.

Example

Sparse estimation: most entries are zero

If your data is mostly zeros (e.g., rare event counting, sparse features), the variance $\sigma^2$ is much smaller than the worst case $(b-a)^2/4$ allowed by the range. Use Bernstein to exploit the small variance. This arises in importance sampling, where weights can have large range but low variance under a good proposal distribution.

Example

Empirical Bernstein bounds

In practice, you often do not know $\sigma^2$. The empirical Bernstein bound replaces $\sigma^2$ with the sample variance $\hat{\sigma}^2$ and adds a correction term. This requires Bernstein applied to the variance estimate itself, plus a union bound. It is more complex but adaptive.
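As a sketch of one common form: the Maurer–Pontil empirical Bernstein bound for variables in $[0, 1]$ gives, with probability at least $1 - \delta$, the one-sided deviation $\mu - \bar{X}_n \leq \sqrt{2\hat{\sigma}^2 \ln(2/\delta)/n} + 7\ln(2/\delta)/(3(n-1))$. The constants are from that result (a two-sided version follows from a union bound); the sparse data below is a made-up illustration:

```python
import math
import random

def empirical_bernstein_radius(xs, delta):
    # One-sided Maurer-Pontil radius for data in [0, 1]:
    #   sqrt(2 * V * ln(2/delta) / n) + 7 * ln(2/delta) / (3 * (n - 1))
    # where V is the unbiased sample variance.
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / (n - 1)
    log_term = math.log(2 / delta)
    return math.sqrt(2 * var * log_term / n) + 7 * log_term / (3 * (n - 1))

def hoeffding_radius(n, delta):
    # Inverting 2 * exp(-2 n t^2) <= delta for data in [0, 1]
    return math.sqrt(math.log(2 / delta) / (2 * n))

# Sparse data: mostly zeros, so the empirical variance is tiny and the
# adaptive radius beats the range-only Hoeffding radius.
random.seed(0)
xs = [1.0 if random.random() < 0.01 else 0.0 for _ in range(10_000)]
print(empirical_bernstein_radius(xs, 0.05))
print(hoeffding_radius(len(xs), 0.05))
```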

Common Confusions

Watch Out

Bernstein is not always dramatically better

The dramatic improvements (10x or more) happen only when $\sigma^2 \ll (b - a)^2/4$. If the variance is close to its maximum possible value for the given range (e.g., a Bernoulli with $p = 1/2$, where $\sigma^2 = 1/4 = (b-a)^2/4$), Bernstein and Hoeffding give comparable bounds. Do not reach for Bernstein unless you have reason to believe the variance is small.
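In the Bernoulli($1/2$) case, Hoeffding is in fact the (slightly) tighter of the two at every $t$, because Bernstein still pays the $Mt/3$ term in its denominator. A quick sketch (the choice $n = 100$ is illustrative):

```python
import math

def hoeffding(n, t):
    # 2 * exp(-2 n t^2) for variables in [0, 1]
    return 2 * math.exp(-2 * n * t**2)

def bernstein(n, t, sigma2=0.25, M=1.0):
    # 2 * exp(-(n t^2 / 2) / (sigma2 + M t / 3)); sigma2 = 1/4 is the maximum
    return 2 * math.exp(-(n * t**2 / 2) / (sigma2 + M * t / 3))

n = 100
for t in (0.05, 0.1, 0.2):
    print(f"t={t}: Hoeffding {hoeffding(n, t):.5f}, Bernstein {bernstein(n, t):.5f}")
```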

Watch Out

The constants matter at small sample sizes

Both bounds have constants (the 2 in Hoeffding's exponent, the $1/2$ and $1/3$ in Bernstein's). At small $n$ or small $t$, these constants can make one bound tighter than the other even in regimes where the other is asymptotically superior. Always check the actual numerical bound, not just the asymptotic rate.