
Concentration Probability

Chernoff Bounds

The Chernoff method: the universal technique for deriving exponential tail bounds by optimizing over the moment generating function, yielding the tightest possible exponential concentration.


Why This Matters

The Chernoff method is not a single inequality. It is a technique, arguably the single most important proof technique in concentration of measure. Every exponential tail bound you have seen (Hoeffding, Bernstein, sub-Gaussian bounds, Bennett) is derived by the same recipe:

  1. Exponentiate: convert $\Pr[X \geq t]$ into $\Pr[e^{sX} \geq e^{st}]$
  2. Apply Markov: $\Pr[e^{sX} \geq e^{st}] \leq e^{-st}\,\mathbb{E}[e^{sX}]$
  3. Optimize: choose $s > 0$ to minimize the bound

This three-step recipe, the Chernoff method, produces the tightest exponential bound achievable from the moment generating function. If you master this one idea, you can derive Hoeffding, Bernstein, and sub-Gaussian bounds from scratch.
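The recipe is mechanical enough to automate. Below is a minimal numerical sketch (an illustration, not from the original text): since $\Lambda_X(s) - st$ is convex in $s$, a ternary search over a bracket recovers the optimized bound. The function name and the bracket $[10^{-9}, 50]$ are arbitrary choices of this sketch.

```python
import math

def chernoff_bound(log_mgf, t, s_lo=1e-9, s_hi=50.0, iters=200):
    """Steps 1-3 of the recipe: minimize exp(log_mgf(s) - s*t) over s > 0.

    log_mgf(s) - s*t is convex in s, so ternary search finds the infimum,
    assuming the minimizer lies inside the bracket [s_lo, s_hi]."""
    f = lambda s: log_mgf(s) - s * t
    for _ in range(iters):
        m1 = s_lo + (s_hi - s_lo) / 3
        m2 = s_hi - (s_hi - s_lo) / 3
        if f(m1) < f(m2):
            s_hi = m2
        else:
            s_lo = m1
    return math.exp(f((s_lo + s_hi) / 2))

# Standard normal: log-MGF is s^2/2, optimal s* = t, bound exp(-t^2/2).
bound = chernoff_bound(lambda s: s * s / 2, t=3.0)
```

For the standard normal this reproduces the closed-form bound $e^{-t^2/2}$ derived in the Gaussian example further down the page.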

Mental Model

Think of the Chernoff method as applying Markov's inequality to the "best possible" monotone function of $X$. Markov says $\Pr[g(X) \geq g(t)] \leq \mathbb{E}[g(X)]/g(t)$ for any non-negative increasing $g$. The exponential $g(x) = e^{sx}$ is the optimal choice because it grows the fastest (among functions whose expectation is finite), penalizing large values of $X$ most aggressively. The free parameter $s > 0$ lets you tune the exponential to the specific threshold $t$.

Why is this tighter than Chebyshev? Chebyshev uses $g(x) = x^2$, which gives polynomial tails ($1/t^2$). The exponential $g(x) = e^{sx}$ gives exponential tails ($e^{-ct}$ or $e^{-ct^2}$), which are dramatically better for large $t$.
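To make the gap concrete, here is a small numeric comparison (an illustration added here, not from the text): for a standard normal, Chebyshev's variance bound gives $1/t^2$ while the Chernoff bound gives $e^{-t^2/2}$.

```python
import math

# Tail bounds at threshold t for a standard normal (variance 1):
# Chebyshev's polynomial bound vs. the Chernoff exponential bound.
rows = []
for t in (2.0, 4.0, 6.0):
    chebyshev = 1.0 / t**2           # 0.25, 0.0625, 0.028
    chernoff = math.exp(-t * t / 2)  # ~0.14, ~3.4e-4, ~1.5e-8
    rows.append((t, chebyshev, chernoff))
```

Already at $t = 6$ the exponential bound is about six orders of magnitude smaller.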

Formal Setup and Notation

Let $X$ be a real-valued random variable with moment generating function (MGF) $M_X(s) = \mathbb{E}[e^{sX}]$.

Definition

Moment Generating Function

The moment generating function of $X$ is $M_X(s) = \mathbb{E}[e^{sX}]$ for those $s \in \mathbb{R}$ where the expectation is finite. The MGF "generates" moments: $\mathbb{E}[X^k] = M_X^{(k)}(0)$. The existence of the MGF in a neighborhood of zero is equivalent to the distribution having exponentially decaying tails.

Definition

Log-Moment Generating Function (Cumulant Generating Function)

The log-MGF or cumulant generating function is $\Lambda_X(s) = \log \mathbb{E}[e^{sX}]$. It is convex in $s$. The Chernoff bound becomes: $\Pr[X \geq t] \leq \exp(\inf_{s > 0}[\Lambda_X(s) - st])$. The Legendre transform $\Lambda_X^*(t) = \sup_{s > 0}[st - \Lambda_X(s)]$ is the rate function of large deviations theory.
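As a sanity check on the Legendre-transform definition, one can compute $\Lambda_X^*(t)$ for a single Bernoulli($p$) two ways: from the closed form (it equals the KL divergence $D(t \,\|\, p)$, a standard fact) and by brute-force maximization over $s$. This is an illustrative sketch; the grid width and range are arbitrary.

```python
import math

def bernoulli_rate_closed_form(t, p):
    # Lambda*(t) = t log(t/p) + (1-t) log((1-t)/(1-p)), for p < t < 1
    return t * math.log(t / p) + (1 - t) * math.log((1 - t) / (1 - p))

def bernoulli_rate_by_sup(t, p, s_max=30.0, step=1e-3):
    # Direct sup over s > 0 of s*t - log(1 - p + p*e^s), on a crude grid
    best, s = 0.0, step
    while s < s_max:
        best = max(best, s * t - math.log(1 - p + p * math.exp(s)))
        s += step
    return best

closed = bernoulli_rate_closed_form(0.6, 0.3)   # ~0.1920
gridded = bernoulli_rate_by_sup(0.6, 0.3)       # should agree closely
```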

Core Method

Theorem

The Chernoff Method (General Upper Tail)

Statement

For any random variable $X$ whose MGF $M_X(s) = \mathbb{E}[e^{sX}]$ exists for some $s > 0$, and any $t \in \mathbb{R}$:

$$\Pr[X \geq t] \leq \inf_{s > 0} e^{-st} M_X(s) = \inf_{s > 0} \exp\!\bigl(\Lambda_X(s) - st\bigr) = \exp\!\bigl(-\Lambda_X^*(t)\bigr)$$

where $\Lambda_X^*(t) = \sup_{s > 0}[st - \Lambda_X(s)]$ is the Fenchel-Legendre dual of the log-MGF.

Intuition

The Chernoff method converts a tail probability into an optimization problem. For each $s > 0$, the bound $e^{-st}M_X(s)$ is valid. Different values of $s$ give different bounds, and you pick the best one. The optimal $s^*$ satisfies $\Lambda_X'(s^*) = t$: the tilted distribution has mean exactly $t$.

Geometrically: $\Lambda_X^*(t)$ is the Legendre transform of $\Lambda_X(s)$, which measures how "surprising" the event $\{X \geq t\}$ is relative to the distribution of $X$. Larger $\Lambda_X^*(t)$ means more surprising, hence less probable.

Proof Sketch

For any $s > 0$:

$$\Pr[X \geq t] = \Pr[e^{sX} \geq e^{st}] \leq \frac{\mathbb{E}[e^{sX}]}{e^{st}} = e^{-st} M_X(s)$$

The first equality is because $x \mapsto e^{sx}$ is strictly increasing. The inequality is Markov applied to the non-negative random variable $e^{sX}$. Since this holds for all $s > 0$, take the infimum.

Why It Matters

The Chernoff method is a meta-theorem: it transforms the problem of bounding tails into the problem of bounding the MGF, which is often easier.

Different MGF bounds lead to different named inequalities:

  • If $M_X(s) \leq e^{s^2\sigma^2/2}$: sub-Gaussian bound (Hoeffding-type)
  • If $M_X(s) \leq e^{s^2\sigma^2/(2(1 - sM/3))}$: Bernstein-type bound
  • If $M_X(s) = (1-p+pe^s)^n$: multiplicative Chernoff for Binomials
  • If $M_X(s) = (1-s)^{-1}$ (for $s < 1$): exponential tail bound

Every exponential concentration inequality is a Chernoff bound with a specific MGF estimate plugged in.

Failure Mode

The Chernoff method requires the MGF to be finite for some $s > 0$. For heavy-tailed distributions (Pareto, Cauchy), the MGF is infinite for all $s > 0$, and the Chernoff method gives nothing. For these, you need moment-based methods (Chebyshev for $1/t^2$, or higher-moment bounds for $1/t^p$).

Multiplicative Chernoff Bounds

The most important special case: sums of independent Bernoulli (or more generally, $[0,1]$-valued) random variables. The "multiplicative" form expresses deviations as a fraction of the mean, which gives tighter bounds than additive Hoeffding when the mean is small.

Theorem

Multiplicative Chernoff Bound

Statement

Let $X_1, \ldots, X_n$ be independent random variables with $X_i \in [0, 1]$. Let $S = \sum_{i=1}^n X_i$ and $\mu = \mathbb{E}[S]$. Then:

Upper tail: For $\delta > 0$: $$\Pr[S \geq (1+\delta)\mu] \leq \left(\frac{e^\delta}{(1+\delta)^{(1+\delta)}}\right)^\mu$$

Simplified upper tail: For $\delta \in (0, 1]$: $$\Pr[S \geq (1+\delta)\mu] \leq \exp\!\left(-\frac{\mu\delta^2}{3}\right)$$

Lower tail: For $\delta \in (0, 1)$: $$\Pr[S \leq (1-\delta)\mu] \leq \exp\!\left(-\frac{\mu\delta^2}{2}\right)$$
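The three bounds in the statement translate directly into code. The sketch below (the function names are mine, not from the text) also checks that the exact upper-tail form is never looser than the simplified one on $(0, 1]$.

```python
import math

def chernoff_upper_exact(mu, delta):
    # (e^delta / (1+delta)^(1+delta))^mu, valid for all delta > 0
    return math.exp(mu * (delta - (1 + delta) * math.log(1 + delta)))

def chernoff_upper_simple(mu, delta):
    # exp(-mu * delta^2 / 3), valid only for 0 < delta <= 1
    return math.exp(-mu * delta ** 2 / 3)

def chernoff_lower(mu, delta):
    # exp(-mu * delta^2 / 2), valid for 0 < delta < 1
    return math.exp(-mu * delta ** 2 / 2)

# The exact upper-tail form is at least as tight as the simplified one:
for d in (0.1, 0.5, 1.0):
    assert chernoff_upper_exact(10, d) <= chernoff_upper_simple(10, d)
```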

Intuition

The multiplicative form measures deviations relative to the mean. If $\mu = 10$, a deviation to $S = 15$ is a 50% overshoot ($\delta = 0.5$). The bound $e^{-\mu\delta^2/3}$ depends on $\mu\delta^2$: the product of the mean and the squared relative deviation.

Why is this better than Hoeffding for small means? Hoeffding gives $\Pr[S \geq \mu + t] \leq e^{-2t^2/n}$. With $t = \delta\mu$ and small $\mu$: Hoeffding gives $e^{-2\delta^2\mu^2/n}$ while multiplicative Chernoff gives $e^{-\delta^2\mu/3}$. When $\mu \ll n$ (rare events), Chernoff is dramatically tighter.

Proof Sketch

Step 1: Apply the Chernoff method to $S$ with parameter $s > 0$: $\Pr[S \geq (1+\delta)\mu] \leq e^{-s(1+\delta)\mu} \prod_{i=1}^n \mathbb{E}[e^{sX_i}]$.

Step 2: Bound the MGF of each $X_i$: since $X_i \in [0,1]$, $\mathbb{E}[e^{sX_i}] \leq 1 + \mathbb{E}[X_i](e^s - 1) \leq \exp(\mathbb{E}[X_i](e^s - 1))$ using $1 + x \leq e^x$.

Step 3: Multiply: $\prod_i \mathbb{E}[e^{sX_i}] \leq \exp(\mu(e^s - 1))$.

Step 4: Total bound: $\exp(\mu(e^s - 1) - s(1+\delta)\mu)$.

Step 5: Optimize: set $s = \ln(1+\delta)$ to get $\exp(\mu(\delta - (1+\delta)\ln(1+\delta)))$. The simplified form follows from the inequality $(1+\delta)\ln(1+\delta) \geq \delta + \delta^2/3$ for $\delta \in (0, 1]$.
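The inequality invoked in Step 5 is easy to spot-check numerically (a sanity check, not a proof):

```python
import math

# Check (1 + d) * ln(1 + d) >= d + d^2/3 on a grid over (0, 1].
d = 0.001
while d <= 1.0:
    assert (1 + d) * math.log(1 + d) >= d + d * d / 3
    d += 0.001
```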

Why It Matters

The multiplicative Chernoff bound is the standard tool for:

  • Randomized algorithms: bounding the probability that a random hash function has too many collisions
  • Sampling: how many samples to estimate a proportion within relative error $\delta$
  • Network analysis: bounding node degrees in random graphs
  • Binomial concentration: any setting where you sum independent 0/1 indicators and the expected sum may be small

The multiplicative form is especially important when $\mu = o(n)$: for rare events, Hoeffding wastes a factor of $n/\mu$ in the exponent.
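For the sampling application, the simplified bounds invert into a sample-size rule. The helper below is an illustrative sketch (the function name and the union-bound factor of 2 are my choices): forcing each one-sided tail below $\varepsilon/2$ via $e^{-np\delta^2/3} \le \varepsilon/2$ gives $n \ge 3\ln(2/\varepsilon)/(p\delta^2)$.

```python
import math

def samples_for_relative_error(p, delta, eps):
    """Smallest n (per this bound) so the empirical count of a rate-p event is
    within a factor (1 +/- delta) of its mean n*p, except with probability eps.

    Each tail is forced below eps/2; the upper tail's exponent mu*delta^2/3 is
    the weaker of the two constants (1/3 vs 1/2), so it determines n."""
    return math.ceil(3 * math.log(2 / eps) / (p * delta ** 2))

n = samples_for_relative_error(p=0.01, delta=0.5, eps=0.01)  # a few thousand
```

Note the $1/p$ dependence: rarer events need proportionally more samples for the same relative accuracy.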

Failure Mode

The simplified form $e^{-\mu\delta^2/3}$ is valid only for $\delta \in (0,1]$ (at most doubling the mean). For $\delta > 1$, you must use the exact form or the weaker bound $\Pr[S \geq t] \leq e^{-\mu} \cdot (e\mu/t)^t$ for $t \geq 2e\mu$. Also, the bound requires independence; for dependent indicators, you need Azuma-Hoeffding or other martingale methods.

Why Chernoff Gives the Tightest Exponential Bounds

The Chernoff bound $\Pr[X \geq t] \leq e^{-\Lambda^*(t)}$ is the best possible exponential bound derivable from the MGF. More precisely:

Claim. For any random variable $X$, among all bounds of the form $\Pr[X \geq t] \leq e^{-r(t)}$ that hold for all distributions with the same MGF, the Chernoff bound with $r(t) = \Lambda_X^*(t)$ is optimal.

Why. By the Legendre transform, $\Lambda_X^*(t) = \sup_s [st - \Lambda_X(s)]$. Any valid exponential bound based on the MGF must have $r(t) \leq st - \Lambda_X(s)$ for the worst-case $s$. Taking the supremum over $s$ gives exactly $\Lambda_X^*(t)$.

Connection to large deviations. For i.i.d. sums $S_n = X_1 + \cdots + X_n$, the Chernoff bound gives $\Pr[S_n/n \geq t] \leq e^{-n\Lambda^*(t)}$. The remarkable fact (Cramér's theorem) is that this is asymptotically exact:

$$\lim_{n \to \infty} \frac{1}{n}\log \Pr[S_n/n \geq t] = -\Lambda^*(t)$$

So the Chernoff method not only gives an upper bound; it finds the correct exponential rate. This is the starting point of large deviations theory.
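Cramér's asymptotic can be watched converging for Bernoulli(1/2) sums, where the tail probability is computable exactly (an illustrative check; the function names are mine):

```python
import math

def binom_tail(n, p, k):
    # Exact Pr[S >= k] for S ~ Binomial(n, p)
    return sum(math.comb(n, j) * p**j * (1 - p)**(n - j) for j in range(k, n + 1))

def rate(t, p):
    # Lambda*(t) for Bernoulli(p): the KL divergence D(t || p)
    return t * math.log(t / p) + (1 - t) * math.log((1 - t) / (1 - p))

p, t = 0.5, 0.7
# -(1/n) log Pr[S_n/n >= t] should approach rate(0.7, 0.5) ~ 0.0823 from above
empirical = [-math.log(binom_tail(n, p, math.ceil(n * t))) / n
             for n in (50, 200, 800)]
```

The $O((\log n)/n)$ polynomial correction is visible at small $n$ and shrinks as $n$ grows.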

Proof Ideas and Templates Used

The Chernoff method is a proof template consisting of three steps:

  1. Exponentiate: introduce the free parameter $s$
  2. Apply Markov: convert the tail probability to the MGF
  3. Optimize: choose $s$ to minimize the bound

This template is universal. The art lies in step 2.5: bounding the MGF. Different MGF bounds produce different named inequalities:

| MGF bound | Named result | Tail type |
| --- | --- | --- |
| Hoeffding's lemma ($X \in [a,b]$) | Hoeffding's inequality | $e^{-ct^2}$ |
| Bernstein condition (variance + bound) | Bernstein's inequality | $e^{-ct^2/(v+Mt)}$ |
| Sub-Gaussian definition | Sub-Gaussian tail | $e^{-ct^2}$ |
| Exact Binomial MGF | Multiplicative Chernoff | $e^{-c\mu\delta^2}$ |

Canonical Examples

Example

Chernoff for a standard Gaussian

Let $X \sim \mathcal{N}(0, 1)$. The MGF is $M_X(s) = e^{s^2/2}$. The Chernoff bound gives:

$$\Pr[X \geq t] \leq \inf_{s > 0} e^{-st + s^2/2}$$

Optimize: $\frac{d}{ds}(-st + s^2/2) = 0$ gives $s^* = t$. So $\Pr[X \geq t] \leq e^{-t^2/2}$.

This is the standard Gaussian tail bound. The exact Gaussian tail is $\Pr[X \geq t] = \frac{1}{\sqrt{2\pi}} \int_t^\infty e^{-x^2/2}\,dx \approx \frac{e^{-t^2/2}}{t\sqrt{2\pi}}$, so Chernoff is tight up to the polynomial factor $1/(t\sqrt{2\pi})$.
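The claimed polynomial gap can be checked with the standard library, since `math.erfc` gives the exact Gaussian tail; the threshold $t = 3$ below is an arbitrary example:

```python
import math

t = 3.0
chernoff = math.exp(-t * t / 2)            # e^{-4.5} ~ 0.0111
exact = 0.5 * math.erfc(t / math.sqrt(2))  # Pr[X >= 3] ~ 0.00135
ratio = chernoff / exact                   # ~8.2 (asymptotic prediction t*sqrt(2*pi) ~ 7.5)
```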

Example

Estimating a rare event probability

Flip a coin with heads probability $p = 0.01$ a total of $n = 1000$ times. Let $S$ = number of heads, so $\mu = \mathbb{E}[S] = 10$.

What is $\Pr[S \geq 20]$? This is $\Pr[S \geq 2\mu]$, i.e., $\delta = 1$.

Multiplicative Chernoff: $\Pr[S \geq 20] \leq e^{-\mu/3} = e^{-10/3} \approx 0.036$.

Hoeffding: $\Pr[S \geq 20] = \Pr[\bar{X} - p \geq 0.01] \leq e^{-2n(0.01)^2} = e^{-0.2} \approx 0.82$.

Chernoff gives $3.6\%$ vs. Hoeffding's $82\%$. The multiplicative Chernoff is dramatically tighter because it uses the fact that $\mu$ is small relative to $n$.
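These numbers are easy to reproduce, and the standard library can also compute the exact Binomial tail for reference (the exact value is not stated in the text; it turns out to be well below both bounds):

```python
import math

n, p = 1000, 0.01
mu = n * p  # 10
chernoff = math.exp(-mu / 3)              # e^{-10/3} ~ 0.0357
hoeffding = math.exp(-2 * n * 0.01 ** 2)  # e^{-0.2}  ~ 0.819

# Exact Pr[S >= 20] via the Binomial pmf, for comparison:
exact = sum(math.comb(n, k) * p**k * (1 - p)**(n - k) for k in range(20, n + 1))
# exact ~ 0.003: both bounds are valid, Chernoff far tighter than Hoeffding
```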

Common Confusions

Watch Out

Chernoff is a method, not a single inequality

When people say "Chernoff bound," they sometimes mean the general method (optimize over $s$ in the MGF bound) and sometimes mean the specific multiplicative bound for sums of independent $[0,1]$ variables. Be aware of which one is meant. Hoeffding's inequality is a Chernoff bound, derived by plugging Hoeffding's lemma into the Chernoff method. The method is more general than any single application.

Watch Out

The Chernoff bound is one-sided

The basic Chernoff method with $s > 0$ bounds the upper tail $\Pr[X \geq t]$. For the lower tail $\Pr[X \leq t]$, use $s < 0$ (equivalently, apply the upper-tail bound to $-X$). For two-sided bounds, combine the two one-sided bounds via a union bound, picking up a factor of 2.

Watch Out

MGF must exist for Chernoff to work

The Chernoff method requires $M_X(s) < \infty$ for some $s > 0$. This fails for heavy-tailed distributions: the Cauchy distribution has no finite moments at all, and the rate-1 exponential distribution has $M_X(s) = \infty$ for $s \geq 1$. For exponential random variables, you can still apply Chernoff, but only with $s < 1$, which caps how fast the resulting tail bound can decay.

Connection to Large Deviations: Sanov's Theorem Preview

The Chernoff bound for i.i.d. sums is the starting point of large deviations theory. Cramér's theorem says $\Pr[S_n/n \geq t] \approx e^{-n\Lambda^*(t)}$, and this extends to a much more general setting.

Sanov's theorem generalizes this to empirical distributions: if $\hat{P}_n$ is the empirical distribution of $n$ i.i.d. draws from $P$, then the probability that $\hat{P}_n$ lands in a set $\mathcal{E}$ of distributions decays as $e^{-n \inf_{Q \in \mathcal{E}} D_{\text{KL}}(Q \| P)}$, where $D_{\text{KL}}$ is the KL divergence.

This connects Chernoff bounds to information theory: the "rate" at which empirical observations deviate from the truth is measured by KL divergence.

Summary

  • The Chernoff method: $\Pr[X \geq t] \leq e^{-st} M_X(s)$; optimize over $s > 0$
  • This gives the tightest exponential bound derivable from the MGF
  • Multiplicative Chernoff for Binomials: $\Pr[S \geq (1+\delta)\mu] \leq e^{-\mu\delta^2/3}$ for $\delta \in (0,1]$
  • The multiplicative form is tighter than Hoeffding when $\mu \ll n$
  • Every exponential concentration inequality (Hoeffding, Bernstein, etc.) is a Chernoff bound with a specific MGF estimate
  • The optimal rate $\Lambda^*(t)$ equals the large deviations rate function (Cramér's theorem)
  • Requires the MGF to exist; fails for heavy-tailed distributions

Exercises

ExerciseCore

Problem

Let $X \sim \text{Exp}(1)$ (exponential with rate 1). Use the Chernoff method to derive a tail bound for $\Pr[X \geq t]$ for $t > 1$.

ExerciseCore

Problem

A randomized algorithm succeeds with probability $p = 0.7$ on each independent trial. You run it $n = 100$ times and take a majority vote. Use the multiplicative Chernoff bound to bound the probability that fewer than 50 trials succeed (i.e., the majority vote fails).

ExerciseAdvanced

Problem

Derive Hoeffding's inequality from the Chernoff method. Specifically, let $X_1, \ldots, X_n$ be independent with $X_i \in [a_i, b_i]$. Starting from the Chernoff bound, use Hoeffding's lemma ($\mathbb{E}[e^{s(X_i - \mu_i)}] \leq e^{s^2(b_i - a_i)^2/8}$) to derive:

$$\Pr[\bar{X}_n - \mu \geq t] \leq \exp\!\left(-\frac{2n^2t^2}{\sum_i(b_i-a_i)^2}\right)$$


References

Canonical:

  • Mitzenmacher & Upfal, Probability and Computing (2017), Chapter 4
  • Boucheron, Lugosi, Massart, Concentration Inequalities (2013), Chapter 2
  • Vershynin, High-Dimensional Probability (2018), Chapter 2

Current:

  • Motwani & Raghavan, Randomized Algorithms (1995), Chapter 4
  • Wainwright, High-Dimensional Statistics (2019), Chapter 2
  • van Handel, Probability in High Dimension (2016), Chapters 1-3


Last reviewed: April 2026
