
Concentration Probability

Chernoff Bounds

The Chernoff method: the universal technique for deriving exponential tail bounds by optimizing over the moment generating function, yielding the tightest possible exponential concentration.


Why This Matters

The Chernoff method is not a single inequality. It is a technique, arguably the single most important proof technique in concentration of measure. Every exponential tail bound you have seen (Hoeffding, Bernstein, sub-Gaussian bounds, Bennett) is derived by the same recipe:

  1. Exponentiate: convert $\Pr[X \geq t]$ into $\Pr[e^{sX} \geq e^{st}]$
  2. Apply Markov: $\Pr[e^{sX} \geq e^{st}] \leq e^{-st}\,\mathbb{E}[e^{sX}]$
  3. Optimize: choose $s > 0$ to minimize the bound

This three-step recipe, the Chernoff method, produces the tightest exponential bound achievable from the moment generating function. If you master this one idea, you can derive Hoeffding, Bernstein, and sub-Gaussian bounds from scratch.
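The recipe is mechanical enough to automate. Below is a minimal numerical sketch (an illustration, not from the original text): since $\Lambda_X(s) - st$ is convex in $s$, a ternary search over a bracket recovers the optimized bound. The function name and the bracket $[10^{-9}, 50]$ are arbitrary choices of this sketch.

```python
import math

def chernoff_bound(log_mgf, t, s_lo=1e-9, s_hi=50.0, iters=200):
    """Steps 1-3 of the recipe: minimize exp(log_mgf(s) - s*t) over s > 0.

    log_mgf(s) - s*t is convex in s, so ternary search finds the infimum,
    assuming the minimizer lies inside the bracket [s_lo, s_hi]."""
    f = lambda s: log_mgf(s) - s * t
    for _ in range(iters):
        m1 = s_lo + (s_hi - s_lo) / 3
        m2 = s_hi - (s_hi - s_lo) / 3
        if f(m1) < f(m2):
            s_hi = m2
        else:
            s_lo = m1
    return math.exp(f((s_lo + s_hi) / 2))

# Standard normal: log-MGF is s^2/2, optimal s* = t, bound exp(-t^2/2).
bound = chernoff_bound(lambda s: s * s / 2, t=3.0)
```

For the standard normal this reproduces the closed-form bound $e^{-t^2/2}$ derived in the Gaussian example further down the page.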

Mental Model

Think of the Chernoff method as applying Markov's inequality to the "best possible" monotone function of $X$. Markov says $\Pr[g(X) \geq g(t)] \leq \mathbb{E}[g(X)]/g(t)$ for any non-negative increasing $g$. The exponential $g(x) = e^{sx}$ is the optimal choice because it grows the fastest (among functions whose expectation is finite), penalizing large values of $X$ most aggressively. The free parameter $s > 0$ lets you tune the exponential to the specific threshold $t$.

Why is this tighter than Chebyshev? Chebyshev uses $g(x) = x^2$, which gives polynomial tails ($1/t^2$). The exponential $g(x) = e^{sx}$ gives exponential tails ($e^{-ct}$ or $e^{-ct^2}$), which are dramatically better for large $t$.
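To make the gap concrete, here is a small numeric comparison (an illustration added here, not from the text): for a standard normal, Chebyshev's variance bound gives $1/t^2$ while the Chernoff bound gives $e^{-t^2/2}$.

```python
import math

# Tail bounds at threshold t for a standard normal (variance 1):
# Chebyshev's polynomial bound vs. the Chernoff exponential bound.
rows = []
for t in (2.0, 4.0, 6.0):
    chebyshev = 1.0 / t**2           # 0.25, 0.0625, 0.028
    chernoff = math.exp(-t * t / 2)  # ~0.14, ~3.4e-4, ~1.5e-8
    rows.append((t, chebyshev, chernoff))
```

Already at $t = 6$ the exponential bound is about six orders of magnitude smaller.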

Formal Setup and Notation

Let $X$ be a real-valued random variable with moment generating function (MGF) $M_X(s) = \mathbb{E}[e^{sX}]$.

Definition

Moment Generating Function

The moment generating function of $X$ is $M_X(s) = \mathbb{E}[e^{sX}]$ for those $s \in \mathbb{R}$ where the expectation is finite. The MGF "generates" moments: $\mathbb{E}[X^k] = M_X^{(k)}(0)$. The existence of the MGF in a neighborhood of zero is equivalent to the distribution having exponentially decaying tails.

Definition

Log-Moment Generating Function (Cumulant Generating Function)

The log-MGF or cumulant generating function is $\Lambda_X(s) = \log \mathbb{E}[e^{sX}]$. It is convex in $s$. The Chernoff bound becomes: $\Pr[X \geq t] \leq \exp(\inf_{s > 0}[\Lambda_X(s) - st])$. The Legendre transform $\Lambda_X^*(t) = \sup_{s > 0}[st - \Lambda_X(s)]$ is the rate function of large deviations theory.
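As a sanity check on the Legendre-transform definition, one can compute $\Lambda_X^*(t)$ for a single Bernoulli($p$) two ways: from the closed form (it equals the KL divergence $D(t \,\|\, p)$, a standard fact) and by brute-force maximization over $s$. This is an illustrative sketch; the grid width and range are arbitrary.

```python
import math

def bernoulli_rate_closed_form(t, p):
    # Lambda*(t) = t log(t/p) + (1-t) log((1-t)/(1-p)), for p < t < 1
    return t * math.log(t / p) + (1 - t) * math.log((1 - t) / (1 - p))

def bernoulli_rate_by_sup(t, p, s_max=30.0, step=1e-3):
    # Direct sup over s > 0 of s*t - log(1 - p + p*e^s), on a crude grid
    best, s = 0.0, step
    while s < s_max:
        best = max(best, s * t - math.log(1 - p + p * math.exp(s)))
        s += step
    return best

closed = bernoulli_rate_closed_form(0.6, 0.3)   # ~0.1920
gridded = bernoulli_rate_by_sup(0.6, 0.3)       # should agree closely
```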

Core Method

Theorem

The Chernoff Method (General Upper Tail)

Statement

For any random variable $X$ whose MGF $M_X(s) = \mathbb{E}[e^{sX}]$ exists for some $s > 0$, and any $t \in \mathbb{R}$:

$$\Pr[X \geq t] \leq \inf_{s > 0} e^{-st} M_X(s) = \inf_{s > 0} \exp\!\bigl(\Lambda_X(s) - st\bigr) = \exp\!\bigl(-\Lambda_X^*(t)\bigr)$$

where $\Lambda_X^*(t) = \sup_{s > 0}[st - \Lambda_X(s)]$ is the Fenchel-Legendre dual of the log-MGF.

Intuition

The Chernoff method converts a tail probability into an optimization problem. For each $s > 0$, the bound $e^{-st}M_X(s)$ is valid. Different values of $s$ give different bounds, and you pick the best one. The optimal $s^*$ satisfies $\Lambda_X'(s^*) = t$: the tilted distribution has mean exactly $t$.

Geometrically: $\Lambda_X^*(t)$ is the Legendre transform of $\Lambda_X(s)$, which measures how "surprising" the event $\{X \geq t\}$ is relative to the distribution of $X$. Larger $\Lambda_X^*(t)$ means more surprising, hence less probable.

Proof Sketch

For any $s > 0$:

$$\Pr[X \geq t] = \Pr[e^{sX} \geq e^{st}] \leq \frac{\mathbb{E}[e^{sX}]}{e^{st}} = e^{-st} M_X(s)$$

The first equality is because $x \mapsto e^{sx}$ is strictly increasing. The inequality is Markov applied to the non-negative random variable $e^{sX}$. Since this holds for all $s > 0$, take the infimum.

Why It Matters

The Chernoff method is a meta-theorem: it transforms the problem of bounding tails into the problem of bounding the MGF, which is often easier.

Different MGF bounds lead to different named inequalities:

  • If $M_X(s) \leq e^{s^2\sigma^2/2}$: sub-Gaussian bound (Hoeffding-type)
  • If $M_X(s) \leq e^{s^2\sigma^2/(2(1 - sM/3))}$: Bernstein-type bound
  • If $M_X(s) = (1-p+pe^s)^n$: multiplicative Chernoff for Binomials
  • If $M_X(s) = (1-s)^{-1}$ (for $s < 1$): exponential tail bound

Every exponential concentration inequality is a Chernoff bound with a specific MGF estimate plugged in.

Failure Mode

The Chernoff method requires the MGF to be finite for some $s > 0$. For heavy-tailed distributions (Pareto, Cauchy), the MGF is infinite for all $s > 0$, and the Chernoff method gives nothing. For these, you need moment-based methods (Chebyshev for $1/t^2$, or higher-moment bounds for $1/t^p$).

Multiplicative Chernoff Bounds

The most important special case: sums of independent Bernoulli (or more generally, $[0,1]$-valued) random variables. The "multiplicative" form expresses deviations as a fraction of the mean, which gives tighter bounds than additive Hoeffding when the mean is small.

Theorem

Multiplicative Chernoff Bound

Statement

Let $X_1, \ldots, X_n$ be independent random variables with $X_i \in [0, 1]$. Let $S = \sum_{i=1}^n X_i$ and $\mu = \mathbb{E}[S]$. Then:

Upper tail: For $\delta > 0$: $$\Pr[S \geq (1+\delta)\mu] \leq \left(\frac{e^\delta}{(1+\delta)^{(1+\delta)}}\right)^\mu$$

Simplified upper tail: For $\delta \in (0, 1]$: $$\Pr[S \geq (1+\delta)\mu] \leq \exp\!\left(-\frac{\mu\delta^2}{3}\right)$$

Lower tail: For $\delta \in (0, 1)$: $$\Pr[S \leq (1-\delta)\mu] \leq \exp\!\left(-\frac{\mu\delta^2}{2}\right)$$
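The three bounds in the statement translate directly into code. The sketch below (the function names are mine, not from the text) also checks that the exact upper-tail form is never looser than the simplified one on $(0, 1]$.

```python
import math

def chernoff_upper_exact(mu, delta):
    # (e^delta / (1+delta)^(1+delta))^mu, valid for all delta > 0
    return math.exp(mu * (delta - (1 + delta) * math.log(1 + delta)))

def chernoff_upper_simple(mu, delta):
    # exp(-mu * delta^2 / 3), valid only for 0 < delta <= 1
    return math.exp(-mu * delta ** 2 / 3)

def chernoff_lower(mu, delta):
    # exp(-mu * delta^2 / 2), valid for 0 < delta < 1
    return math.exp(-mu * delta ** 2 / 2)

# The exact upper-tail form is at least as tight as the simplified one:
for d in (0.1, 0.5, 1.0):
    assert chernoff_upper_exact(10, d) <= chernoff_upper_simple(10, d)
```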

Intuition

The multiplicative form measures deviations relative to the mean. If $\mu = 10$, a deviation to $S = 15$ is a 50% overshoot ($\delta = 0.5$). The bound $e^{-\mu\delta^2/3}$ depends on $\mu\delta^2$: the product of the mean and the squared relative deviation.

Why is this better than Hoeffding for small means? Hoeffding gives $\Pr[S \geq \mu + t] \leq e^{-2t^2/n}$. With $t = \delta\mu$ and small $\mu$: Hoeffding gives $e^{-2\delta^2\mu^2/n}$ while multiplicative Chernoff gives $e^{-\delta^2\mu/3}$. When $\mu \ll n$ (rare events), Chernoff is dramatically tighter.

Proof Sketch

Step 1: Apply the Chernoff method to $S$ with parameter $s > 0$: $\Pr[S \geq (1+\delta)\mu] \leq e^{-s(1+\delta)\mu} \prod_{i=1}^n \mathbb{E}[e^{sX_i}]$.

Step 2: Bound the MGF of each $X_i$: since $X_i \in [0,1]$, $\mathbb{E}[e^{sX_i}] \leq 1 + \mathbb{E}[X_i](e^s - 1) \leq \exp(\mathbb{E}[X_i](e^s - 1))$ using $1 + x \leq e^x$.

Step 3: Multiply: $\prod_i \mathbb{E}[e^{sX_i}] \leq \exp(\mu(e^s - 1))$.

Step 4: Total bound: $\exp(\mu(e^s - 1) - s(1+\delta)\mu)$.

Step 5: Optimize: set $s = \ln(1+\delta)$ to get $\exp(\mu(\delta - (1+\delta)\ln(1+\delta)))$. The simplified form follows from the inequality $(1+\delta)\ln(1+\delta) \geq \delta + \delta^2/3$ for $\delta \in (0, 1]$.
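The inequality invoked in Step 5 is easy to spot-check numerically (a sanity check, not a proof):

```python
import math

# Check (1 + d) * ln(1 + d) >= d + d^2/3 on a grid over (0, 1].
d = 0.001
while d <= 1.0:
    assert (1 + d) * math.log(1 + d) >= d + d * d / 3
    d += 0.001
```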

Why It Matters

The multiplicative Chernoff bound is the standard tool for:

  • Randomized algorithms: bounding the probability that a random hash function has too many collisions
  • Sampling: how many samples to estimate a proportion within relative error $\delta$
  • Network analysis: bounding node degrees in random graphs
  • Binomial concentration: any setting where you sum independent 0/1 indicators and the expected sum may be small

The multiplicative form is especially important when $\mu = o(n)$: for rare events, Hoeffding wastes a factor of $n/\mu$ in the exponent.
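For the sampling application, the simplified bounds invert into a sample-size rule. The helper below is an illustrative sketch (the function name and the union-bound factor of 2 are my choices): forcing each one-sided tail below $\varepsilon/2$ via $e^{-np\delta^2/3} \le \varepsilon/2$ gives $n \ge 3\ln(2/\varepsilon)/(p\delta^2)$.

```python
import math

def samples_for_relative_error(p, delta, eps):
    """Smallest n (per this bound) so the empirical count of a rate-p event is
    within a factor (1 +/- delta) of its mean n*p, except with probability eps.

    Each tail is forced below eps/2; the upper tail's exponent mu*delta^2/3 is
    the weaker of the two constants (1/3 vs 1/2), so it determines n."""
    return math.ceil(3 * math.log(2 / eps) / (p * delta ** 2))

n = samples_for_relative_error(p=0.01, delta=0.5, eps=0.01)  # a few thousand
```

Note the $1/p$ dependence: rarer events need proportionally more samples for the same relative accuracy.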

Failure Mode

The simplified form $e^{-\mu\delta^2/3}$ is valid only for $\delta \in (0,1]$ (at most doubling the mean). For $\delta > 1$, you must use the exact form or the weaker bound $\Pr[S \geq t] \leq e^{-\mu} \cdot (e\mu/t)^t$ for $t \geq 2e\mu$. Also, the bound requires independence; for dependent indicators, you need Azuma-Hoeffding or other martingale methods.

Why Chernoff Gives the Tightest Exponential Bounds

The Chernoff bound $\Pr[X \geq t] \leq e^{-\Lambda^*(t)}$ is the best possible exponential bound derivable from the MGF. More precisely:

Claim. For any random variable $X$, among all bounds of the form $\Pr[X \geq t] \leq e^{-r(t)}$ that hold for all distributions with the same MGF, the Chernoff bound with $r(t) = \Lambda_X^*(t)$ is optimal.

Why. By the Legendre transform, $\Lambda_X^*(t) = \sup_s [st - \Lambda_X(s)]$. Any valid exponential bound based on the MGF must have $r(t) \leq st - \Lambda_X(s)$ for the worst-case $s$. Taking the supremum over $s$ gives exactly $\Lambda_X^*(t)$.

Connection to large deviations. For i.i.d. sums $S_n = X_1 + \cdots + X_n$, the Chernoff bound gives $\Pr[S_n/n \geq t] \leq e^{-n\Lambda^*(t)}$. The remarkable fact (Cramér's theorem) is that this is asymptotically exact:

$$\lim_{n \to \infty} \frac{1}{n}\log \Pr[S_n/n \geq t] = -\Lambda^*(t)$$

So the Chernoff method not only gives an upper bound; it finds the correct exponential rate. This is the starting point of large deviations theory.
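Cramér's asymptotic can be watched converging for Bernoulli(1/2) sums, where the tail probability is computable exactly (an illustrative check; the function names are mine):

```python
import math

def binom_tail(n, p, k):
    # Exact Pr[S >= k] for S ~ Binomial(n, p)
    return sum(math.comb(n, j) * p**j * (1 - p)**(n - j) for j in range(k, n + 1))

def rate(t, p):
    # Lambda*(t) for Bernoulli(p): the KL divergence D(t || p)
    return t * math.log(t / p) + (1 - t) * math.log((1 - t) / (1 - p))

p, t = 0.5, 0.7
# -(1/n) log Pr[S_n/n >= t] should approach rate(0.7, 0.5) ~ 0.0823 from above
empirical = [-math.log(binom_tail(n, p, math.ceil(n * t))) / n
             for n in (50, 200, 800)]
```

The $O((\log n)/n)$ polynomial correction is visible at small $n$ and shrinks as $n$ grows.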

Proof Ideas and Templates Used

The Chernoff method is a proof template consisting of three steps:

  1. Exponentiate: introduce the free parameter $s$
  2. Apply Markov: convert the tail probability to the MGF
  3. Optimize: choose $s$ to minimize the bound

This template is universal. The art lies in step 2.5: bounding the MGF. Different MGF bounds produce different named inequalities:

| MGF bound | Named result | Tail type |
| --- | --- | --- |
| Hoeffding's lemma ($X \in [a,b]$) | Hoeffding's inequality | $e^{-ct^2}$ |
| Bernstein condition (variance + bound) | Bernstein's inequality | $e^{-ct^2/(v+Mt)}$ |
| Sub-Gaussian definition | Sub-Gaussian tail | $e^{-ct^2}$ |
| Exact Binomial MGF | Multiplicative Chernoff | $e^{-c\mu\delta^2}$ |

Canonical Examples

Example

Chernoff for a standard Gaussian

Let $X \sim \mathcal{N}(0, 1)$. The MGF is $M_X(s) = e^{s^2/2}$. The Chernoff bound gives:

$$\Pr[X \geq t] \leq \inf_{s > 0} e^{-st + s^2/2}$$

Optimize: $\frac{d}{ds}(-st + s^2/2) = 0$ gives $s^* = t$. So $\Pr[X \geq t] \leq e^{-t^2/2}$.

This is the standard Gaussian tail bound. The exact Gaussian tail is $\Pr[X \geq t] = \frac{1}{\sqrt{2\pi}} \int_t^\infty e^{-x^2/2}\,dx \approx \frac{e^{-t^2/2}}{t\sqrt{2\pi}}$, so Chernoff is tight up to the polynomial factor $1/(t\sqrt{2\pi})$.
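The claimed polynomial gap can be checked with the standard library, since `math.erfc` gives the exact Gaussian tail; the threshold $t = 3$ below is an arbitrary example:

```python
import math

t = 3.0
chernoff = math.exp(-t * t / 2)            # e^{-4.5} ~ 0.0111
exact = 0.5 * math.erfc(t / math.sqrt(2))  # Pr[X >= 3] ~ 0.00135
ratio = chernoff / exact                   # ~8.2 (asymptotic prediction t*sqrt(2*pi) ~ 7.5)
```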

Example

Estimating a rare event probability

Flip a coin with heads probability $p = 0.01$ a total of $n = 1000$ times. Let $S$ = number of heads, so $\mu = \mathbb{E}[S] = 10$.

What is $\Pr[S \geq 20]$? This is $\Pr[S \geq 2\mu]$, i.e., $\delta = 1$.

Multiplicative Chernoff: $\Pr[S \geq 20] \leq e^{-\mu/3} = e^{-10/3} \approx 0.036$.

Hoeffding: $\Pr[S \geq 20] = \Pr[\bar{X} - p \geq 0.01] \leq e^{-2n(0.01)^2} = e^{-0.2} \approx 0.82$.

Chernoff gives $3.6\%$ vs. Hoeffding's $82\%$. The multiplicative Chernoff is dramatically tighter because it uses the fact that $\mu$ is small relative to $n$.
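These numbers are easy to reproduce, and the standard library can also compute the exact Binomial tail for reference (the exact value is not stated in the text; it turns out to be well below both bounds):

```python
import math

n, p = 1000, 0.01
mu = n * p  # 10
chernoff = math.exp(-mu / 3)              # e^{-10/3} ~ 0.0357
hoeffding = math.exp(-2 * n * 0.01 ** 2)  # e^{-0.2}  ~ 0.819

# Exact Pr[S >= 20] via the Binomial pmf, for comparison:
exact = sum(math.comb(n, k) * p**k * (1 - p)**(n - k) for k in range(20, n + 1))
# exact ~ 0.003: both bounds are valid, Chernoff far tighter than Hoeffding
```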

Common Confusions

Watch Out

Chernoff is a method, not a single inequality

When people say "Chernoff bound," they sometimes mean the general method (optimize over $s$ in the MGF bound) and sometimes mean the specific multiplicative bound for sums of independent $[0,1]$ variables. Be aware of which one is meant. Hoeffding's inequality is a Chernoff bound, derived by plugging Hoeffding's lemma into the Chernoff method. The method is more general than any single application.

Watch Out

The Chernoff bound is one-sided

The basic Chernoff method with $s > 0$ bounds the upper tail $\Pr[X \geq t]$. For the lower tail $\Pr[X \leq t]$, use $s < 0$ (equivalently, apply the upper-tail bound to $-X$). For two-sided bounds, combine the two one-sided bounds via a union bound, picking up a factor of 2.

Watch Out

MGF must exist for Chernoff to work

The Chernoff method requires $M_X(s) < \infty$ for some $s > 0$. This fails for heavy-tailed distributions: the Cauchy distribution has no finite moments at all, and the rate-1 exponential distribution has $M_X(s) = \infty$ for $s \geq 1$. For exponential random variables, you can still apply Chernoff, but only with $s < 1$, which caps how fast the resulting tail bound can decay.

Connection to Large Deviations: Sanov's Theorem Preview

The Chernoff bound for i.i.d. sums is the starting point of large deviations theory. Cramér's theorem says $\Pr[S_n/n \geq t] \approx e^{-n\Lambda^*(t)}$, and this extends to a much more general setting.

Sanov's theorem generalizes this to empirical distributions: if $\hat{P}_n$ is the empirical distribution of $n$ i.i.d. draws from $P$, then the probability that $\hat{P}_n$ lands in a set $\mathcal{E}$ of distributions decays as $e^{-n \inf_{Q \in \mathcal{E}} D_{\text{KL}}(Q \| P)}$, where $D_{\text{KL}}$ is the KL divergence.

This connects Chernoff bounds to information theory: the "rate" at which empirical observations deviate from the truth is measured by KL divergence.

Summary

  • The Chernoff method: $\Pr[X \geq t] \leq e^{-st} M_X(s)$; optimize over $s > 0$
  • This gives the tightest exponential bound derivable from the MGF
  • Multiplicative Chernoff for Binomials: $\Pr[S \geq (1+\delta)\mu] \leq e^{-\mu\delta^2/3}$ for $\delta \in (0,1]$
  • The multiplicative form is tighter than Hoeffding when $\mu \ll n$
  • Every exponential concentration inequality (Hoeffding, Bernstein, etc.) is a Chernoff bound with a specific MGF estimate
  • The optimal rate $\Lambda^*(t)$ equals the large deviations rate function (Cramér's theorem)
  • Requires the MGF to exist; fails for heavy-tailed distributions

Exercises

ExerciseCore

Problem

Let $X \sim \text{Exp}(1)$ (exponential with rate 1). Use the Chernoff method to derive a tail bound for $\Pr[X \geq t]$ for $t > 1$.

ExerciseCore

Problem

A randomized algorithm succeeds with probability $p = 0.7$ on each independent trial. You run it $n = 100$ times and take a majority vote. Use the multiplicative Chernoff bound to bound the probability that fewer than 50 trials succeed (i.e., the majority vote fails).

ExerciseAdvanced

Problem

Derive Hoeffding's inequality from the Chernoff method. Specifically, let $X_1, \ldots, X_n$ be independent with $X_i \in [a_i, b_i]$. Starting from the Chernoff bound, use Hoeffding's lemma ($\mathbb{E}[e^{s(X_i - \mu_i)}] \leq e^{s^2(b_i - a_i)^2/8}$) to derive:

$$\Pr[\bar{X}_n - \mu \geq t] \leq \exp\!\left(-\frac{2n^2t^2}{\sum_i(b_i-a_i)^2}\right)$$


References

Canonical:

  • Mitzenmacher & Upfal, Probability and Computing (2017), Chapter 4
  • Boucheron, Lugosi, Massart, Concentration Inequalities (2013), Chapter 2
  • Vershynin, High-Dimensional Probability (2018), Chapter 2

Current:

  • Motwani & Raghavan, Randomized Algorithms (1995), Chapter 4
  • Wainwright, High-Dimensional Statistics (2019), Chapter 2
  • van Handel, Probability in High Dimension (2016), Chapters 1-3


Last reviewed: April 2026
