
Concentration Probability

Sub-Gaussian Random Variables

Sub-Gaussian random variables: the precise characterization of 'light-tailed' behavior that underpins every concentration inequality in learning theory.


Why This Matters

Concentration inequalities are the engine of learning theory. Every generalization bound you will encounter, from VC dimension to Rademacher complexity to PAC-Bayes, relies on controlling how far a sample average deviates from its expectation. Sub-Gaussian random variables are the precise class for which this control is as tight as if the variable were Gaussian, even when it is not.

If you understand sub-Gaussianity, you understand why concentration works. If you do not, every bound you see will feel like magic.

[Figure: tail probability $\mathbb{P}(X \geq t)$ on a log scale versus deviation $t$, comparing the Gaussian tail $e^{-t^2/2}$, a sub-Gaussian bound, a sub-exponential tail, and a heavy tail decaying like $1/t^2$.]

Mental Model

A random variable is sub-Gaussian if its tails decay at least as fast as those of a Gaussian. Concretely: the probability of being far from the mean drops like $e^{-ct^2}$, not merely $e^{-ct}$ (sub-exponential) or $1/t^p$ (polynomial). This exponential-squared decay is what gives you the $\sqrt{\log(1/\delta)/n}$ rates in learning theory.

The key insight: many non-Gaussian distributions (bounded random variables, Rademacher random variables, projections of high-dimensional uniform distributions) all satisfy this same tail behavior. Sub-Gaussianity captures exactly what these distributions have in common.

Formal Setup and Notation

Let $X$ be a real-valued random variable with $\mathbb{E}[X] = 0$ (we center without loss of generality).

Definition

Sub-Gaussian Random Variable (MGF characterization)

A centered random variable $X$ is sub-Gaussian with parameter $\sigma$ if for all $\lambda \in \mathbb{R}$:

$$\mathbb{E}[e^{\lambda X}] \leq e^{\sigma^2 \lambda^2 / 2}$$

The smallest such $\sigma$ is called the sub-Gaussian parameter (or sub-Gaussian proxy variance) of $X$.

This is the workhorse definition. It says the moment generating function of $X$ is dominated by that of a $\mathcal{N}(0, \sigma^2)$ random variable.

Definition

Sub-Gaussian Norm (Orlicz norm)

The sub-Gaussian norm (or $\psi_2$-norm) of $X$ is:

$$\|X\|_{\psi_2} = \inf\{t > 0 : \mathbb{E}[e^{X^2/t^2}] \leq 2\}$$

A random variable is sub-Gaussian if and only if $\|X\|_{\psi_2} < \infty$. The sub-Gaussian parameter $\sigma$ and the $\psi_2$-norm are related by $\sigma \asymp \|X\|_{\psi_2}$ (up to universal constants).

Definition

Tail Characterization

An equivalent characterization: $X$ is sub-Gaussian with parameter $\sigma$ if and only if for all $t > 0$:

$$\mathbb{P}(|X| \geq t) \leq 2 \exp\!\bigl(-\tfrac{t^2}{2\sigma^2}\bigr)$$

This is the form you will use most often in practice.

Equivalent Characterizations

The following conditions are equivalent (up to constants in the parameters):

  1. MGF condition: $\mathbb{E}[e^{\lambda X}] \leq e^{C_1^2 \lambda^2 / 2}$ for all $\lambda$
  2. Tail condition: $\mathbb{P}(|X| \geq t) \leq 2e^{-t^2/(2C_2^2)}$ for all $t > 0$
  3. Moment condition: $(\mathbb{E}[|X|^p])^{1/p} \leq C_3 \sqrt{p}$ for all $p \geq 1$
  4. Orlicz norm: $\|X\|_{\psi_2} \leq C_4$

The constants $C_1, C_2, C_3, C_4$ differ by at most universal multiplicative factors. This equivalence is what makes the sub-Gaussian class robust: you can verify membership via whichever characterization is most convenient.
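As an illustration (not part of the original text), the moment condition (3) can be spot-checked for a standard Gaussian using the closed-form absolute moments $\mathbb{E}|Z|^p = 2^{p/2}\,\Gamma((p+1)/2)/\sqrt{\pi}$; the constant $C_3 = 1$ happens to suffice here:

```python
import math

def gaussian_abs_moment(p):
    # E|Z|^p for Z ~ N(0,1): 2^{p/2} * Gamma((p+1)/2) / sqrt(pi)
    return 2 ** (p / 2) * math.gamma((p + 1) / 2) / math.sqrt(math.pi)

# moment condition (3) with C_3 = 1: (E|Z|^p)^{1/p} <= sqrt(p)
for p in range(1, 21):
    assert gaussian_abs_moment(p) ** (1 / p) <= math.sqrt(p)
```

This is a numerical sanity check, not a proof; the asymptotic ratio $(\mathbb{E}|Z|^p)^{1/p}/\sqrt{p}$ actually tends to $1/\sqrt{e} \approx 0.61$.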

Main Theorems

Theorem

Sub-Gaussian MGF Implies Tail Bound

Statement

If $X$ is a centered random variable satisfying $\mathbb{E}[e^{\lambda X}] \leq e^{\sigma^2 \lambda^2/2}$ for all $\lambda \in \mathbb{R}$, then for all $t > 0$:

$$\mathbb{P}(X \geq t) \leq \exp\!\bigl(-\tfrac{t^2}{2\sigma^2}\bigr)$$

Intuition

The MGF condition is a generating condition. It controls all moments simultaneously. The tail bound is a consequence obtained by the Chernoff method: exponentiate, apply Markov, and optimize over $\lambda$. The quadratic exponent in the MGF translates directly to a quadratic exponent in the tail.

Proof Sketch

For any $\lambda > 0$, by Markov's inequality:

$$\mathbb{P}(X \geq t) = \mathbb{P}(e^{\lambda X} \geq e^{\lambda t}) \leq e^{-\lambda t}\,\mathbb{E}[e^{\lambda X}] \leq e^{-\lambda t + \sigma^2 \lambda^2/2}$$

Minimize over $\lambda$: set $\lambda^* = t/\sigma^2$ to get $\mathbb{P}(X \geq t) \leq e^{-t^2/(2\sigma^2)}$.
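The optimization step can be sketched numerically (illustrative parameter values): minimizing $e^{-\lambda t + \sigma^2\lambda^2/2}$ over a grid of $\lambda$ recovers the closed form $e^{-t^2/(2\sigma^2)}$.

```python
import math

sigma, t = 1.0, 2.0

def chernoff_bound(lam):
    # Markov applied to e^{lam*X}: P(X >= t) <= exp(-lam*t + sigma^2 lam^2 / 2)
    return math.exp(-lam * t + sigma**2 * lam**2 / 2)

# minimize over a grid of lambda in (0, 10]
grid_min = min(chernoff_bound(k / 1000) for k in range(1, 10001))
closed_form = math.exp(-t**2 / (2 * sigma**2))  # attained at lam* = t / sigma^2
assert abs(grid_min - closed_form) < 1e-9
```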

Why It Matters

This is the Chernoff method: the single most important proof technique in concentration. Every time you see an exponential tail bound, this is the engine underneath. Mastering this one argument gives you access to all of sub-Gaussian concentration theory.

Failure Mode

The bound is only tight up to constants. For a true Gaussian, the bound is off by a polynomial factor in $t$ (the exact Gaussian tail has a $1/t$ prefactor). For non-asymptotic purposes this rarely matters.

Lemma

Hoeffding's Lemma

Statement

If $X$ is a centered random variable with $X \in [a, b]$ almost surely, then for all $\lambda \in \mathbb{R}$:

$$\mathbb{E}[e^{\lambda X}] \leq \exp\!\bigl(\tfrac{\lambda^2(b-a)^2}{8}\bigr)$$

That is, $X$ is sub-Gaussian with parameter $\sigma = (b-a)/2$.

Intuition

Bounded random variables cannot have heavy tails: they literally cannot take values beyond $[a,b]$. Hoeffding's lemma quantifies this: the MGF of any bounded centered variable is dominated by a Gaussian MGF. The factor of $1/8$ comes from a convexity argument applied to $e^{\lambda x}$ on the interval $[a,b]$.

Proof Sketch

Since $X \in [a, b]$, write $X = \alpha b + (1-\alpha) a$ for some random $\alpha \in [0,1]$. By convexity of $e^{\lambda x}$:

$$\mathbb{E}[e^{\lambda X}] \leq \mathbb{E}[\alpha e^{\lambda b} + (1-\alpha) e^{\lambda a}]$$

Use $\mathbb{E}[X] = 0$ to express $\mathbb{E}[\alpha]$ in terms of $a, b$: with $p = -a/(b-a)$ and $q = b/(b-a)$, the right-hand side becomes $q e^{\lambda a} + p e^{\lambda b}$. Then apply the inequality $\log(q e^{-pu} + p e^{qu}) \leq u^2/8$ for $p + q = 1$, $p, q \geq 0$ (which follows from a Taylor expansion, bounding the second derivative by $1/4$). Setting $u = \lambda(b-a)$ gives the result.
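The convexity inequality at the heart of the lemma, $\log(q e^{-pu} + p e^{qu}) \leq u^2/8$ for $p + q = 1$, can be spot-checked on a grid (a numerical sanity check, not a proof):

```python
import math

for i in range(1, 100):            # p ranges over (0, 1)
    p = i / 100
    q = 1 - p
    for j in range(-200, 201):     # u ranges over [-10, 10]
        u = j / 20
        lhs = math.log(q * math.exp(-p * u) + p * math.exp(q * u))
        assert lhs <= u * u / 8 + 1e-12   # tolerance for float rounding
```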

Why It Matters

Hoeffding's lemma is the bridge between "bounded" and "sub-Gaussian." It explains why bounded losses (e.g., 0-1 loss) always yield $O(1/\sqrt{n})$ concentration, which is exactly what appears in ERM generalization bounds for finite hypothesis classes.

Failure Mode

The bound treats all bounded distributions the same: a constant taking value $0$ and a uniform on $[-1, 1]$ get the same sub-Gaussian parameter. For distributions concentrated near $0$, the variance-based Bernstein inequality gives tighter bounds. Hoeffding's lemma is worst-case over the bounded class.

Canonical Examples

Example

Gaussian random variable

If $X \sim \mathcal{N}(0, \sigma^2)$, then $\mathbb{E}[e^{\lambda X}] = e^{\sigma^2 \lambda^2/2}$ exactly. Gaussians are sub-Gaussian with parameter exactly $\sigma$. This is the prototype: the definition is designed so that Gaussians saturate the bound.

Example

Rademacher random variable

Let $\varepsilon$ be a Rademacher variable: $\mathbb{P}(\varepsilon = +1) = \mathbb{P}(\varepsilon = -1) = 1/2$. Then $\varepsilon \in [-1, 1]$, so by Hoeffding's lemma, $\varepsilon$ is sub-Gaussian with parameter $\sigma = 1$. In fact, the exact MGF is $\mathbb{E}[e^{\lambda \varepsilon}] = \cosh(\lambda) \leq e^{\lambda^2/2}$, confirming sub-Gaussianity directly.

Rademacher variables appear everywhere in symmetrization arguments and Rademacher complexity.
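The MGF bound $\cosh(\lambda) \leq e^{\lambda^2/2}$ is easy to spot-check numerically (an illustrative check, not a proof; it holds termwise since $\lambda^{2k}/(2k)! \leq (\lambda^2/2)^k/k!$):

```python
import math

for j in range(-100, 101):
    lam = j / 10                    # lambda ranges over [-10, 10]
    assert math.cosh(lam) <= math.exp(lam**2 / 2)
```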

Example

Bounded random variable

If $X \in [a, b]$ almost surely with $\mathbb{E}[X] = 0$, then $X$ is sub-Gaussian with parameter $(b-a)/2$ by Hoeffding's lemma. This covers:

  • 0-1 loss: $\sigma = 1/2$
  • Any loss bounded in $[0, M]$ (after centering): $\sigma = M/2$
  • Features bounded in $[-B, B]$: $\sigma = B$

Example

Sum of independent sub-Gaussians

If $X_1, \ldots, X_n$ are independent, centered, sub-Gaussian with parameters $\sigma_1, \ldots, \sigma_n$, then $S = \sum_{i=1}^n X_i$ is sub-Gaussian with parameter $\sigma = \sqrt{\sigma_1^2 + \cdots + \sigma_n^2}$.

This follows because MGFs multiply under independence: $\mathbb{E}[e^{\lambda S}] = \prod_i \mathbb{E}[e^{\lambda X_i}] \leq \prod_i e^{\sigma_i^2 \lambda^2/2} = e^{\lambda^2 \sum_i \sigma_i^2/2}$.

For the sample mean $\bar{X} = S/n$ with equal $\sigma_i = \sigma$: the sub-Gaussian parameter is $\sigma/\sqrt{n}$, giving the familiar $O(1/\sqrt{n})$ concentration.
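A small Monte Carlo sketch of this concentration (illustrative parameters and seed): for Rademacher means, $\sigma = 1$, so the mean has sub-Gaussian parameter $1/\sqrt{n}$ and the empirical deviation frequency should sit below the tail bound $2e^{-nt^2/2}$.

```python
import math
import random

random.seed(0)
n, t, trials = 200, 0.2, 10000
hits = 0
for _ in range(trials):
    mean = sum(random.choice((-1, 1)) for _ in range(n)) / n
    if abs(mean) >= t:
        hits += 1
empirical = hits / trials
bound = 2 * math.exp(-n * t**2 / 2)   # two-sided sub-Gaussian tail, parameter 1/sqrt(n)
assert empirical <= bound
```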

Common Confusions

Watch Out

Sub-Gaussian parameter is NOT the standard deviation

The sub-Gaussian parameter $\sigma$ is an upper bound on the effective spread, but it is not equal to $\mathrm{Var}(X)^{1/2}$ in general. For a variable supported on $[-1, 1]$ with variance $0.01$, Hoeffding gives $\sigma = 1$, far larger than $\mathrm{sd}(X) = 0.1$. This is why Bernstein inequalities (which use variance information) are sometimes much tighter than Hoeffding-type bounds.
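A numerical illustration of the gap (the parameter values are hypothetical): for data bounded in $[-1,1]$ with true variance $0.01$, one standard form of Bernstein's bound, $2\exp\bigl(-nt^2/(2(\mathrm{Var} + bt/3))\bigr)$, is far smaller than the Hoeffding-type bound $2e^{-nt^2/2}$.

```python
import math

n, t = 1000, 0.05
b, var = 1.0, 0.01   # support radius and (assumed) true variance

hoeffding = 2 * math.exp(-n * t**2 / 2)                      # uses only boundedness (sigma = 1)
bernstein = 2 * math.exp(-n * t**2 / (2 * (var + b * t / 3)))  # uses the variance
assert bernstein < hoeffding
```

Here `hoeffding` is about $0.57$ while `bernstein` is astronomically small: the variance information is doing all the work.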

Watch Out

Not all 'nice' distributions are sub-Gaussian

Exponential, Poisson, and chi-squared random variables are not sub-Gaussian. Their tails decay like $e^{-ct}$ rather than $e^{-ct^2}$; these are sub-exponential. The distinction matters: sub-exponential variables match sub-Gaussian concentration only in the small-deviation regime, their large deviations decay only like $e^{-ct}$, and the Chernoff argument requires separate treatment of the small-$\lambda$ and large-$\lambda$ regimes.

Watch Out

The centering assumption is essential

The MGF condition $\mathbb{E}[e^{\lambda X}] \leq e^{\sigma^2\lambda^2/2}$ implicitly requires $\mathbb{E}[X] = 0$. If $\mathbb{E}[X] = \mu \neq 0$, you apply the definition to $X - \mu$, not to $X$ directly. Failing to center will give you wrong tail bounds.

Why Sub-Gaussian is the Right Abstraction

Three reasons sub-Gaussianity appears everywhere in ML theory:

  1. Closure under summation: sums of independent sub-Gaussians are sub-Gaussian. This is exactly what you need for sample averages.

  2. Captures all bounded losses: every bounded loss function produces sub-Gaussian empirical averages. Since most learning theory starts with bounded losses, sub-Gaussianity is the natural first assumption.

  3. Tight enough for optimal rates: sub-Gaussian concentration gives $O(1/\sqrt{n})$ deviation bounds for sample means, which matches the CLT rate. You cannot do better in general.

The sub-Gaussian class is the largest class of distributions for which the Chernoff method gives Gaussian-quality tail bounds. Going beyond sub-Gaussian (to sub-exponential, or heavier tails) requires different tools and yields weaker concentration.

The Orlicz Norm Hierarchy

The $\psi_2$ norm makes the set of sub-Gaussian random variables a Banach space $L_{\psi_2}$, sitting in a strict inclusion chain:

$$L^\infty \subset L_{\psi_2} \subset L_{\psi_1} \subset L^p \quad \text{for all } p < \infty$$

Bounded $\subset$ Sub-Gaussian $\subset$ Sub-exponential $\subset$ All-moments-finite. Each inclusion is strict: a standard Gaussian is sub-Gaussian but not bounded; an $\text{Exp}(1)$ variable is sub-exponential but not sub-Gaussian.

The $\psi_1$ norm is defined analogously with $\psi_1(x) = e^x - 1$: $\|X\|_{\psi_1} = \inf\{t > 0 : \mathbb{E}[e^{|X|/t}] \leq 2\}$.

Exact ψ2\psi_2 Norms for Common Distributions

| Distribution | $\|X\|_{\psi_2}$ | Sub-Gaussian $\sigma$ | Key step |
| --- | --- | --- | --- |
| Rademacher $\varepsilon = \pm 1$ | $1/\sqrt{\ln 2} \approx 1.20$ | $1$ | $\varepsilon^2 = 1$ always, so solve $e^{1/t^2} = 2$ |
| Gaussian $Z \sim \mathcal{N}(0,1)$ | $\sqrt{8/3} \approx 1.63$ | $1$ | $\mathbb{E}[e^{Z^2/t^2}] = (1 - 2/t^2)^{-1/2}$, solve $= 2$ |
| Bounded $X \in [a,b]$, centered | $C(b-a)$ | $(b-a)/2$ | Hoeffding's lemma gives $\sigma = (b-a)/2$ |
| Gaussian $Z \sim \mathcal{N}(0, \sigma^2)$ | $C\sigma$ | $\sigma$ | Homogeneity: $\|\sigma Z\|_{\psi_2} = \sigma \|Z\|_{\psi_2}$ |

Closure Properties

Sub-Gaussianity is useful precisely because it is preserved under the operations that appear in proofs.

Theorem

Closure Under Independent Sums

Statement

Let $X_1, \ldots, X_n$ be independent, centered sub-Gaussian random variables with $\|X_i\|_{\psi_2} \leq K$. Then for any $a = (a_1, \ldots, a_n) \in \mathbb{R}^n$:

$$\left\|\sum_i a_i X_i\right\|_{\psi_2} \leq CK \|a\|_2$$

where $C$ is a universal constant.

Intuition

By independence, MGFs multiply: $\mathbb{E}[e^{\lambda \sum_i a_i X_i}] = \prod_i \mathbb{E}[e^{\lambda a_i X_i}] \leq \prod_i e^{C^2 K^2 \lambda^2 a_i^2 / 2} = e^{C^2 K^2 \lambda^2 \|a\|_2^2 / 2}$. The $\ell_2$ norm of $a$ controls the sub-Gaussian parameter of the sum.

Failure Mode

Without independence, the bound fails. If $X_1 = X_2 = \cdots = X_n = X$, then $\sum_i X_i = nX$ with $\|nX\|_{\psi_2} = n\|X\|_{\psi_2}$, but the theorem would predict $CK\sqrt{n}$. The gap between $\sqrt{n}$ and $n$ is exactly the cost of perfect correlation.

Other closure properties:

  • Scaling: $\|aX\|_{\psi_2} = |a| \cdot \|X\|_{\psi_2}$ (immediate from the definition).
  • Lipschitz maps: if $f$ is $L$-Lipschitz with $f(0) = 0$, then $\|f(X)\|_{\psi_2} \leq CL \cdot \|X\|_{\psi_2}$.
  • Products break sub-Gaussianity: if $X, Y$ are independent sub-Gaussian, then $XY$ is sub-exponential, not sub-Gaussian. In particular, $X^2$ is sub-exponential whenever $X$ is sub-Gaussian. This is exactly why Bernstein's inequality has two regimes.

Sub-Gaussian Maxima

Theorem

Maximum of Sub-Gaussians

Statement

Let $X_1, \ldots, X_n$ be (not necessarily independent) sub-Gaussian random variables with $\|X_i\|_{\psi_2} \leq K$. Then:

$$\mathbb{E}\!\left[\max_{i \leq n} X_i\right] \leq CK\sqrt{\log n}$$

Moreover, for any $t \geq 0$:

$$\mathbb{P}\!\left(\max_{i \leq n} X_i \geq CK\sqrt{\log n} + t\right) \leq 2e^{-t^2/(C^2 K^2)}$$

Intuition

For any $\lambda > 0$: $\mathbb{E}[\max_i X_i] \leq \frac{1}{\lambda}\log\sum_i \mathbb{E}[e^{\lambda X_i}] \leq \frac{1}{\lambda}\bigl(\log n + C^2 K^2 \lambda^2/2\bigr)$. Setting $\lambda = \sqrt{2\log n}/(CK)$ gives $CK\sqrt{2\log n}$.
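A Monte Carlo sketch (illustrative parameters and seed): for i.i.d. standard Gaussians the argument above yields $\mathbb{E}[\max_{i \leq n} Z_i] \leq \sqrt{2\log n}$, which simulation comfortably confirms.

```python
import math
import random

random.seed(1)
n, trials = 1000, 1000
total = 0.0
for _ in range(trials):
    total += max(random.gauss(0, 1) for _ in range(n))
avg_max = total / trials                  # estimate of E[max of n standard Gaussians]
assert avg_max <= math.sqrt(2 * math.log(n))
```

For $n = 1000$ the bound is $\sqrt{2\log 1000} \approx 3.72$, while the true expected maximum is around $3.2$: the bound is tight up to lower-order corrections.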

Why It Matters

This is where the $\log|H|$ term in PAC bounds comes from. Bounding $\sup_{h \in H}|R(h) - \hat{R}(h)|$ over a finite hypothesis class with $|H| = n$ is bounding the maximum of $n$ sub-Gaussian variables. The $\sqrt{\log n}$ overhead is the price of uniformity.

Failure Mode

For sub-exponential variables, the maximum grows as $\log n$ (not $\sqrt{\log n}$): the heavier tails make the worst case worse. For heavy-tailed variables (polynomial tails), the maximum can grow polynomially in $n$.

The Sub-Exponential Bridge

The relationship between sub-Gaussian and sub-exponential is precise:

  • If $X$ is sub-Gaussian, then $X^2$ is sub-exponential with $\|X^2\|_{\psi_1} \leq C\|X\|_{\psi_2}^2$.
  • Conversely, if $X^2$ is sub-exponential, then $X$ is sub-Gaussian.

This explains why Bernstein's inequality for a sum of centered sub-exponential variables $\sum_i X_i$ has two regimes:

$$\mathbb{P}\!\left(\left|\sum_i X_i\right| \geq t\right) \leq 2\exp\!\left(-c \cdot \min\!\left(\frac{t^2}{\sum_i \|X_i\|_{\psi_1}^2},\; \frac{t}{\max_i \|X_i\|_{\psi_1}}\right)\right)$$

For small $t$, the $t^2$ term is the smaller of the two (sub-Gaussian regime). For large $t$, the $t$ term takes over (sub-exponential regime, heavier tail). The crossover happens at $t \approx \sum_i \|X_i\|_{\psi_1}^2 / \max_i \|X_i\|_{\psi_1}$.
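The crossover can be made concrete with identical $\psi_1$ norms (a toy computation with assumed values $n = 100$, $K = 1$): then $\sum_i \|X_i\|_{\psi_1}^2 = nK^2$, $\max_i \|X_i\|_{\psi_1} = K$, and the two exponent terms meet at $t = nK$.

```python
# Toy computation: n variables with identical psi_1 norm K (assumed values).
n, K = 100, 1.0
sum_sq = n * K**2          # sum_i ||X_i||_{psi1}^2
max_norm = K               # max_i ||X_i||_{psi1}

def quad_term(t):          # sub-Gaussian part of the Bernstein exponent
    return t**2 / sum_sq

def lin_term(t):           # sub-exponential part of the Bernstein exponent
    return t / max_norm

crossover = sum_sq / max_norm   # = n*K here
assert quad_term(crossover) == lin_term(crossover)
# below the crossover, the quadratic term is the smaller one (sub-Gaussian regime)
assert quad_term(crossover / 2) < lin_term(crossover / 2)
```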

Proof Checklist

When writing or reading a sub-Gaussian argument:

  1. Identify the random variable. What quantity do you want to concentrate? Write it as $Z = g(X_1, \ldots, X_n)$.
  2. Center it. Work with $Z - \mathbb{E}[Z]$.
  3. Decompose if needed. Express $Z - \mathbb{E}[Z]$ as a sum or martingale difference sequence.
  4. Check sub-Gaussianity of each piece. Bounded? Hoeffding gives $\sigma_i$. Lipschitz function of a Gaussian? Gaussian concentration applies.
  5. Propagate. Independent sum: $\sigma^2 = \sum_i \sigma_i^2$. Martingale: Azuma-Hoeffding. Linear combination: $\|a\|_2$ scaling.
  6. Apply the tail bound. $\mathbb{P}(|Z - \mathbb{E}[Z]| \geq t) \leq 2e^{-t^2/(2\sigma^2)}$. Invert to get confidence intervals.
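The checklist can be sketched end-to-end for the simplest case, bounded i.i.d. data, by inverting the Hoeffding tail $2e^{-2nt^2} = \delta$ to get the half-width $\sqrt{\log(2/\delta)/(2n)}$ (the function name and data here are illustrative, not from the original text):

```python
import math
import random

def hoeffding_interval(samples, delta):
    """Two-sided Hoeffding confidence interval for the mean of [0,1]-valued data.

    Inverts P(|mean - mu| >= t) <= 2*exp(-2*n*t^2) = delta.
    """
    n = len(samples)
    mean = sum(samples) / n
    half_width = math.sqrt(math.log(2 / delta) / (2 * n))
    return mean - half_width, mean + half_width

random.seed(2)
data = [random.random() for _ in range(4000)]   # Uniform[0,1], true mean 0.5
lo, hi = hoeffding_interval(data, delta=0.05)
assert lo <= 0.5 <= hi
```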

Exercises

Exercise (Core)

Problem

Let $X$ be a Rademacher random variable ($\pm 1$ with equal probability). Verify directly that $\mathbb{E}[e^{\lambda X}] = \cosh(\lambda) \leq e^{\lambda^2/2}$ for all $\lambda$.

Exercise (Core)

Problem

Let $X_1, \ldots, X_n$ be i.i.d. with $X_i \in [0, 1]$ and $\mathbb{E}[X_i] = \mu$. Using Hoeffding's lemma and the Chernoff method, prove that:

$$\mathbb{P}\!\bigl(\bar{X}_n - \mu \geq t\bigr) \leq e^{-2nt^2}$$

Exercise (Advanced)

Problem

Show that the exponential distribution $X \sim \text{Exp}(1)$ is not sub-Gaussian. Specifically, show that $\mathbb{E}[e^{\lambda(X-1)}]$ cannot be bounded by $e^{C\lambda^2}$ for any constant $C$ when $\lambda$ is large.

Exercise (Advanced)

Problem

Compute the exact $\psi_2$ norm of a Rademacher variable $\varepsilon = \pm 1$ with equal probability. Then verify that the MGF bound gives $\sigma = 1$ as the sub-Gaussian parameter.

Exercise (Advanced)

Problem

Let $X_1, \ldots, X_n$ be independent sub-Gaussian with $\|X_i\|_{\psi_2} \leq K$. Show that $\mathbb{E}[\max_{i \leq n} X_i] \leq CK\sqrt{\log n}$ using the argument: exponentiate, bound by sum, apply the MGF bound, optimize $\lambda$.


References

  1. Vershynin, High-Dimensional Probability (2018), Chapters 2-3. Definitive modern treatment using the Orlicz norm framework. Proposition 2.5.2 for the equivalence theorem.
  2. Boucheron, Lugosi, Massart, Concentration Inequalities (2013), Chapter 2. Uses the entropy method rather than Orlicz norms.
  3. Wainwright, High-Dimensional Statistics (2019), Chapter 2. Clear exposition connecting sub-Gaussian theory to statistical applications.
  4. van Handel, Probability in High Dimension (2016), Chapters 1-2. Excellent treatment of sub-Gaussian maxima and chaining.
  5. Shalev-Shwartz and Ben-David, Understanding Machine Learning (2014), Chapters 26-28. How sub-Gaussian bounds enter learning theory.
  6. Rigollet and Hutter, High-Dimensional Statistics (MIT lecture notes, 2023), Chapter 1. Concise derivation of all five characterizations with explicit constants.


Last reviewed: April 2026
