
Concentration Probability

Sub-Gaussian Random Variables

Sub-Gaussian random variables: the precise characterization of 'light-tailed' behavior that underpins every concentration inequality in learning theory.


Why This Matters

Concentration inequalities are the engine of learning theory. Every generalization bound you will encounter, from VC dimension to Rademacher complexity to PAC-Bayes, relies on controlling how far a sample average deviates from its expectation. Sub-Gaussian random variables are the precise class for which this control is as tight as if the variable were Gaussian, even when it is not.

If you understand sub-Gaussianity, you understand why concentration works. If you do not, every bound you see will feel like magic.

[Figure: tail probability $\mathbb{P}(X \geq t)$ on a log scale versus deviation $t$, comparing the Gaussian tail $e^{-t^2/2}$, a sub-Gaussian bound, a sub-exponential tail, and a heavy tail decaying like $1/t^2$.]

Mental Model

A random variable is sub-Gaussian if its tails decay at least as fast as those of a Gaussian. Concretely: the probability of being far from the mean drops like $e^{-ct^2}$, not merely $e^{-ct}$ (sub-exponential) or $1/t^p$ (polynomial). This exponential-squared decay is what gives you the $\sqrt{\log(1/\delta)/n}$ rates in learning theory.

The key insight: many non-Gaussian distributions (bounded random variables, Rademacher random variables, projections of high-dimensional uniform distributions) all satisfy this same tail behavior. Sub-Gaussianity captures exactly what these distributions have in common.

Formal Setup and Notation

Let $X$ be a real-valued random variable with $\mathbb{E}[X] = 0$ (we center without loss of generality).

Definition

Sub-Gaussian Random Variable (MGF characterization)

A centered random variable $X$ is sub-Gaussian with parameter $\sigma$ if for all $\lambda \in \mathbb{R}$:

$$\mathbb{E}[e^{\lambda X}] \leq e^{\sigma^2 \lambda^2 / 2}$$

The smallest such $\sigma$ is called the sub-Gaussian parameter (or sub-Gaussian proxy variance) of $X$.

This is the workhorse definition. It says the moment generating function of $X$ is dominated by that of a $\mathcal{N}(0, \sigma^2)$ random variable.

Definition

Sub-Gaussian Norm (Orlicz norm)

The sub-Gaussian norm (or $\psi_2$-norm) of $X$ is:

$$\|X\|_{\psi_2} = \inf\{t > 0 : \mathbb{E}[e^{X^2/t^2}] \leq 2\}$$

A random variable is sub-Gaussian if and only if $\|X\|_{\psi_2} < \infty$. The sub-Gaussian parameter $\sigma$ and the $\psi_2$-norm are related by $\sigma \asymp \|X\|_{\psi_2}$ (up to universal constants).

Definition

Tail Characterization

An equivalent characterization: $X$ is sub-Gaussian with parameter $\sigma$ if and only if for all $t > 0$:

$$\mathbb{P}(|X| \geq t) \leq 2 \exp\!\bigl(-\tfrac{t^2}{2\sigma^2}\bigr)$$

This is the form you will use most often in practice.

Equivalent Characterizations

The following conditions are equivalent (up to constants in the parameters):

  1. MGF condition: $\mathbb{E}[e^{\lambda X}] \leq e^{C_1^2 \lambda^2 / 2}$ for all $\lambda$
  2. Tail condition: $\mathbb{P}(|X| \geq t) \leq 2e^{-t^2/(2C_2^2)}$ for all $t > 0$
  3. Moment condition: $(\mathbb{E}[|X|^p])^{1/p} \leq C_3 \sqrt{p}$ for all $p \geq 1$
  4. Orlicz norm: $\|X\|_{\psi_2} \leq C_4$

The constants $C_1, C_2, C_3, C_4$ differ by at most universal multiplicative factors. This equivalence is what makes the sub-Gaussian class robust: you can verify membership via whichever characterization is most convenient.
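As an illustration (not part of the original text), the moment condition (3) can be spot-checked for a standard Gaussian using the closed-form absolute moments $\mathbb{E}|Z|^p = 2^{p/2}\,\Gamma((p+1)/2)/\sqrt{\pi}$; the constant $C_3 = 1$ happens to suffice here:

```python
import math

def gaussian_abs_moment(p):
    # E|Z|^p for Z ~ N(0,1): 2^{p/2} * Gamma((p+1)/2) / sqrt(pi)
    return 2 ** (p / 2) * math.gamma((p + 1) / 2) / math.sqrt(math.pi)

# moment condition (3) with C_3 = 1: (E|Z|^p)^{1/p} <= sqrt(p)
for p in range(1, 21):
    assert gaussian_abs_moment(p) ** (1 / p) <= math.sqrt(p)
```

This is a numerical sanity check, not a proof; the asymptotic ratio $(\mathbb{E}|Z|^p)^{1/p}/\sqrt{p}$ actually tends to $1/\sqrt{e} \approx 0.61$.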

Main Theorems

Theorem

Sub-Gaussian MGF Implies Tail Bound

Statement

If $X$ is a centered random variable satisfying $\mathbb{E}[e^{\lambda X}] \leq e^{\sigma^2 \lambda^2/2}$ for all $\lambda \in \mathbb{R}$, then for all $t > 0$:

$$\mathbb{P}(X \geq t) \leq \exp\!\bigl(-\tfrac{t^2}{2\sigma^2}\bigr)$$

Intuition

The MGF condition is a generating condition. It controls all moments simultaneously. The tail bound is a consequence obtained by the Chernoff method: exponentiate, apply Markov, and optimize over $\lambda$. The quadratic exponent in the MGF translates directly to a quadratic exponent in the tail.

Proof Sketch

For any $\lambda > 0$, by Markov's inequality:

$$\mathbb{P}(X \geq t) = \mathbb{P}(e^{\lambda X} \geq e^{\lambda t}) \leq e^{-\lambda t}\,\mathbb{E}[e^{\lambda X}] \leq e^{-\lambda t + \sigma^2 \lambda^2/2}$$

Minimize over $\lambda$: set $\lambda^* = t/\sigma^2$ to get $\mathbb{P}(X \geq t) \leq e^{-t^2/(2\sigma^2)}$.
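The optimization step can be sketched numerically (illustrative parameter values): minimizing $e^{-\lambda t + \sigma^2\lambda^2/2}$ over a grid of $\lambda$ recovers the closed form $e^{-t^2/(2\sigma^2)}$.

```python
import math

sigma, t = 1.0, 2.0

def chernoff_bound(lam):
    # Markov applied to e^{lam*X}: P(X >= t) <= exp(-lam*t + sigma^2 lam^2 / 2)
    return math.exp(-lam * t + sigma**2 * lam**2 / 2)

# minimize over a grid of lambda in (0, 10]
grid_min = min(chernoff_bound(k / 1000) for k in range(1, 10001))
closed_form = math.exp(-t**2 / (2 * sigma**2))  # attained at lam* = t / sigma^2
assert abs(grid_min - closed_form) < 1e-9
```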

Why It Matters

This is the Chernoff method: the single most important proof technique in concentration. Every time you see an exponential tail bound, this is the engine underneath. Mastering this one argument gives you access to all of sub-Gaussian concentration theory.

Failure Mode

The bound is only tight up to constants. For a true Gaussian, the bound is off by a polynomial factor in $t$ (the exact Gaussian tail has a $1/t$ prefactor). For non-asymptotic purposes this rarely matters.

Lemma

Hoeffding's Lemma

Statement

If $X$ is a centered random variable with $X \in [a, b]$ almost surely, then for all $\lambda \in \mathbb{R}$:

$$\mathbb{E}[e^{\lambda X}] \leq \exp\!\bigl(\tfrac{\lambda^2(b-a)^2}{8}\bigr)$$

That is, $X$ is sub-Gaussian with parameter $\sigma = (b-a)/2$.

Intuition

Bounded random variables cannot have heavy tails: they literally cannot take values beyond $[a,b]$. Hoeffding's lemma quantifies this: the MGF of any bounded centered variable is dominated by a Gaussian MGF. The factor of $1/8$ comes from a convexity argument applied to $e^{\lambda x}$ on the interval $[a,b]$.

Proof Sketch

Since $X \in [a, b]$, write $X = \alpha b + (1-\alpha) a$ for some random $\alpha \in [0,1]$. By convexity of $e^{\lambda x}$:

$$\mathbb{E}[e^{\lambda X}] \leq \mathbb{E}[\alpha e^{\lambda b} + (1-\alpha) e^{\lambda a}]$$

Use $\mathbb{E}[X] = 0$ to express $\mathbb{E}[\alpha]$ in terms of $a, b$: with $p = -a/(b-a)$ and $q = b/(b-a)$, the right-hand side becomes $q e^{\lambda a} + p e^{\lambda b}$. Then apply the inequality $\log(q e^{-pu} + p e^{qu}) \leq u^2/8$ for $p + q = 1$, $p, q \geq 0$ (which follows from a Taylor expansion, bounding the second derivative by $1/4$). Setting $u = \lambda(b-a)$ gives the result.
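The convexity inequality at the heart of the lemma, $\log(q e^{-pu} + p e^{qu}) \leq u^2/8$ for $p + q = 1$, can be spot-checked on a grid (a numerical sanity check, not a proof):

```python
import math

for i in range(1, 100):            # p ranges over (0, 1)
    p = i / 100
    q = 1 - p
    for j in range(-200, 201):     # u ranges over [-10, 10]
        u = j / 20
        lhs = math.log(q * math.exp(-p * u) + p * math.exp(q * u))
        assert lhs <= u * u / 8 + 1e-12   # tolerance for float rounding
```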

Why It Matters

Hoeffding's lemma is the bridge between "bounded" and "sub-Gaussian." It explains why bounded losses (e.g., 0-1 loss) always yield $O(1/\sqrt{n})$ concentration, which is exactly what appears in ERM generalization bounds for finite hypothesis classes.

Failure Mode

The bound treats all bounded distributions the same: a constant taking value $0$ and a uniform on $[-1, 1]$ get the same sub-Gaussian parameter. For distributions concentrated near $0$, the variance-based Bernstein inequality gives tighter bounds. Hoeffding's lemma is worst-case over the bounded class.

Canonical Examples

Example

Gaussian random variable

If $X \sim \mathcal{N}(0, \sigma^2)$, then $\mathbb{E}[e^{\lambda X}] = e^{\sigma^2 \lambda^2/2}$ exactly. Gaussians are sub-Gaussian with parameter exactly $\sigma$. This is the prototype: the definition is designed so that Gaussians saturate the bound.

Example

Rademacher random variable

Let $\varepsilon$ be a Rademacher variable: $\mathbb{P}(\varepsilon = +1) = \mathbb{P}(\varepsilon = -1) = 1/2$. Then $\varepsilon \in [-1, 1]$, so by Hoeffding's lemma, $\varepsilon$ is sub-Gaussian with parameter $\sigma = 1$. In fact, the exact MGF is $\mathbb{E}[e^{\lambda \varepsilon}] = \cosh(\lambda) \leq e^{\lambda^2/2}$, confirming sub-Gaussianity directly.

Rademacher variables appear everywhere in symmetrization arguments and Rademacher complexity.
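The MGF bound $\cosh(\lambda) \leq e^{\lambda^2/2}$ is easy to spot-check numerically (an illustrative check, not a proof; it holds termwise since $\lambda^{2k}/(2k)! \leq (\lambda^2/2)^k/k!$):

```python
import math

for j in range(-100, 101):
    lam = j / 10                    # lambda ranges over [-10, 10]
    assert math.cosh(lam) <= math.exp(lam**2 / 2)
```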

Example

Bounded random variable

If $X \in [a, b]$ almost surely with $\mathbb{E}[X] = 0$, then $X$ is sub-Gaussian with parameter $(b-a)/2$ by Hoeffding's lemma. This covers:

  • 0-1 loss: $\sigma = 1/2$
  • Any loss bounded in $[0, M]$ (after centering): $\sigma = M/2$
  • Features bounded in $[-B, B]$: $\sigma = B$

Example

Sum of independent sub-Gaussians

If $X_1, \ldots, X_n$ are independent, centered, sub-Gaussian with parameters $\sigma_1, \ldots, \sigma_n$, then $S = \sum_{i=1}^n X_i$ is sub-Gaussian with parameter $\sigma = \sqrt{\sigma_1^2 + \cdots + \sigma_n^2}$.

This follows because MGFs multiply under independence: $\mathbb{E}[e^{\lambda S}] = \prod_i \mathbb{E}[e^{\lambda X_i}] \leq \prod_i e^{\sigma_i^2 \lambda^2/2} = e^{\lambda^2 \sum_i \sigma_i^2/2}$.

For the sample mean $\bar{X} = S/n$ with equal $\sigma_i = \sigma$: the sub-Gaussian parameter is $\sigma/\sqrt{n}$, giving the familiar $O(1/\sqrt{n})$ concentration.
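A small Monte Carlo sketch of this concentration (illustrative parameters and seed): for Rademacher means, $\sigma = 1$, so the mean has sub-Gaussian parameter $1/\sqrt{n}$ and the empirical deviation frequency should sit below the tail bound $2e^{-nt^2/2}$.

```python
import math
import random

random.seed(0)
n, t, trials = 200, 0.2, 10000
hits = 0
for _ in range(trials):
    mean = sum(random.choice((-1, 1)) for _ in range(n)) / n
    if abs(mean) >= t:
        hits += 1
empirical = hits / trials
bound = 2 * math.exp(-n * t**2 / 2)   # two-sided sub-Gaussian tail, parameter 1/sqrt(n)
assert empirical <= bound
```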

Common Confusions

Watch Out

Sub-Gaussian parameter is NOT the standard deviation

The sub-Gaussian parameter $\sigma$ is an upper bound on the effective spread, but it is not equal to $\mathrm{Var}(X)^{1/2}$ in general. For a variable supported on $[-1, 1]$ with variance $0.01$, Hoeffding gives $\sigma = 1$, far larger than $\mathrm{sd}(X) = 0.1$. This is why Bernstein inequalities (which use variance information) are sometimes much tighter than Hoeffding-type bounds.
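A numerical illustration of the gap (the parameter values are hypothetical): for data bounded in $[-1,1]$ with true variance $0.01$, one standard form of Bernstein's bound, $2\exp\bigl(-nt^2/(2(\mathrm{Var} + bt/3))\bigr)$, is far smaller than the Hoeffding-type bound $2e^{-nt^2/2}$.

```python
import math

n, t = 1000, 0.05
b, var = 1.0, 0.01   # support radius and (assumed) true variance

hoeffding = 2 * math.exp(-n * t**2 / 2)                      # uses only boundedness (sigma = 1)
bernstein = 2 * math.exp(-n * t**2 / (2 * (var + b * t / 3)))  # uses the variance
assert bernstein < hoeffding
```

Here `hoeffding` is about $0.57$ while `bernstein` is astronomically small: the variance information is doing all the work.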

Watch Out

Not all 'nice' distributions are sub-Gaussian

Exponential, Poisson, and chi-squared random variables are not sub-Gaussian. Their tails decay like $e^{-ct}$ rather than $e^{-ct^2}$; these are sub-exponential. The distinction matters: sub-exponential variables match sub-Gaussian concentration only in the small-deviation regime, their large deviations decay only like $e^{-ct}$, and the Chernoff argument requires separate treatment of the small-$\lambda$ and large-$\lambda$ regimes.

Watch Out

The centering assumption is essential

The MGF condition $\mathbb{E}[e^{\lambda X}] \leq e^{\sigma^2\lambda^2/2}$ implicitly requires $\mathbb{E}[X] = 0$. If $\mathbb{E}[X] = \mu \neq 0$, you apply the definition to $X - \mu$, not to $X$ directly. Failing to center will give you wrong tail bounds.

Why Sub-Gaussian is the Right Abstraction

Three reasons sub-Gaussianity appears everywhere in ML theory:

  1. Closure under summation: sums of independent sub-Gaussians are sub-Gaussian. This is exactly what you need for sample averages.

  2. Captures all bounded losses: every bounded loss function produces sub-Gaussian empirical averages. Since most learning theory starts with bounded losses, sub-Gaussianity is the natural first assumption.

  3. Tight enough for optimal rates: sub-Gaussian concentration gives $O(1/\sqrt{n})$ deviation bounds for sample means, which matches the CLT rate. You cannot do better in general.

The sub-Gaussian class is the largest class of distributions for which the Chernoff method gives Gaussian-quality tail bounds. Going beyond sub-Gaussian (to sub-exponential, or heavier tails) requires different tools and yields weaker concentration.

The Orlicz Norm Hierarchy

The $\psi_2$ norm makes the set of sub-Gaussian random variables a Banach space $L_{\psi_2}$, sitting in a strict inclusion chain:

$$L^\infty \subset L_{\psi_2} \subset L_{\psi_1} \subset L^p \quad \text{for all } p < \infty$$

Bounded $\subset$ Sub-Gaussian $\subset$ Sub-exponential $\subset$ All-moments-finite. Each inclusion is strict: a standard Gaussian is sub-Gaussian but not bounded; an $\text{Exp}(1)$ variable is sub-exponential but not sub-Gaussian.

The $\psi_1$ norm is defined analogously with $\psi_1(x) = e^x - 1$: $\|X\|_{\psi_1} = \inf\{t > 0 : \mathbb{E}[e^{|X|/t}] \leq 2\}$.

Exact ψ2\psi_2 Norms for Common Distributions

| Distribution | $\|X\|_{\psi_2}$ | Sub-Gaussian $\sigma$ | Key step |
| --- | --- | --- | --- |
| Rademacher $\varepsilon = \pm 1$ | $1/\sqrt{\ln 2} \approx 1.20$ | $1$ | $\varepsilon^2 = 1$ always, so solve $e^{1/t^2} = 2$ |
| Gaussian $Z \sim \mathcal{N}(0,1)$ | $\sqrt{8/3} \approx 1.63$ | $1$ | $\mathbb{E}[e^{Z^2/t^2}] = (1 - 2/t^2)^{-1/2}$, solve $= 2$ |
| Bounded $X \in [a,b]$, centered | $C(b-a)$ | $(b-a)/2$ | Hoeffding's lemma gives $\sigma = (b-a)/2$ |
| Gaussian $Z \sim \mathcal{N}(0, \sigma^2)$ | $C\sigma$ | $\sigma$ | Homogeneity: $\|\sigma Z\|_{\psi_2} = \sigma \|Z\|_{\psi_2}$ |

Closure Properties

Sub-Gaussianity is useful precisely because it is preserved under the operations that appear in proofs.

Theorem

Closure Under Independent Sums

Statement

Let $X_1, \ldots, X_n$ be independent, centered sub-Gaussian random variables with $\|X_i\|_{\psi_2} \leq K$. Then for any $a = (a_1, \ldots, a_n) \in \mathbb{R}^n$:

$$\left\|\sum_i a_i X_i\right\|_{\psi_2} \leq CK \|a\|_2$$

where $C$ is a universal constant.

Intuition

By independence, MGFs multiply: $\mathbb{E}[e^{\lambda \sum_i a_i X_i}] = \prod_i \mathbb{E}[e^{\lambda a_i X_i}] \leq \prod_i e^{C^2 K^2 \lambda^2 a_i^2 / 2} = e^{C^2 K^2 \lambda^2 \|a\|_2^2 / 2}$. The $\ell_2$ norm of $a$ controls the sub-Gaussian parameter of the sum.

Failure Mode

Without independence, the bound fails. If $X_1 = X_2 = \cdots = X_n = X$, then $\sum_i X_i = nX$ with $\|nX\|_{\psi_2} = n\|X\|_{\psi_2}$, but the theorem would predict $CK\sqrt{n}$. The gap between $\sqrt{n}$ and $n$ is exactly the cost of perfect correlation.

Other closure properties:

  • Scaling: $\|aX\|_{\psi_2} = |a| \cdot \|X\|_{\psi_2}$ (immediate from the definition).
  • Lipschitz maps: if $f$ is $L$-Lipschitz with $f(0) = 0$, then $\|f(X)\|_{\psi_2} \leq CL \cdot \|X\|_{\psi_2}$.
  • Products break sub-Gaussianity: if $X, Y$ are independent sub-Gaussian, then $XY$ is sub-exponential, not sub-Gaussian. In particular, $X^2$ is sub-exponential whenever $X$ is sub-Gaussian. This is exactly why Bernstein's inequality has two regimes.

Sub-Gaussian Maxima

Theorem

Maximum of Sub-Gaussians

Statement

Let $X_1, \ldots, X_n$ be (not necessarily independent) sub-Gaussian random variables with $\|X_i\|_{\psi_2} \leq K$. Then:

$$\mathbb{E}\!\left[\max_{i \leq n} X_i\right] \leq CK\sqrt{\log n}$$

Moreover, for any $t \geq 0$:

$$\mathbb{P}\!\left(\max_{i \leq n} X_i \geq CK\sqrt{\log n} + t\right) \leq 2e^{-t^2/(C^2 K^2)}$$

Intuition

For any $\lambda > 0$: $\mathbb{E}[\max_i X_i] \leq \frac{1}{\lambda}\log\sum_i \mathbb{E}[e^{\lambda X_i}] \leq \frac{1}{\lambda}\bigl(\log n + C^2 K^2 \lambda^2/2\bigr)$. Setting $\lambda = \sqrt{2\log n}/(CK)$ gives $CK\sqrt{2\log n}$.
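A Monte Carlo sketch (illustrative parameters and seed): for i.i.d. standard Gaussians the argument above yields $\mathbb{E}[\max_{i \leq n} Z_i] \leq \sqrt{2\log n}$, which simulation comfortably confirms.

```python
import math
import random

random.seed(1)
n, trials = 1000, 1000
total = 0.0
for _ in range(trials):
    total += max(random.gauss(0, 1) for _ in range(n))
avg_max = total / trials                  # estimate of E[max of n standard Gaussians]
assert avg_max <= math.sqrt(2 * math.log(n))
```

For $n = 1000$ the bound is $\sqrt{2\log 1000} \approx 3.72$, while the true expected maximum is around $3.2$: the bound is tight up to lower-order corrections.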

Why It Matters

This is where the $\log|H|$ term in PAC bounds comes from. Bounding $\sup_{h \in H}|R(h) - \hat{R}(h)|$ over a finite hypothesis class with $|H| = n$ is bounding the maximum of $n$ sub-Gaussian variables. The $\sqrt{\log n}$ overhead is the price of uniformity.

Failure Mode

For sub-exponential variables, the maximum grows as $\log n$ (not $\sqrt{\log n}$): the heavier tails make the worst case worse. For heavy-tailed variables (polynomial tails), the maximum can grow polynomially in $n$.

The Sub-Exponential Bridge

The relationship between sub-Gaussian and sub-exponential is precise:

  • If $X$ is sub-Gaussian, then $X^2$ is sub-exponential with $\|X^2\|_{\psi_1} \leq C\|X\|_{\psi_2}^2$.
  • Conversely, if $X^2$ is sub-exponential, then $X$ is sub-Gaussian.

This explains why Bernstein's inequality for a sum of centered sub-exponential variables $\sum_i X_i$ has two regimes:

$$\mathbb{P}\!\left(\left|\sum_i X_i\right| \geq t\right) \leq 2\exp\!\left(-c \cdot \min\!\left(\frac{t^2}{\sum_i \|X_i\|_{\psi_1}^2},\; \frac{t}{\max_i \|X_i\|_{\psi_1}}\right)\right)$$

For small $t$, the $t^2$ term is the smaller of the two (sub-Gaussian regime). For large $t$, the $t$ term takes over (sub-exponential regime, heavier tail). The crossover happens at $t \approx \sum_i \|X_i\|_{\psi_1}^2 / \max_i \|X_i\|_{\psi_1}$.
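The crossover can be made concrete with identical $\psi_1$ norms (a toy computation with assumed values $n = 100$, $K = 1$): then $\sum_i \|X_i\|_{\psi_1}^2 = nK^2$, $\max_i \|X_i\|_{\psi_1} = K$, and the two exponent terms meet at $t = nK$.

```python
# Toy computation: n variables with identical psi_1 norm K (assumed values).
n, K = 100, 1.0
sum_sq = n * K**2          # sum_i ||X_i||_{psi1}^2
max_norm = K               # max_i ||X_i||_{psi1}

def quad_term(t):          # sub-Gaussian part of the Bernstein exponent
    return t**2 / sum_sq

def lin_term(t):           # sub-exponential part of the Bernstein exponent
    return t / max_norm

crossover = sum_sq / max_norm   # = n*K here
assert quad_term(crossover) == lin_term(crossover)
# below the crossover, the quadratic term is the smaller one (sub-Gaussian regime)
assert quad_term(crossover / 2) < lin_term(crossover / 2)
```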

Proof Checklist

When writing or reading a sub-Gaussian argument:

  1. Identify the random variable. What quantity do you want to concentrate? Write it as $Z = g(X_1, \ldots, X_n)$.
  2. Center it. Work with $Z - \mathbb{E}[Z]$.
  3. Decompose if needed. Express $Z - \mathbb{E}[Z]$ as a sum or martingale difference sequence.
  4. Check sub-Gaussianity of each piece. Bounded? Hoeffding gives $\sigma_i$. Lipschitz function of a Gaussian? Gaussian concentration applies.
  5. Propagate. Independent sum: $\sigma^2 = \sum_i \sigma_i^2$. Martingale: Azuma-Hoeffding. Linear combination: $\|a\|_2$ scaling.
  6. Apply the tail bound. $\mathbb{P}(|Z - \mathbb{E}[Z]| \geq t) \leq 2e^{-t^2/(2\sigma^2)}$. Invert to get confidence intervals.
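The checklist can be sketched end-to-end for the simplest case, bounded i.i.d. data, by inverting the Hoeffding tail $2e^{-2nt^2} = \delta$ to get the half-width $\sqrt{\log(2/\delta)/(2n)}$ (the function name and data here are illustrative, not from the original text):

```python
import math
import random

def hoeffding_interval(samples, delta):
    """Two-sided Hoeffding confidence interval for the mean of [0,1]-valued data.

    Inverts P(|mean - mu| >= t) <= 2*exp(-2*n*t^2) = delta.
    """
    n = len(samples)
    mean = sum(samples) / n
    half_width = math.sqrt(math.log(2 / delta) / (2 * n))
    return mean - half_width, mean + half_width

random.seed(2)
data = [random.random() for _ in range(4000)]   # Uniform[0,1], true mean 0.5
lo, hi = hoeffding_interval(data, delta=0.05)
assert lo <= 0.5 <= hi
```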

Exercises

Exercise (Core)

Problem

Let $X$ be a Rademacher random variable ($\pm 1$ with equal probability). Verify directly that $\mathbb{E}[e^{\lambda X}] = \cosh(\lambda) \leq e^{\lambda^2/2}$ for all $\lambda$.

Exercise (Core)

Problem

Let $X_1, \ldots, X_n$ be i.i.d. with $X_i \in [0, 1]$ and $\mathbb{E}[X_i] = \mu$. Using Hoeffding's lemma and the Chernoff method, prove that:

$$\mathbb{P}\!\bigl(\bar{X}_n - \mu \geq t\bigr) \leq e^{-2nt^2}$$

Exercise (Advanced)

Problem

Show that the exponential distribution $X \sim \text{Exp}(1)$ is not sub-Gaussian. Specifically, show that $\mathbb{E}[e^{\lambda(X-1)}]$ cannot be bounded by $e^{C\lambda^2}$ for any constant $C$ when $\lambda$ is large.

Exercise (Advanced)

Problem

Compute the exact $\psi_2$ norm of a Rademacher variable $\varepsilon = \pm 1$ with equal probability. Then verify that the MGF bound gives $\sigma = 1$ as the sub-Gaussian parameter.

Exercise (Advanced)

Problem

Let $X_1, \ldots, X_n$ be independent sub-Gaussian with $\|X_i\|_{\psi_2} \leq K$. Show that $\mathbb{E}[\max_{i \leq n} X_i] \leq CK\sqrt{\log n}$ using the argument: exponentiate, bound by sum, apply the MGF bound, optimize $\lambda$.


References

  1. Vershynin, High-Dimensional Probability (2018), Chapters 2-3. Definitive modern treatment using the Orlicz norm framework. Proposition 2.5.2 for the equivalence theorem.
  2. Boucheron, Lugosi, Massart, Concentration Inequalities (2013), Chapter 2. Uses the entropy method rather than Orlicz norms.
  3. Wainwright, High-Dimensional Statistics (2019), Chapter 2. Clear exposition connecting sub-Gaussian theory to statistical applications.
  4. van Handel, Probability in High Dimension (2016), Chapters 1-2. Excellent treatment of sub-Gaussian maxima and chaining.
  5. Shalev-Shwartz and Ben-David, Understanding Machine Learning (2014), Chapters 26-28. How sub-Gaussian bounds enter learning theory.
  6. Rigollet and Hutter, High-Dimensional Statistics (MIT lecture notes, 2023), Chapter 1. Concise derivation of all five characterizations with explicit constants.


Last reviewed: April 2026
