
Concentration Probability

Sub-Exponential Random Variables

The distributional class between sub-Gaussian and heavy-tailed: heavier tails than Gaussian, the psi_1 norm, Bernstein condition, and the two-regime concentration bound.


Why This Matters

Sub-Gaussian random variables have tails decaying like $e^{-ct^2}$ and give the cleanest concentration bounds. But many quantities in ML are not sub-Gaussian:

  • Chi-squared random variables (sums of squared Gaussians)
  • Products of sub-Gaussian random variables
  • Exponential random variables (waiting times)
  • Squared losses in regression with Gaussian noise

These have tails decaying like $e^{-ct}$: exponentially, but with a linear exponent rather than a quadratic one. The sub-exponential class captures exactly this level of tail behavior, and the Bernstein inequality for sub-exponential sums provides the right concentration bound: sub-Gaussian for small deviations, exponential for large deviations.

Mental Model

The tail decay hierarchy is:

| Class | Tail decay | MGF | Example |
| --- | --- | --- | --- |
| Sub-Gaussian | $e^{-ct^2}$ | Finite for all $\lambda$ | Bounded, Gaussian |
| Sub-exponential | $e^{-ct}$ | Finite for $\lvert\lambda\rvert < 1/b$ | Chi-squared, exponential |
| Heavy-tailed | $1/t^p$ | May not exist | Cauchy, Pareto |

Sub-exponential is the intermediate class. The MGF exists in a neighborhood of zero (not everywhere), which means the Chernoff method works for small $\lambda$ but breaks for large $\lambda$. This produces a concentration bound with two regimes.
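This difference in MGF behavior can be seen directly from closed forms. A minimal sketch comparing a sub-Gaussian variable with a sub-exponential one (the centered $\text{Exp}(1)$ MGF $e^{-\lambda}/(1-\lambda)$ is derived in the examples below):

```python
import math

# Closed-form MGFs of two centered variables: one sub-Gaussian, one sub-exponential.
def mgf_gaussian(lam):
    # X ~ N(0, 1): E[e^{lam X}] = e^{lam^2 / 2}, finite for every lam
    return math.exp(lam**2 / 2)

def mgf_exp_centered(lam):
    # X ~ Exp(1), centered: E[e^{lam (X - 1)}] = e^{-lam} / (1 - lam),
    # finite only for lam < 1
    if lam >= 1:
        return math.inf
    return math.exp(-lam) / (1 - lam)

for lam in [0.5, 0.9, 0.99, 1.0, 2.0]:
    print(f"lam={lam}: gaussian={mgf_gaussian(lam):.3f}, exp={mgf_exp_centered(lam):.3f}")
```

The Gaussian MGF stays finite for every $\lambda$, while the exponential MGF blows up as $\lambda \to 1$: exactly the "finite only in a neighborhood of zero" behavior in the table.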

Formal Setup and Definitions

Let $X$ be a centered random variable ($\mathbb{E}[X] = 0$).

Definition

Sub-Exponential Random Variable (MGF characterization)

A centered random variable $X$ is sub-exponential with parameters $(\nu^2, b)$ if for all $|\lambda| < 1/b$:

$$\mathbb{E}[e^{\lambda X}] \leq \exp\!\left(\frac{\nu^2 \lambda^2}{2}\right)$$

The parameter $\nu^2$ plays the role of a "variance proxy" (controlling the sub-Gaussian regime), and $b$ controls the radius where the MGF bound holds.

Definition

Sub-Exponential Norm (Orlicz psi_1 norm)

The sub-exponential norm (or $\psi_1$-norm) of $X$ is:

$$\|X\|_{\psi_1} = \inf\{t > 0 : \mathbb{E}[e^{|X|/t}] \leq 2\}$$

A random variable is sub-exponential if and only if $\|X\|_{\psi_1} < \infty$.

Compare to the sub-Gaussian norm: $\|X\|_{\psi_2} = \inf\{t > 0 : \mathbb{E}[e^{X^2/t^2}] \leq 2\}$. The $\psi_1$ norm uses $e^{|X|/t}$ (linear in $|X|$) while $\psi_2$ uses $e^{X^2/t^2}$ (quadratic). The linear growth in the exponent is what makes sub-exponential tails heavier than sub-Gaussian.
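The infimum in the definition can be computed numerically whenever the MGF of $|X|$ is known. A sketch (the bisection helper is illustrative; for $X \sim \text{Exp}(1)$ we have $\mathbb{E}[e^{X/t}] = t/(t-1)$, which equals $2$ exactly at $t = 2$, so $\|X\|_{\psi_1} = 2$):

```python
def psi1_norm(mgf_abs, lo=1e-6, hi=1e6):
    """inf{t > 0 : E[e^{|X|/t}] <= 2}, found by bisection.

    mgf_abs(s) should return E[e^{s |X|}]; it is nondecreasing in s,
    so E[e^{|X|/t}] is nonincreasing in t and we can bisect on t.
    """
    for _ in range(100):
        mid = (lo + hi) / 2
        if mgf_abs(1.0 / mid) <= 2.0:
            hi = mid  # condition met: the infimum is at or below mid
        else:
            lo = mid
    return hi

# X ~ Exp(1): E[e^{s X}] = 1 / (1 - s) for s < 1, infinite otherwise
exp_mgf = lambda s: 1.0 / (1.0 - s) if s < 1 else float("inf")
print(psi1_norm(exp_mgf))  # -> 2.0 (solve 1/(1 - 1/t) = 2)
```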

Definition

Bernstein Condition

A centered random variable $X$ satisfies the Bernstein condition with parameters $(\sigma^2, b)$ if for all integers $k \geq 2$:

$$\mathbb{E}[|X|^k] \leq \frac{k!}{2} \sigma^2 b^{k-2}$$

This is equivalent (up to constants) to being sub-exponential with parameters $(\sigma^2, b)$. The condition controls all moments: the $k$-th moment grows at most like $k! \cdot b^k$, compared to sub-Gaussian where moments grow like $(C\sqrt{k})^k$.

Equivalent Characterizations

The following are equivalent (up to constants in parameters):

  1. MGF condition: $\mathbb{E}[e^{\lambda X}] \leq e^{\nu^2\lambda^2/2}$ for $|\lambda| < 1/b$
  2. Tail condition: $\mathbb{P}(|X| \geq t) \leq 2\exp(-t/C_1)$ for all $t > 0$
  3. Moment condition: $(\mathbb{E}[|X|^k])^{1/k} \leq C_2 k$ for all $k \geq 1$
  4. Orlicz norm: $\|X\|_{\psi_1} \leq C_3$
  5. Bernstein condition: $\mathbb{E}[|X|^k] \leq \frac{k!}{2}\sigma^2 b^{k-2}$ for $k \geq 2$

Compare to sub-Gaussian characterization 3: $(\mathbb{E}[|X|^k])^{1/k} \leq C\sqrt{k}$. Sub-exponential moments grow like $k$; sub-Gaussian moments grow like $\sqrt{k}$. The extra factor of $\sqrt{k}$ is the precise difference between the two classes.
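The moment characterization can be checked against exact moments. A sketch using $X \sim \text{Exp}(1)$, where $\mathbb{E}[X^k] = k!$, and $Z \sim \mathcal{N}(0,1)$, where $\mathbb{E}|Z|^k = 2^{k/2}\,\Gamma((k+1)/2)/\sqrt{\pi}$:

```python
import math

def exp_moment_root(k):
    # X ~ Exp(1): E[X^k] = k!, so (E[X^k])^{1/k} = (k!)^{1/k} ~ k/e, linear in k
    return math.factorial(k) ** (1 / k)

def gauss_moment_root(k):
    # Z ~ N(0,1): E|Z|^k = 2^{k/2} Gamma((k+1)/2) / sqrt(pi); the k-th root
    # grows like sqrt(k)
    return (2 ** (k / 2) * math.gamma((k + 1) / 2) / math.sqrt(math.pi)) ** (1 / k)

for k in [2, 4, 8, 16, 32]:
    print(f"k={k:2d}: exp {exp_moment_root(k):7.2f} (~k), "
          f"gauss {gauss_moment_root(k):6.2f} (~sqrt k)")
```

The exponential column grows linearly in $k$ while the Gaussian column grows like $\sqrt{k}$, matching characterization 3 for each class.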

Main Theorems

Theorem

Products of Sub-Gaussians are Sub-Exponential

Statement

If $X$ and $Y$ are sub-Gaussian random variables (not necessarily independent), then the product $XY$ is sub-exponential, with:

$$\|XY\|_{\psi_1} \leq \|X\|_{\psi_2} \cdot \|Y\|_{\psi_2}$$

Intuition

Multiplying two "Gaussian-like" variables produces something "exponential-like." A concrete example: if $Z \sim \mathcal{N}(0, 1)$, then $Z$ is sub-Gaussian, but $Z^2 = Z \cdot Z$ is chi-squared (with 1 degree of freedom) and sub-exponential. The tails get heavier because multiplying two variables that can be large makes the product even larger.

Proof Sketch

For any $t > 0$: $\mathbb{P}(|XY| \geq t) = \mathbb{P}(|X| \cdot |Y| \geq t)$

Split into cases: either $|X| \geq \sqrt{t}$ or $|Y| \geq \sqrt{t}$ (or both). By union bound:

$$\mathbb{P}(|XY| \geq t) \leq \mathbb{P}(|X| \geq \sqrt{t}) + \mathbb{P}(|Y| \geq \sqrt{t})$$

Since $X$ is sub-Gaussian: $\mathbb{P}(|X| \geq \sqrt{t}) \leq 2e^{-ct}$. Similarly for $Y$. So $\mathbb{P}(|XY| \geq t) \leq 4e^{-c't}$, which is a sub-exponential tail.

For the norm bound, use the inequality $|XY| \leq (X^2 + Y^2)/2$ and the relationship between $\psi_1$ and $\psi_2$ norms.

Why It Matters

This explains why chi-squared variables, quadratic forms, and many statistics in learning theory are sub-exponential: they involve products or squares of sub-Gaussian quantities. When you see $X^\top X$ or $\|Xw\|^2$ in a bound, expect sub-exponential concentration, not sub-Gaussian.

Failure Mode

The product of two sub-exponential variables is in general not sub-exponential; it may be even heavier-tailed. While the class is closed under sums and scaling, it is not closed under multiplication: just as multiplying two sub-Gaussians produced a sub-exponential variable, each further multiplication pushes you down the tail hierarchy. This limits the composability of sub-exponential bounds.

Theorem

Bernstein Inequality for Sub-Exponential Sums

Statement

Let $X_1, \ldots, X_n$ be independent centered sub-exponential random variables with parameters $(\nu_i^2, b)$. Then for any $t > 0$:

$$\mathbb{P}\!\left(\left|\sum_{i=1}^n X_i\right| \geq t\right) \leq 2\exp\!\left(-c \min\!\left(\frac{t^2}{\sum_i \nu_i^2},\, \frac{t}{b}\right)\right)$$

where $c > 0$ is a universal constant. For i.i.d. variables with common parameter $\nu^2$ and sample mean $\bar{X}_n = \frac{1}{n}\sum_i X_i$:

$$\mathbb{P}\!\left(|\bar{X}_n| \geq t\right) \leq 2\exp\!\left(-cn\min\!\left(\frac{t^2}{\nu^2},\, \frac{t}{b}\right)\right)$$

Intuition

The bound has two regimes separated by the threshold $t^* = \nu^2/b$:

Small deviations ($t \leq t^*$): The $t^2/\nu^2$ term dominates the minimum. The bound is $\exp(-cnt^2/\nu^2)$: a sub-Gaussian tail. For moderate deviations, the sub-exponential variable behaves as if it were sub-Gaussian.

Large deviations ($t > t^*$): The $t/b$ term dominates. The bound is $\exp(-cnt/b)$: an exponential (not Gaussian) tail. For large deviations, the heavier tails kick in and concentration is weaker.

The transition between regimes is smooth. The sub-Gaussian regime gives the familiar $\sqrt{\log(1/\delta)/n}$ rates for moderate confidence levels.
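Inverting the i.i.d. bound at a confidence level $\delta$ makes the two regimes visible in the deviation width. A sketch (the constant $c$ and the parameter values are illustrative, not canonical):

```python
import math

# Invert 2 exp(-c n min(t^2 / nu2, t / b)) = delta for t.
# Setting u = log(2/delta) / (c n), the solution is max(sqrt(nu2 u), b u):
# the sqrt term is the sub-Gaussian width, the linear term the exponential one.
def bernstein_width(n, delta, nu2, b, c=1.0):
    u = math.log(2 / delta) / (c * n)
    return max(math.sqrt(nu2 * u), b * u)

for n in [10, 100, 10_000]:
    print(n, bernstein_width(n, delta=1e-3, nu2=1.0, b=5.0))
```

For small $n$ (large $u$) the linear term $bu$ dominates: the exponential regime. For large $n$ the width is $\sqrt{\nu^2 \log(2/\delta) / (cn)}$: the familiar sub-Gaussian $\sqrt{\log(1/\delta)/n}$ rate.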

Proof Sketch

Use the Chernoff method. For $|\lambda| < 1/b$:

$$\mathbb{P}\!\left(\sum_i X_i \geq t\right) \leq e^{-\lambda t} \prod_i \mathbb{E}[e^{\lambda X_i}] \leq e^{-\lambda t} \exp\!\left(\frac{\lambda^2 \sum_i \nu_i^2}{2}\right)$$

Optimize over $\lambda \in (0, 1/b)$:

  • If the unconstrained optimum $\lambda^* = t/\sum_i\nu_i^2$ satisfies $\lambda^* < 1/b$, we are in the sub-Gaussian regime: the bound is $\exp(-t^2/(2\sum_i\nu_i^2))$.
  • If $\lambda^* \geq 1/b$, set $\lambda = 1/(2b)$ (safely inside the allowed range): the bound is $\exp(-t/(2b) + \sum_i\nu_i^2/(8b^2))$. For large $t$, this gives $\exp(-ct/b)$.

Taking the minimum of the two cases gives the stated bound.
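The constrained optimization above can be carried out numerically. A sketch (parameters $n$, $\nu^2$, $b$ are illustrative, and for simplicity it clips at $\lambda = 1/b$ rather than $1/(2b)$, ignoring the strict-inequality subtlety a careful proof handles):

```python
def chernoff_exponent(t, n, nu2, b):
    """Exponent E in the Chernoff bound exp(-E) for a sum of n iid
    (nu2, b) sub-exponential variables: maximize lam*t - lam^2 * n * nu2 / 2
    over lam in (0, 1/b]."""
    lam_star = t / (n * nu2)       # unconstrained optimum
    lam = min(lam_star, 1.0 / b)   # clip at the MGF radius
    return lam * t - lam ** 2 * n * nu2 / 2

n, nu2, b = 100, 1.0, 1.0          # regime threshold at t* = n * nu2 / b = 100
# Sub-Gaussian regime (t <= t*): exponent equals t^2 / (2 n nu2)
print(chernoff_exponent(5.0, n, nu2, b))    # 0.125 == 25 / 200
# Exponential regime (t > t*): exponent is t / b - n * nu2 / (2 b^2), linear in t
print(chernoff_exponent(500.0, n, nu2, b))  # 450.0 == 500 - 50
```

Below the threshold the exponent is quadratic in $t$; above it the clipped $\lambda$ makes the exponent linear in $t$, which is exactly the two-regime minimum in the theorem.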

Why It Matters

This is the correct concentration bound for quantities like $\frac{1}{n}\sum_i (X_i^2 - \mathbb{E}[X_i^2])$ where each $X_i$ is Gaussian. The sub-Gaussian bound does not apply (these are not sub-Gaussian), but the Bernstein bound does. The two-regime behavior is what you actually observe in practice: moderate deviations look Gaussian, but extreme deviations reveal the heavier tails.

Failure Mode

The bound requires knowing (or bounding) both $\nu^2$ and $b$. For the sample variance of Gaussian data, $\nu^2 = O(\sigma^4)$ and $b = O(\sigma^2)$. If these parameters are poorly estimated, the bound may be loose in one or both regimes.

Key Examples of Sub-Exponential Variables

Example

Chi-squared with k degrees of freedom

If $Z_1, \ldots, Z_k \sim \mathcal{N}(0, 1)$ independently, then $W = \sum_{j=1}^k Z_j^2 \sim \chi^2(k)$ with $\mathbb{E}[W] = k$ and $\text{Var}(W) = 2k$.

The centered variable $W - k$ is sub-exponential with parameters $\nu^2 = O(k)$ and $b = O(1)$. It is not sub-Gaussian because each $Z_j^2$ has MGF $\mathbb{E}[e^{\lambda Z_j^2}] = (1 - 2\lambda)^{-1/2}$, which diverges at $\lambda = 1/2$. A sub-Gaussian MGF would be finite for all $\lambda$.

The Bernstein bound gives: $\mathbb{P}(|W - k| \geq t) \leq 2\exp(-c\min(t^2/k, t))$.
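A quick Monte Carlo check of these facts (a sketch; assumes `numpy` is available, with a fixed seed for reproducibility):

```python
import numpy as np

rng = np.random.default_rng(0)
k, n = 5, 200_000
W = (rng.standard_normal((n, k)) ** 2).sum(axis=1)  # chi^2(k) samples

print(W.mean(), W.var())  # should be close to k = 5 and 2k = 10

# Tail of |W - k|: decays fast, but with heavier-than-Gaussian behavior far out
for t in [2.0, 10.0, 30.0]:
    print(t, (np.abs(W - k) >= t).mean())
```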

Example

Exponential distribution

If $X \sim \text{Exp}(1)$, then $X - 1$ is centered with $\mathbb{E}[e^{\lambda(X-1)}] = e^{-\lambda}/(1 - \lambda)$ for $\lambda < 1$.

This is finite only for $\lambda < 1$, confirming sub-exponential (not sub-Gaussian) behavior. The $\psi_1$ norm is $\|X - 1\|_{\psi_1} = O(1)$.

Example

Product of two independent standard normals

Let $Z_1, Z_2 \sim \mathcal{N}(0, 1)$ independently. The product $Z_1 Z_2$ has $\mathbb{E}[Z_1 Z_2] = 0$ and, by the union-bound argument from the proof sketch above (split on $|Z_1| \geq \sqrt{t}$ or $|Z_2| \geq \sqrt{t}$ and use $\mathbb{P}(|Z| \geq \sqrt{t}) \leq 2e^{-t/2}$), the tail bound:

$$\mathbb{P}(|Z_1 Z_2| \geq t) \leq 4e^{-t/2}$$

which is a sub-exponential tail. Indeed $\|Z_1 Z_2\|_{\psi_1} \leq \|Z_1\|_{\psi_2} \|Z_2\|_{\psi_2} = O(1)$.
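A simulation comparing the empirical tail of $Z_1 Z_2$ against the union-bound estimate $4e^{-t/2}$ from the proof sketch (assumes `numpy`, fixed seed; the sample size is illustrative):

```python
import math
import numpy as np

rng = np.random.default_rng(1)
n = 1_000_000
P = rng.standard_normal(n) * rng.standard_normal(n)  # products Z1 * Z2

for t in [1.0, 3.0, 6.0]:
    empirical = (np.abs(P) >= t).mean()
    bound = 4 * math.exp(-t / 2)
    print(f"t={t}: empirical={empirical:.5f}, union bound={bound:.5f}")
```

The empirical tail sits well below the bound (the union bound is loose but has the right exponential shape).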

Sub-Exponential Closure Properties

  1. Sum of independents: If $X_1, \ldots, X_n$ are independent sub-exponential, then $\sum_i X_i$ is sub-exponential with $\|\sum_i X_i\|_{\psi_1} \leq C \sum_i \|X_i\|_{\psi_1}$. (Note: unlike sub-Gaussian, the norm adds linearly, not in quadrature.)

  2. Scalar multiplication: $\|aX\|_{\psi_1} = |a| \cdot \|X\|_{\psi_1}$.

  3. Product of sub-Gaussians: $\|XY\|_{\psi_1} \leq \|X\|_{\psi_2} \|Y\|_{\psi_2}$.

  4. Not closed under products: The product of two sub-exponential variables may be heavier than sub-exponential.

Connection to Bernstein's Classical Inequality

The classical Bernstein inequality (from the concentration-inequalities page) for bounded random variables:

$$\mathbb{P}\!\left(\sum_i (X_i - \mu_i) \geq t\right) \leq \exp\!\left(-\frac{t^2/2}{\sum_i \sigma_i^2 + Mt/3}\right)$$

is a special case of the sub-exponential Bernstein inequality. Bounded random variables are sub-exponential (since they are even sub-Gaussian), and the denominator $\sum_i \sigma_i^2 + Mt/3$ captures both regimes: sub-Gaussian when $t \ll \sum_i \sigma_i^2/M$ and exponential when $t \gg \sum_i \sigma_i^2/M$.
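Numerically, the single classical exponent tracks the two-regime minimum up to constant factors. A sketch with illustrative parameters ($S = \sum_i \sigma_i^2$ and almost-sure bound $M$ are made-up values):

```python
def classical_exponent(t, S, M):
    # Exponent of the classical Bernstein bound: (t^2 / 2) / (S + M t / 3)
    return (t ** 2 / 2) / (S + M * t / 3)

S, M = 100.0, 1.0  # regime threshold near S / M = 100
for t in [1.0, 10.0, 100.0, 1000.0]:
    two_regime = min(t ** 2 / S, t / M)
    print(f"t={t}: classical={classical_exponent(t, S, M):.3f}, min-form={two_regime:.3f}")
```

For $t \ll S/M$ the denominator is dominated by $S$ and the exponent is $\approx t^2/(2S)$; for $t \gg S/M$ it is dominated by $Mt/3$ and the exponent is $\approx 3t/(2M)$: the same two regimes, smoothed into one formula.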

Common Confusions

Watch Out

Sub-exponential is BETWEEN sub-Gaussian and heavy-tailed

Every sub-Gaussian variable is sub-exponential, but not vice versa. Sub-exponential is a weaker condition (allows heavier tails). The hierarchy is: bounded $\subset$ sub-Gaussian $\subset$ sub-exponential $\subset$ finite variance $\subset$ all distributions.

Watch Out

The MGF bound holds only for small lambda

For sub-Gaussian variables, $\mathbb{E}[e^{\lambda X}] \leq e^{C\lambda^2}$ for all $\lambda$. For sub-exponential, the bound holds only for $|\lambda| < 1/b$. Beyond this, the MGF may diverge. This is why the Chernoff method produces two regimes: you can only optimize $\lambda$ up to the boundary $1/b$.

Watch Out

psi_1 norm vs psi_2 norm

$\psi_2$ (sub-Gaussian) uses $e^{X^2/t^2}$ in the definition: the square of $X$ in the exponent. $\psi_1$ (sub-exponential) uses $e^{|X|/t}$: the absolute value of $X$. The square in $\psi_2$ is what forces the tails to be Gaussian-like. Every sub-Gaussian variable is sub-exponential because $e^{|X|/t} \leq 1 + e^{X^2/t^2}$ for appropriate $t$.

Summary

  • Sub-exponential tails: $\mathbb{P}(|X| \geq t) \leq 2e^{-ct}$ (linear exponent)
  • Sub-Gaussian tails: $\mathbb{P}(|X| \geq t) \leq 2e^{-ct^2}$ (quadratic exponent)
  • Products of sub-Gaussians are sub-exponential ($\psi_2 \cdot \psi_2 \to \psi_1$)
  • Chi-squared and squared losses are sub-exponential, not sub-Gaussian
  • Bernstein bound has two regimes: sub-Gaussian for small $t$, exponential for large $t$
  • Threshold between regimes: $t^* = \nu^2/b$
  • $\psi_1$ norm adds linearly for sums; $\psi_2$ norm adds in quadrature

Exercises

Exercise (Core)

Problem

Show that $X \sim \text{Exp}(1)$ is sub-exponential but not sub-Gaussian. Compute the MGF $\mathbb{E}[e^{\lambda(X-1)}]$ and show it is finite only for $\lambda < 1$.

Exercise (Core)

Problem

Let $Z \sim \mathcal{N}(0, 1)$. Show that $Z^2 - 1$ (centered chi-squared with 1 degree of freedom) is sub-exponential by verifying the tail bound $\mathbb{P}(Z^2 - 1 \geq t) \leq e^{-ct}$ for large $t$.

Exercise (Advanced)

Problem

Let $X_1, \ldots, X_n \sim \mathcal{N}(0, \sigma^2)$ independently. Using the Bernstein inequality for sub-exponential variables, bound $\mathbb{P}\!\left(\left|\frac{1}{n}\sum_i X_i^2 - \sigma^2\right| \geq t\right)$. Identify the two regimes.


References


  • Wainwright, High-Dimensional Statistics (2019), Chapter 2

  • Rigollet & Hutter, High-Dimensional Statistics (MIT lecture notes, 2023)

  • van Handel, Probability in High Dimension (2016), Chapters 1-3


Last reviewed: April 2026
