Sub-Gaussian Random Variables
Sub-Gaussian random variables: the precise characterization of 'light-tailed' behavior that underpins every concentration inequality in learning theory.
Why This Matters
Concentration inequalities are the engine of learning theory. Every generalization bound you will encounter, from VC dimension to Rademacher complexity to PAC-Bayes, relies on controlling how far a sample average deviates from its expectation. Sub-Gaussian random variables are the precise class for which this control is as tight as if the variable were Gaussian, even when it is not.
If you understand sub-Gaussianity, you understand why concentration works. If you do not, every bound you see will feel like magic.
Mental Model
A random variable is sub-Gaussian if its tails decay at least as fast as those of a Gaussian. Concretely: the probability of being far from the mean drops like $e^{-t^2/(2\sigma^2)}$, not merely $e^{-ct}$ (sub-exponential) or $t^{-\alpha}$ (polynomial). This exponential-squared decay is what gives you the $1/\sqrt{n}$ rates in learning theory.
The key insight: many non-Gaussian distributions (bounded random variables, Rademacher random variables, projections of high-dimensional uniform distributions) all satisfy this same tail behavior. Sub-Gaussianity captures exactly what these distributions have in common.
Formal Setup and Notation
Let $X$ be a real-valued random variable with $\mathbb{E}[X] = 0$ (we center without loss of generality).
Sub-Gaussian Random Variable (MGF characterization)
A centered random variable $X$ is sub-Gaussian with parameter $\sigma^2$ if for all $\lambda \in \mathbb{R}$:

$$\mathbb{E}[e^{\lambda X}] \le \exp\left(\frac{\lambda^2 \sigma^2}{2}\right)$$

The smallest such $\sigma^2$ is called the sub-Gaussian parameter (or sub-Gaussian proxy variance) of $X$.
This is the workhorse definition. It says the moment generating function of $X$ is dominated by that of an $\mathcal{N}(0, \sigma^2)$ random variable.
Sub-Gaussian Norm (Orlicz norm)
The sub-Gaussian norm (or $\psi_2$-norm) of $X$ is:

$$\|X\|_{\psi_2} = \inf\left\{t > 0 : \mathbb{E}\left[\exp\left(X^2/t^2\right)\right] \le 2\right\}$$

A random variable is sub-Gaussian if and only if $\|X\|_{\psi_2} < \infty$. The sub-Gaussian parameter and the $\psi_2$-norm are related by $\sigma \asymp \|X\|_{\psi_2}$ (up to universal constants).
Tail Characterization
An equivalent characterization (up to constants): $X$ is sub-Gaussian with parameter $\sigma^2$ if and only if for all $t \ge 0$:

$$\mathbb{P}(|X| \ge t) \le 2\exp\left(-\frac{t^2}{2\sigma^2}\right)$$
This is the form you will use most often in practice.
Equivalent Characterizations
The following conditions are equivalent (up to constants in the parameters):
- MGF condition: $\mathbb{E}[e^{\lambda X}] \le e^{\lambda^2\sigma^2/2}$ for all $\lambda \in \mathbb{R}$
- Tail condition: $\mathbb{P}(|X| \ge t) \le 2e^{-ct^2}$ for all $t \ge 0$
- Moment condition: $(\mathbb{E}|X|^p)^{1/p} \le C\sqrt{p}$ for all $p \ge 1$
- Orlicz norm: $\|X\|_{\psi_2} < \infty$
The constants differ by at most universal multiplicative factors. This equivalence is what makes the sub-Gaussian class robust: you can verify membership via whichever characterization is most convenient.
Main Theorems
Sub-Gaussian MGF Implies Tail Bound
Statement
If $X$ is a centered random variable satisfying $\mathbb{E}[e^{\lambda X}] \le e^{\lambda^2\sigma^2/2}$ for all $\lambda \in \mathbb{R}$, then for all $t \ge 0$:

$$\mathbb{P}(X \ge t) \le \exp\left(-\frac{t^2}{2\sigma^2}\right)$$
Intuition
The MGF condition is a generating condition. It controls all moments simultaneously. The tail bound is a consequence obtained by the Chernoff method: exponentiate, apply Markov, and optimize over $\lambda$. The quadratic exponent in the MGF translates directly to a quadratic exponent in the tail.
Proof Sketch
For any $\lambda > 0$, by Markov's inequality:

$$\mathbb{P}(X \ge t) = \mathbb{P}\left(e^{\lambda X} \ge e^{\lambda t}\right) \le e^{-\lambda t}\,\mathbb{E}[e^{\lambda X}] \le \exp\left(\frac{\lambda^2\sigma^2}{2} - \lambda t\right)$$

Minimize over $\lambda$: set $\lambda = t/\sigma^2$ to get $\exp(-t^2/(2\sigma^2))$.
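The optimization in the sketch can be checked numerically: minimizing $\exp(\lambda^2\sigma^2/2 - \lambda t)$ over a grid of $\lambda$ recovers the closed form $\exp(-t^2/(2\sigma^2))$. A minimal sketch (the grid and the values of $t$ and $\sigma^2$ are arbitrary illustrative choices):

```python
import math

def chernoff_bound(t, sigma2, lams):
    """Best Chernoff bound: min over the grid of exp(lam^2*sigma2/2 - lam*t)."""
    return min(math.exp(lam * lam * sigma2 / 2 - lam * t) for lam in lams)

sigma2, t = 2.0, 3.0
grid = [i / 1000 for i in range(1, 20000)]      # lambda in (0, 20)
numeric = chernoff_bound(t, sigma2, grid)
closed_form = math.exp(-t * t / (2 * sigma2))   # optimum at lam = t / sigma2 = 1.5
assert abs(numeric - closed_form) < 1e-9        # the grid contains lam = 1.5 exactly
```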
Why It Matters
This is the Chernoff method: the single most important proof technique in concentration. Every time you see an exponential tail bound, this is the engine underneath. Mastering this one argument gives you access to all of sub-Gaussian concentration theory.
Failure Mode
The bound is only tight up to constants. For a true Gaussian, the bound is off by a polynomial factor in $t$ (the exact Gaussian tail has a $\Theta(1/t)$ prefactor). For non-asymptotic purposes this rarely matters.
Hoeffding's Lemma
Statement
If $X$ is a centered random variable with $a \le X \le b$ almost surely, then for all $\lambda \in \mathbb{R}$:

$$\mathbb{E}[e^{\lambda X}] \le \exp\left(\frac{\lambda^2 (b-a)^2}{8}\right)$$

That is, $X$ is sub-Gaussian with parameter $\sigma^2 = (b-a)^2/4$.
Intuition
Bounded random variables cannot have heavy tails. They literally cannot take values outside $[a, b]$. Hoeffding's lemma quantifies this: the MGF of any bounded centered variable is dominated by a Gaussian MGF. The factor of $(b-a)^2/8$ comes from a convexity argument applied to $x \mapsto e^{\lambda x}$ on the interval $[a, b]$.
Proof Sketch
Since $a \le X \le b$, write $X = \alpha a + (1-\alpha) b$ for some random $\alpha \in [0, 1]$. By convexity of $x \mapsto e^{\lambda x}$:

$$e^{\lambda X} \le \alpha e^{\lambda a} + (1-\alpha) e^{\lambda b}$$

Use $\mathbb{E}[X] = 0$ to express $\mathbb{E}[\alpha]$ in terms of $a$ and $b$. Then apply the inequality $\varphi(u) \le u^2/8$ (which follows from Taylor expansion and bounding the remainder) to the log-MGF $\varphi(u) = -pu + \log(1 - p + pe^u)$, where $p = -a/(b-a)$. Setting $u = \lambda(b-a)$ gives the result.
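Hoeffding's lemma can be sanity-checked exactly on a two-point distribution: for $B \sim \mathrm{Bernoulli}(p)$, the centered variable $B - p$ lives in $[-p, 1-p]$, so $b - a = 1$ and the lemma asserts $\mathbb{E}[e^{\lambda(B-p)}] \le e^{\lambda^2/8}$. A quick numeric check over a grid (the grid values are arbitrary illustrative choices):

```python
import math

def mgf_centered_bernoulli(lam, p):
    """Exact E[exp(lam*(B - p))] for B ~ Bernoulli(p): a two-point expectation."""
    return (1 - p) * math.exp(-lam * p) + p * math.exp(lam * (1 - p))

# b - a = 1 for B - p, so Hoeffding's lemma gives the bound exp(lam^2 / 8).
for p in (0.1, 0.3, 0.5, 0.9):
    for lam in (-5.0, -1.0, -0.1, 0.1, 1.0, 5.0):
        assert mgf_centered_bernoulli(lam, p) <= math.exp(lam * lam / 8) + 1e-12
```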
Why It Matters
Hoeffding's lemma is the bridge between "bounded" and "sub-Gaussian." It explains why bounded losses (e.g., 0-1 loss) always yield sub-Gaussian concentration, which is exactly what appears in ERM generalization bounds for finite hypothesis classes.
Failure Mode
The bound treats all bounded distributions the same: a Rademacher variable and a uniform variable on $[-1, 1]$ get the same sub-Gaussian parameter, despite having different variances. For distributions concentrated near their mean, the variance-based Bernstein inequality gives tighter bounds. Hoeffding's lemma is worst-case over the bounded class.
Canonical Examples
Gaussian random variable
If $X \sim \mathcal{N}(0, \sigma^2)$, then $\mathbb{E}[e^{\lambda X}] = e^{\lambda^2\sigma^2/2}$ exactly. Gaussians are sub-Gaussian with parameter exactly $\sigma^2$. This is the "prototype": the definition is designed so that Gaussians saturate the bound.
Rademacher random variable
Let $\varepsilon$ be a Rademacher variable: $\mathbb{P}(\varepsilon = +1) = \mathbb{P}(\varepsilon = -1) = 1/2$. Then $\varepsilon \in [-1, 1]$, so by Hoeffding's lemma, $\varepsilon$ is sub-Gaussian with parameter $1$. In fact, the exact MGF is $\mathbb{E}[e^{\lambda\varepsilon}] = \cosh(\lambda) \le e^{\lambda^2/2}$, confirming sub-Gaussianity directly.
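The inequality $\cosh(\lambda) \le e^{\lambda^2/2}$ follows by comparing Taylor series term by term (using $(2k)! \ge 2^k k!$), and it is easy to check numerically on a grid:

```python
import math

# cosh(lam) = sum lam^(2k)/(2k)!  vs  exp(lam^2/2) = sum lam^(2k)/(2^k * k!);
# the bound holds term by term because (2k)! >= 2^k * k! for every k.
for i in range(-100, 101):
    lam = i / 10
    assert math.cosh(lam) <= math.exp(lam * lam / 2) + 1e-12
```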
Rademacher variables appear everywhere in symmetrization arguments and Rademacher complexity.
Bounded random variable
If $X \in [a, b]$ almost surely with $\mathbb{E}[X] = \mu$, then $X - \mu$ is sub-Gaussian with parameter $(b-a)^2/4$ by Hoeffding's lemma. This covers:
- 0-1 loss: $\sigma^2 = 1/4$
- Any loss bounded in $[0, B]$ (after centering): $\sigma^2 = B^2/4$
- Features bounded in $[-R, R]$: $\sigma^2 = R^2$
Sum of independent sub-Gaussians
If $X_1, \ldots, X_n$ are independent, centered, sub-Gaussian with parameters $\sigma_1^2, \ldots, \sigma_n^2$, then $\sum_{i=1}^n X_i$ is sub-Gaussian with parameter $\sum_{i=1}^n \sigma_i^2$.
This follows because MGFs multiply under independence: $\mathbb{E}[e^{\lambda\sum_i X_i}] = \prod_i \mathbb{E}[e^{\lambda X_i}] \le \prod_i e^{\lambda^2\sigma_i^2/2} = e^{\lambda^2\sum_i \sigma_i^2/2}$.
For the sample mean $\bar{X}_n = \frac{1}{n}\sum_{i=1}^n X_i$ with equal $\sigma_i^2 = \sigma^2$: the sub-Gaussian parameter is $\sigma^2/n$, giving the familiar $\mathbb{P}(|\bar{X}_n| \ge t) \le 2e^{-nt^2/(2\sigma^2)}$ concentration.
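The $\sigma^2/n$ parameter can be seen in simulation: the empirical deviation probability of a Rademacher sample mean stays below the tail bound $2e^{-nt^2/2}$ (sample size, threshold, and trial count below are arbitrary illustrative choices):

```python
import math
import random

random.seed(0)

def deviation_prob(n, t, trials=5000):
    """Empirical P(|mean of n Rademacher variables| >= t)."""
    hits = 0
    for _ in range(trials):
        s = sum(random.choice((-1, 1)) for _ in range(n))
        hits += abs(s / n) >= t
    return hits / trials

n, t = 200, 0.2
bound = 2 * math.exp(-n * t * t / 2)   # sigma^2 = 1 per term, parameter 1/n for the mean
assert deviation_prob(n, t) <= bound   # empirical frequency respects the bound
```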
Common Confusions
Sub-Gaussian parameter is NOT the standard deviation
The sub-Gaussian parameter $\sigma^2$ is an upper bound on the effective spread, but it is not equal to $\mathrm{Var}(X)$ in general. For a variable supported on $[-1, 1]$ with tiny variance, Hoeffding gives $\sigma^2 = 1$, far larger than $\mathrm{Var}(X)$. This is why Bernstein inequalities (which use variance information) are sometimes much tighter than Hoeffding-type bounds.
Not all 'nice' distributions are sub-Gaussian
Exponential, Poisson, and chi-squared random variables are not sub-Gaussian. Their tails decay like $e^{-ct}$ rather than $e^{-ct^2}$. These are sub-exponential. The distinction matters: sub-exponential concentration gives Gaussian-type rates only for small deviations of the mean and requires separate treatment of the "small $t$" and "large $t$" regimes.
The centering assumption is essential
The MGF condition implicitly requires $\mathbb{E}[X] = 0$. If $\mathbb{E}[X] = \mu \ne 0$, you apply the definition to $X - \mu$, not to $X$ directly. Failing to center will give you wrong tail bounds.
Why Sub-Gaussian is the Right Abstraction
Three reasons sub-Gaussianity appears everywhere in ML theory:
- Closure under summation: sums of independent sub-Gaussians are sub-Gaussian. This is exactly what you need for sample averages.
- Captures all bounded losses: every bounded loss function produces sub-Gaussian empirical averages. Since most learning theory starts with bounded losses, sub-Gaussianity is the natural first assumption.
- Tight enough for optimal rates: sub-Gaussian concentration gives $O(\sqrt{\log(1/\delta)/n})$ deviation bounds for sample means, which matches the CLT rate. You cannot do better in general.
The sub-Gaussian class is the largest class of distributions for which the Chernoff method gives Gaussian-quality tail bounds. Going beyond sub-Gaussian (to sub-exponential, or heavier tails) requires different tools and yields weaker concentration.
The Orlicz Norm Hierarchy
The $\psi_2$ norm makes the set of sub-Gaussian random variables a Banach space, sitting in a strict inclusion chain:

Bounded $\subset$ Sub-Gaussian $\subset$ Sub-exponential $\subset$ All-moments-finite

Each inclusion is strict: a standard Gaussian is sub-Gaussian but not bounded; an $\mathrm{Exp}(1)$ variable is sub-exponential but not sub-Gaussian.
The $\psi_1$ norm is defined analogously with $\exp(|X|/t)$: $\|X\|_{\psi_1} = \inf\{t > 0 : \mathbb{E}[\exp(|X|/t)] \le 2\}$.
Exact Norms for Common Distributions
| Distribution | $\|X\|_{\psi_2}$ | Key step |
|---|---|---|
| Rademacher | $1/\sqrt{\ln 2}$ | $X^2 = 1$ always, so solve $e^{1/t^2} = 2$ |
| Gaussian $\mathcal{N}(0,1)$ | $\sqrt{8/3}$ | $\mathbb{E}[e^{X^2/t^2}] = (1 - 2/t^2)^{-1/2}$, solve $(1 - 2/t^2)^{-1/2} = 2$ |
| Bounded in $[a,b]$, centered | $\le (b-a)/\sqrt{\ln 2}$ | Hoeffding's lemma gives $\sigma^2 = (b-a)^2/4$ |
| Gaussian $\mathcal{N}(0,\sigma^2)$ | $\sigma\sqrt{8/3}$ | Homogeneity: $\|\sigma Z\|_{\psi_2} = \sigma\|Z\|_{\psi_2}$ |
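The norms in this table can be recovered numerically by bisecting on $t$ in the defining condition $\mathbb{E}[\exp(X^2/t^2)] \le 2$, using the closed-form expressions from the key-step column. A sketch:

```python
import math

def psi2_by_bisection(mgf_of_t, lo, hi, iters=100):
    """Smallest t with mgf_of_t(t) <= 2, assuming mgf_of_t is decreasing in t."""
    for _ in range(iters):
        mid = (lo + hi) / 2
        if mgf_of_t(mid) > 2:
            lo = mid
        else:
            hi = mid
    return hi

# Rademacher: X^2 = 1 always, so E[exp(X^2/t^2)] = exp(1/t^2).
rademacher = psi2_by_bisection(lambda t: math.exp(1 / t**2), 0.1, 10.0)
assert abs(rademacher - 1 / math.sqrt(math.log(2))) < 1e-9

# Standard Gaussian: E[exp(X^2/t^2)] = (1 - 2/t^2)^(-1/2), valid for t^2 > 2.
gaussian = psi2_by_bisection(lambda t: (1 - 2 / t**2) ** -0.5, 1.5, 10.0)
assert abs(gaussian - math.sqrt(8 / 3)) < 1e-9
```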
Closure Properties
Sub-Gaussianity is useful precisely because it is preserved under the operations that appear in proofs.
Closure Under Independent Sums
Statement
For any independent, centered sub-Gaussian $X_1, \ldots, X_n$:

$$\Big\|\sum_{i=1}^n X_i\Big\|_{\psi_2}^2 \le C \sum_{i=1}^n \|X_i\|_{\psi_2}^2$$

where $C$ is a universal constant.
Intuition
By independence, MGFs multiply: $\mathbb{E}[e^{\lambda\sum_i X_i}] = \prod_i \mathbb{E}[e^{\lambda X_i}]$. The $\psi_2$ norm of each $X_i$ controls its MGF factor, and the product controls the sub-Gaussian parameter of the sum.
Failure Mode
Without independence, the bound fails. If $X_1 = X_2 = \cdots = X_n = X$, then $\sum_i X_i = nX$ with $\|nX\|_{\psi_2} = n\|X\|_{\psi_2}$, but the theorem would predict $O(\sqrt{n}\,\|X\|_{\psi_2})$. The gap between $n$ and $\sqrt{n}$ is exactly the cost of perfect correlation.
Other closure properties:
- Scaling: $\|aX\|_{\psi_2} = |a|\,\|X\|_{\psi_2}$ (immediate from the definition).
- Lipschitz maps: if $f$ is $L$-Lipschitz with $f(0) = 0$, then $\|f(X)\|_{\psi_2} \le L\|X\|_{\psi_2}$.
- Products break sub-Gaussianity: if $X, Y$ are independent sub-Gaussian, then $XY$ is sub-exponential, not sub-Gaussian. In particular, $X^2$ is sub-exponential whenever $X$ is sub-Gaussian. This is exactly why Bernstein's inequality has two regimes.
Sub-Gaussian Maxima
Maximum of Sub-Gaussians
Statement
If $X_1, \ldots, X_n$ are each sub-Gaussian with parameter $\sigma^2$ (no independence required), then:

$$\mathbb{E}\Big[\max_{i \le n} X_i\Big] \le \sigma\sqrt{2\log n}$$

Moreover, for any $\delta \in (0, 1)$:

$$\mathbb{P}\Big(\max_{i \le n} X_i \ge \sigma\sqrt{2\log(n/\delta)}\Big) \le \delta$$
Intuition
For any $\lambda > 0$, Jensen's inequality gives $\exp(\lambda\,\mathbb{E}[\max_i X_i]) \le \mathbb{E}[e^{\lambda \max_i X_i}] \le \sum_i \mathbb{E}[e^{\lambda X_i}] \le n\,e^{\lambda^2\sigma^2/2}$, so $\mathbb{E}[\max_i X_i] \le \frac{\log n}{\lambda} + \frac{\lambda\sigma^2}{2}$. Setting $\lambda = \sqrt{2\log n}/\sigma$ gives $\sigma\sqrt{2\log n}$.
Why It Matters
This is where the $\sqrt{\log|\mathcal{H}|}$ term in PAC bounds comes from. Bounding uniform deviations over a finite hypothesis class $\mathcal{H}$ is bounding the maximum of $|\mathcal{H}|$ sub-Gaussian variables. The $\sqrt{\log n}$ overhead is the price of uniformity.
Failure Mode
For sub-exponential variables, the maximum grows as $\log n$ (not $\sqrt{\log n}$). The heavier tails make the worst case worse. For heavy-tailed variables (polynomial tails), the maximum can grow polynomially in $n$.
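The $\sigma\sqrt{2\log n}$ ceiling is easy to observe by simulation: the average maximum of $n$ independent standard Gaussians stays below $\sqrt{2\log n}$ (the trial counts below are arbitrary illustrative choices):

```python
import math
import random

random.seed(1)

def avg_max_gaussians(n, trials=2000):
    """Monte Carlo estimate of E[max of n independent N(0,1) draws]."""
    return sum(max(random.gauss(0, 1) for _ in range(n))
               for _ in range(trials)) / trials

for n in (10, 100, 1000):
    assert avg_max_gaussians(n) <= math.sqrt(2 * math.log(n))
```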
The Sub-Exponential Bridge
The relationship between sub-Gaussian and sub-exponential is precise:
- If $X$ is sub-Gaussian, then $X^2$ is sub-exponential with $\|X^2\|_{\psi_1} = \|X\|_{\psi_2}^2$.
- Conversely, if $X \ge 0$ is sub-exponential, then $\sqrt{X}$ is sub-Gaussian.
This explains why Bernstein's inequality for centered sub-exponential variables has two regimes:

$$\mathbb{P}\Big(\Big|\frac{1}{n}\sum_{i=1}^n X_i\Big| \ge t\Big) \le 2\exp\Big(-c\,n\min\Big(\frac{t^2}{K^2},\ \frac{t}{K}\Big)\Big), \qquad K = \max_i \|X_i\|_{\psi_1}$$

For small $t$, the $t^2/K^2$ term dominates (sub-Gaussian regime). For large $t$, the $t/K$ term dominates (sub-exponential regime, heavier tail). The crossover happens at $t = K$.
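The sub-exponential behavior of a squared Gaussian is visible directly in its tail: for $Z \sim \mathcal{N}(0,1)$, $\mathbb{P}(Z^2 \ge t) = \mathrm{erfc}(\sqrt{t/2})$, and $\log \mathbb{P}(Z^2 \ge t)/t$ approaches $-1/2$, i.e., decay that is linear (not quadratic) in the exponent. A small check:

```python
import math

def chi2_tail(t):
    """P(Z^2 >= t) = P(|Z| >= sqrt(t)) for Z ~ N(0,1), via erfc."""
    return math.erfc(math.sqrt(t / 2))

# log-tail / t drifts toward -1/2: the exponent is linear in t (sub-exponential),
# so no bound of the form exp(-c * t^2) can hold for large t.
ratios = [math.log(chi2_tail(t)) / t for t in (20.0, 40.0, 80.0)]
assert all(-0.60 < r < -0.50 for r in ratios)
```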
Proof Checklist
When writing or reading a sub-Gaussian argument:
- Identify the random variable. What quantity do you want to concentrate? Write it as $Z$.
- Center it. Work with $Z - \mathbb{E}[Z]$.
- Decompose if needed. Express $Z - \mathbb{E}[Z]$ as a sum or martingale difference sequence.
- Check sub-Gaussianity of each piece. Bounded? Hoeffding gives $\sigma^2 = (b-a)^2/4$. Lipschitz function of a Gaussian? Gaussian concentration applies.
- Propagate. Independent sum: $\sigma^2 = \sum_i \sigma_i^2$. Martingale: Azuma-Hoeffding. Linear combination: scaling.
- Apply tail bound. $\mathbb{P}(|Z - \mathbb{E}[Z]| \ge t) \le 2e^{-t^2/(2\sigma^2)}$. Invert to get confidence intervals.
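The checklist, instantiated in the simplest case (i.i.d. 0-1 losses with the Hoeffding tail inverted at confidence level $1 - \delta$), might look like the following sketch; the simulated error rate, sample size, and $\delta$ are arbitrary illustrative choices:

```python
import math
import random

random.seed(2)

def hoeffding_radius(n, delta):
    """Invert 2*exp(-2*n*t^2) = delta: confidence radius for [0,1]-valued averages."""
    return math.sqrt(math.log(2 / delta) / (2 * n))

# Steps 1-2: the quantity to concentrate is the empirical 0-1 loss minus its mean.
# Steps 3-5: it is an average of independent bounded terms, so Hoeffding applies.
# Step 6: invert the tail bound to get a confidence interval.
mu, n, delta = 0.3, 5000, 0.001
losses = [1 if random.random() < mu else 0 for _ in range(n)]
emp = sum(losses) / n
radius = hoeffding_radius(n, delta)
# With probability >= 1 - delta, the true risk lies in [emp - radius, emp + radius].
assert emp - radius <= mu <= emp + radius
```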
Exercises
Problem
Let $\varepsilon$ be a Rademacher random variable ($\pm 1$ with equal probability). Verify directly that $\mathbb{E}[e^{\lambda\varepsilon}] = \cosh(\lambda) \le e^{\lambda^2/2}$ for all $\lambda \in \mathbb{R}$.
Problem
Let $X_1, \ldots, X_n$ be i.i.d. with $\mathbb{E}[X_i] = \mu$ and $a \le X_i \le b$. Using Hoeffding's lemma and the Chernoff method, prove that:

$$\mathbb{P}\Big(\Big|\frac{1}{n}\sum_{i=1}^n X_i - \mu\Big| \ge t\Big) \le 2\exp\Big(-\frac{2nt^2}{(b-a)^2}\Big)$$
Problem
Show that the exponential distribution $\mathrm{Exp}(1)$ is not sub-Gaussian. Specifically, show that $\mathbb{P}(X \ge t) = e^{-t}$ cannot be bounded by $2e^{-t^2/(2\sigma^2)}$ for any constant $\sigma^2$ when $t$ is large.
Problem
Compute the exact $\psi_2$ norm of a Rademacher variable ($\pm 1$ with equal probability). Then verify that the MGF bound gives $\sigma^2 = 1$ as the sub-Gaussian parameter.
Problem
Let $X_1, \ldots, X_n$ be independent sub-Gaussian with parameter $\sigma^2$. Show that $\mathbb{E}[\max_{i \le n} X_i] \le \sigma\sqrt{2\log n}$ using the MGF argument: exponentiate, bound the max by a sum, apply the MGF bound, optimize $\lambda$.
References
- Vershynin, High-Dimensional Probability (2018), Chapters 2-3. Definitive modern treatment using the Orlicz norm framework. Proposition 2.5.2 for the equivalence theorem.
- Boucheron, Lugosi, Massart, Concentration Inequalities (2013), Chapter 2. Uses the entropy method rather than Orlicz norms.
- Wainwright, High-Dimensional Statistics (2019), Chapter 2. Clear exposition connecting sub-Gaussian theory to statistical applications.
- van Handel, Probability in High Dimension (2016), Chapters 1-2. Excellent treatment of sub-Gaussian maxima and chaining.
- Shalev-Shwartz and Ben-David, Understanding Machine Learning (2014), Chapters 26-28. How sub-Gaussian bounds enter learning theory.
- Rigollet and Hutter, High-Dimensional Statistics (MIT lecture notes, 2023), Chapter 1. Concise derivation of all five characterizations with explicit constants.
Next Topics
Natural next steps from sub-Gaussian random variables:
- Sub-exponential random variables: the next distributional class, for heavier tails
- Epsilon-nets and covering numbers: combining sub-Gaussian concentration with geometric arguments
- Hanson-Wright inequality: sub-Gaussian concentration for quadratic forms
Last reviewed: April 2026
Builds on This
- Epsilon-Nets and Covering Numbers (Layer 3)
- Hanson-Wright Inequality (Layer 3)
- Matrix Concentration (Layer 3)
- McDiarmid's Inequality (Layer 3)
- Measure Concentration and Geometric Functional Analysis (Layer 3)
- Restricted Isometry Property (Layer 3)
- Sparse Recovery and Compressed Sensing (Layer 4)
- Sub-Exponential Random Variables (Layer 2)