Beta. Content is under active construction and has not been peer-reviewed. Report errors on GitHub.

Concentration Probability

Hanson-Wright Inequality

Concentration of quadratic forms X^T A X for sub-Gaussian random vectors: the two-term bound involving the Frobenius norm (Gaussian regime) and operator norm (extreme regime).

Advanced · Tier 2 · Stable · ~55 min

Why This Matters

Scalar concentration inequalities (Hoeffding, Bernstein) control linear functions of independent random variables: sums of the form iaiXi\sum_i a_i X_i. But many quantities in statistics and machine learning are quadratic: sample covariance entries 1nkXk(i)Xk(j)\frac{1}{n}\sum_k X_k^{(i)} X_k^{(j)} for iji \neq j, kernel evaluations k(x,x)=ϕ(x)ϕ(x)k(x, x') = \phi(x)^\top \phi(x'), chi-squared statistics, and second-order U-statistics. For these, you need concentration of the quadratic form XAXX^\top A X where XX is a random vector with independent entries.

The Hanson-Wright inequality is the definitive tool for this. It gives a two-term bound that captures two different regimes of deviation, and it is tight up to constants.

Mental Model

Consider the quadratic form XAXX^\top A X where XRnX \in \mathbb{R}^n has independent sub-Gaussian entries. This is a sum of terms AijXiXjA_{ij} X_i X_j involving products of random variables, not just single variables. Products are harder to control because the tails are heavier (the product of two sub-Gaussians is sub-exponential, not sub-Gaussian).

The Hanson-Wright bound says: the deviation of XAXX^\top A X from its expectation is controlled by two terms:

  1. Frobenius term AF\|A\|_F: dominates for small deviations, behaves like Gaussian concentration (sub-Gaussian tail ect2e^{-ct^2})
  2. Operator term Aop\|A\|_{\text{op}}: dominates for large deviations, behaves like sub-exponential concentration (tail ecte^{-ct})

The transition between regimes happens at tAF2/Aopt \approx \|A\|_F^2 / \|A\|_{\text{op}}.
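To make the two-term structure concrete, here is a minimal numpy sketch that evaluates the exponent inside the bound for an example diagonal matrix, taking the sub-Gaussian constant K = 1 for illustration (the matrix and the name `hw_exponent` are arbitrary choices for this demo, not part of any standard API):

```python
import numpy as np

# Two-term Hanson-Wright exponent, with K = 1 for illustration:
# P(|X^T A X - E[X^T A X]| >= t) <= 2 exp(-c * min(t^2/||A||_F^2, t/||A||_op))
A = np.diag([3.0, 1.0, 1.0, 1.0])        # example matrix
fro = np.linalg.norm(A, "fro")           # ||A||_F = sqrt(12)
op = np.linalg.norm(A, 2)                # ||A||_op = 3

def hw_exponent(t):
    """The min(...) inside the exponential, up to the universal constant c."""
    return min(t**2 / fro**2, t / op)

# The two terms are equal at the crossover t* = ||A||_F^2 / ||A||_op = 12/3 = 4.
t_star = fro**2 / op
print("crossover t*:", t_star)
print("t=1 (Gaussian regime):", hw_exponent(1.0))  # quadratic term is the min
print("t=6 (extreme regime):", hw_exponent(6.0))   # linear term is the min
```

Below the crossover the quadratic (Frobenius) term is active; above it, the linear (operator-norm) term takes over.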

Formal Setup and Notation

Let X=(X1,,Xn)RnX = (X_1, \ldots, X_n) \in \mathbb{R}^n be a random vector with independent, centered, sub-Gaussian entries: E[Xi]=0\mathbb{E}[X_i] = 0 and Xiψ2K\|X_i\|_{\psi_2} \leq K for all ii.

Let ARn×nA \in \mathbb{R}^{n \times n} be a fixed (deterministic) matrix.

Definition

Quadratic Form

The quadratic form associated with matrix AA and random vector XX is:

XAX=i,j=1nAijXiXjX^\top A X = \sum_{i,j=1}^n A_{ij} X_i X_j

Its expectation is E[XAX]=iAiiE[Xi2]=tr(Adiag(E[Xi2]))\mathbb{E}[X^\top A X] = \sum_i A_{ii} \mathbb{E}[X_i^2] = \text{tr}(A \cdot \text{diag}(\mathbb{E}[X_i^2])). For isotropic XX (E[Xi2]=1\mathbb{E}[X_i^2] = 1), this simplifies to tr(A)\text{tr}(A).
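The expectation formula can be checked by simulation. A small numpy sketch, using standard Gaussian entries as the isotropic example (so E[XAX]=tr(A)\mathbb{E}[X^\top A X] = \text{tr}(A)):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
A = rng.standard_normal((n, n))          # arbitrary fixed matrix

# Monte Carlo estimate of E[X^T A X] for isotropic X (standard normal entries)
X = rng.standard_normal((200_000, n))
quad = ((X @ A) * X).sum(axis=1)         # batch of quadratic forms X^T A X

print(quad.mean(), np.trace(A))          # the two should agree closely
```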

Definition

Frobenius Norm

The Frobenius norm of AA is AF=i,jAij2=tr(AA)\|A\|_F = \sqrt{\sum_{i,j} A_{ij}^2} = \sqrt{\text{tr}(A^\top A)}. It measures the total "mass" of AA across all entries. In the Hanson-Wright bound, AF\|A\|_F controls the variance of the quadratic form: the Gaussian-regime fluctuations scale like AF\|A\|_F.

Definition

Operator Norm

The operator norm is Aop=supv=1Av\|A\|_{\text{op}} = \sup_{\|v\|=1} \|Av\|. It measures the maximum directional stretch. In Hanson-Wright, Aop\|A\|_{\text{op}} controls the extreme-regime tail: large deviations are governed by the single largest singular value of AA.

Core Relationship Between Norms

For any n×nn \times n matrix AA:

AopAFnAop\|A\|_{\text{op}} \leq \|A\|_F \leq \sqrt{n} \|A\|_{\text{op}}

The gap between the two norms measures how "spread out" the matrix is. If AA has rank 1, AF=Aop\|A\|_F = \|A\|_{\text{op}} and the two terms in Hanson-Wright are comparable. If A=InA = I_n (identity), AF=n\|A\|_F = \sqrt{n} while Aop=1\|A\|_{\text{op}} = 1, and the Frobenius term dominates for all reasonable deviations.
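A quick numerical check of the norm sandwich on the three cases just discussed (identity, rank one, and a generic matrix), as a minimal numpy sketch:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50

cases = [
    np.eye(n),                                              # identity: F = sqrt(n) * op
    np.outer(rng.standard_normal(n), rng.standard_normal(n)),  # rank 1: F = op
    rng.standard_normal((n, n)),                            # generic matrix
]
for A in cases:
    op = np.linalg.norm(A, 2)            # operator (spectral) norm
    fro = np.linalg.norm(A, "fro")       # Frobenius norm
    # ||A||_op <= ||A||_F <= sqrt(n) ||A||_op
    assert op <= fro + 1e-9 and fro <= np.sqrt(n) * op + 1e-9
```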

Main Theorems

Theorem

Hanson-Wright Inequality

Statement

Let X=(X1,,Xn)X = (X_1, \ldots, X_n) have independent, centered, sub-Gaussian components with Xiψ2K\|X_i\|_{\psi_2} \leq K. For any n×nn \times n matrix AA and any t>0t > 0:

P ⁣(XAXE[XAX]t)2exp ⁣(cmin ⁣(t2K4AF2,  tK2Aop))\mathbb{P}\!\bigl(|X^\top A X - \mathbb{E}[X^\top A X]| \geq t\bigr) \leq 2\exp\!\Bigl(-c \min\!\Bigl(\frac{t^2}{K^4 \|A\|_F^2},\; \frac{t}{K^2 \|A\|_{\text{op}}}\Bigr)\Bigr)

where c>0c > 0 is a universal constant.

Intuition

The bound has two regimes:

Small deviations (tK2AF2/Aopt \lesssim K^2 \|A\|_F^2 / \|A\|_{\text{op}}): The t2/AF2t^2 / \|A\|_F^2 term is smaller, so the bound is exp(ct2/(K4AF2))\exp(-ct^2/(K^4\|A\|_F^2)). This is a sub-Gaussian tail. It arises because for small deviations, the quadratic form behaves like a sum of many weakly correlated terms, and the CLT-like behavior dominates.

Large deviations (tK2AF2/Aopt \gtrsim K^2 \|A\|_F^2 / \|A\|_{\text{op}}): The t/Aopt / \|A\|_{\text{op}} term is smaller, so the bound is exp(ct/(K2Aop))\exp(-ct/(K^2\|A\|_{\text{op}})). This is a sub-exponential tail. It arises because extreme deviations are driven by the largest eigenvalue direction of AA, where the quadratic form behaves like λmax(A)Xv2\lambda_{\max}(A) \cdot X_v^2 for a single sub-Gaussian variable Xv=vXX_v = v^\top X.

The crossover at t=K2AF2/Aopt^* = K^2 \|A\|_F^2 / \|A\|_{\text{op}} is where the two terms are equal. Below tt^*, Gaussian behavior; above tt^*, exponential behavior.

Proof Sketch

The proof proceeds in three steps:

Step 1 (Decoupling). Replace XAXX^\top A X with the decoupled form XAYX^\top A Y, where YY is an independent copy of XX. The decoupling inequality states that for symmetric AA with zero diagonal (the diagonal terms are handled separately):

P(XAXE[XAX]t)CP(XAYct)\mathbb{P}(|X^\top A X - \mathbb{E}[X^\top A X]| \geq t) \leq C \cdot \mathbb{P}(|X^\top A Y| \geq ct)

This reduces the problem from a quadratic form (products XiXjX_i X_j) to a bilinear form (products XiYjX_i Y_j), which is easier to handle because XX and YY are independent.

Step 2 (Condition and apply Hoeffding). Condition on YY. Then XAY=iXi(AY)iX^\top A Y = \sum_i X_i (AY)_i is a sum of independent sub-Gaussian variables with variance proxy controlled by AY2\|AY\|^2. Apply sub-Gaussian concentration to get a bound in terms of AY\|AY\|.

Step 3 (Control AY\|AY\|). Use the bound AYAopY\|AY\| \leq \|A\|_{\text{op}} \|Y\| for the operator-norm term and E[AY2]=AF2\mathbb{E}[\|AY\|^2] = \|A\|_F^2 (when YY is isotropic) for the Frobenius term. Combine using a case split on whether Y\|Y\| is typical or large.
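The decoupling step can be illustrated numerically: with the diagonal zeroed out, both the coupled form XAXX^\top A X and the decoupled form XAYX^\top A Y are mean zero, and their tails are comparable. A minimal simulation sketch (Gaussian entries as the sub-Gaussian example; the matrix and threshold are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 30
A = rng.standard_normal((n, n))
A = (A + A.T) / 2                        # symmetrize
np.fill_diagonal(A, 0.0)                 # decoupling needs zero diagonal

N = 100_000
X = rng.standard_normal((N, n))
Y = rng.standard_normal((N, n))          # independent copy of X

coupled = ((X @ A) * X).sum(axis=1)      # X^T A X (mean zero: zero diagonal)
decoupled = ((X @ A) * Y).sum(axis=1)    # X^T A Y (bilinear, decoupled)

t = 3 * np.linalg.norm(A, "fro")         # a few Frobenius-norm units
p_coupled = (np.abs(coupled) >= t).mean()
p_decoupled = (np.abs(decoupled) >= t).mean()
print(p_coupled, p_decoupled)            # both tail probabilities are small
```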

Why It Matters

The Hanson-Wright inequality is essential whenever you work with second-order statistics of random vectors:

  • Chi-squared concentration: X2=XIX\|X\|^2 = X^\top I X, so A=IA = I and the bound gives P(X2nt)2exp(cmin(t2/n,t))\mathbb{P}(|\|X\|^2 - n| \geq t) \leq 2\exp(-c\min(t^2/n, t)). The Frobenius regime (t2/nt^2/n) dominates for tnt \leq n; the operator regime (tt) dominates for tnt \geq n.

  • Random kernel evaluations: inner products xyx^\top y for random x,yx, y are quadratic in the joint vector (x,y)(x, y).

  • Covariance estimation: off-diagonal entries of Σ^\hat{\Sigma} involve terms like 1nkXk(i)Xk(j)\frac{1}{n}\sum_k X_k^{(i)} X_k^{(j)}, which are quadratic forms.

  • Second-order chaos: any polynomial of degree 2 in sub-Gaussian variables.

Failure Mode

The constant cc in the inequality is universal but unspecified; for precise numerical bounds in applications, you may need to track it through the proof. Also, Hanson-Wright requires independent entries; for dependent sub-Gaussian vectors, you need modified versions (e.g., for vectors with sub-Gaussian norm bounds but dependent entries, the inequality may still hold but with KK replaced by the sub-Gaussian norm of the entire vector).

Lemma

Decoupling Inequality for Quadratic Forms

Statement

Let XX have independent centered entries and let AA be a symmetric matrix with Aii=0A_{ii} = 0. Let YY be an independent copy of XX. Then for all convex, increasing functions Φ\Phi on [0,)[0, \infty):

E ⁣[Φ ⁣(XAX)]E ⁣[Φ ⁣(CXAY)]\mathbb{E}\!\Bigl[\Phi\!\bigl(|X^\top A X|\bigr)\Bigr] \leq \mathbb{E}\!\Bigl[\Phi\!\bigl(C \cdot |X^\top A Y|\bigr)\Bigr]

where CC is a universal constant.

Intuition

Decoupling replaces the "entangled" quadratic form ijAijXiXj\sum_{i \neq j} A_{ij} X_i X_j (where each XiX_i appears in multiple terms) with the "decoupled" bilinear form i,jAijXiYj\sum_{i,j} A_{ij} X_i Y_j (where each XiX_i and YjY_j appear in separate roles). The decoupled version is easier to analyze because once you condition on YY, the sum is linear in XX, and all the standard sub-Gaussian tools apply.

Proof Sketch

The proof uses a symmetrization-style argument. Introduce the decoupled form and use the independence of XX and YY together with the symmetry of AA to show that the tails of the coupled form are controlled by those of the decoupled form. The universal constant CC absorbs a factor from the randomization step.

Why It Matters

Decoupling is the key technical device that makes Hanson-Wright provable. It converts a hard problem (concentrating a quadratic form) into an easier one (concentrating a bilinear form, which is linear once you condition on one factor). This technique appears throughout the theory of U-statistics and chaos processes.

Failure Mode

Decoupling requires the diagonal of AA to be zero (or handled separately). The diagonal terms AiiXi2A_{ii} X_i^2 are not quadratic in the same sense: they are sums of independent sub-exponential variables and are controlled separately using standard sub-exponential concentration.

Two Regimes Explained

The Hanson-Wright bound can be rewritten in high-probability form. With probability at least 1δ1 - \delta:

XAXE[XAX]K2 ⁣(AFlog(2/δ)+Aoplog(2/δ))|X^\top A X - \mathbb{E}[X^\top A X]| \lesssim K^2\!\left(\|A\|_F \sqrt{\log(2/\delta)} + \|A\|_{\text{op}} \log(2/\delta)\right)

The two terms correspond to the two norms:

  • Gaussian regime (tt small): tail ect2e^{-ct^2}, dominated by AF\|A\|_F; example: chi-squared with tnt \ll n
  • Extreme regime (tt large): tail ecte^{-ct}, dominated by Aop\|A\|_{\text{op}}; example: chi-squared with tnt \gg n
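The high-probability form above follows by inverting the tail bound; a short derivation sketch (the universal constant cc is absorbed into \lesssim):

```latex
% Set the tail bound equal to \delta and solve for t:
2\exp\Bigl(-c\min\Bigl(\tfrac{t^2}{K^4\|A\|_F^2},\,\tfrac{t}{K^2\|A\|_{\mathrm{op}}}\Bigr)\Bigr) = \delta
\;\Longleftrightarrow\;
\min\Bigl(\tfrac{t^2}{K^4\|A\|_F^2},\,\tfrac{t}{K^2\|A\|_{\mathrm{op}}}\Bigr) = \tfrac{1}{c}\log\tfrac{2}{\delta}.
% Solving each branch for t and taking the larger of the two gives
t \;\lesssim\; K^2\Bigl(\|A\|_F\sqrt{\log(2/\delta)} + \|A\|_{\mathrm{op}}\log(2/\delta)\Bigr).
```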

Canonical Examples

Example

Chi-squared concentration

Let XN(0,In)X \sim \mathcal{N}(0, I_n) and A=InA = I_n. Then XAX=X2χn2X^\top A X = \|X\|^2 \sim \chi^2_n with E[X2]=n\mathbb{E}[\|X\|^2] = n. The norms are IF=n\|I\|_F = \sqrt{n} and Iop=1\|I\|_{\text{op}} = 1. Hanson-Wright gives:

P(X2nt)2exp ⁣(cmin(t2/n,  t))\mathbb{P}(|\|X\|^2 - n| \geq t) \leq 2\exp\!\bigl(-c\min(t^2/n,\; t)\bigr)

For t=ϵnt = \epsilon n (relative deviation ϵ\epsilon): the bound is exp(cϵ2n)\exp(-c\epsilon^2 n) when ϵ1\epsilon \leq 1 (Gaussian regime) and exp(cϵn)\exp(-c\epsilon n) when ϵ1\epsilon \geq 1 (extreme regime). This matches the known chi-squared tail behavior.
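A Monte Carlo check of the chi-squared case, sketched in numpy: fluctuations of X2\|X\|^2 around nn are of order n\sqrt{n} (since Var(χn2)=2n\text{Var}(\chi^2_n) = 2n), and deviations several standard deviations out are rare, consistent with the Gaussian-regime bound:

```python
import numpy as np

rng = np.random.default_rng(3)
n, N = 100, 50_000
X = rng.standard_normal((N, n))
dev = np.abs((X**2).sum(axis=1) - n)     # | ||X||^2 - n |, chi-squared deviation

# Typical deviation is O(sqrt(n)): Var(chi^2_n) = 2n
print("mean deviation / sqrt(n):", dev.mean() / np.sqrt(n))

t = 4 * np.sqrt(2 * n)                   # four standard deviations out
print("tail fraction beyond t:", (dev >= t).mean())
```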

Example

Random kernel evaluation

Let x,yRdx, y \in \mathbb{R}^d be independent vectors with i.i.d. sub-Gaussian entries of norm KK. The inner product xy=ZAZx^\top y = Z^\top A Z where Z=(x,y)R2dZ = (x, y) \in \mathbb{R}^{2d} and AA is a 2d×2d2d \times 2d block matrix with off-diagonal blocks Id/2I_d/2 and zero diagonal blocks.

Then AF=d/2\|A\|_F = \sqrt{d}/\sqrt{2} and Aop=1/2\|A\|_{\text{op}} = 1/2. Hanson-Wright gives:

P(xyt)2exp ⁣(cmin(t2/d,  t))\mathbb{P}(|x^\top y| \geq t) \leq 2\exp\!\bigl(-c\min(t^2/d,\; t)\bigr)

So random inner products in Rd\mathbb{R}^d concentrate around 0 with fluctuations of order d\sqrt{d}.
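A simulation sketch of this conclusion (Gaussian entries as the sub-Gaussian example): the inner product xyx^\top y has mean 0 and variance exactly dd, so its standard deviation is d\sqrt{d}:

```python
import numpy as np

rng = np.random.default_rng(4)
d, N = 200, 50_000
x = rng.standard_normal((N, d))
y = rng.standard_normal((N, d))          # independent of x
ip = (x * y).sum(axis=1)                 # x^T y for each sample pair

# Each coordinate product has variance 1, so Var(x^T y) = d.
print("std / sqrt(d):", ip.std() / np.sqrt(d))   # should be close to 1
```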

Example

Quadratic form with rank-r matrix

If AA has rank rr with eigenvalues λ1λr>0\lambda_1 \geq \cdots \geq \lambda_r > 0, then AF=iλi2\|A\|_F = \sqrt{\sum_i \lambda_i^2} and Aop=λ1\|A\|_{\text{op}} = \lambda_1.

The crossover point is tiλi2/λ1t^* \approx \sum_i \lambda_i^2 / \lambda_1. If AA is rank-1 (A=λvvA = \lambda vv^\top), then AF=λ=Aop\|A\|_F = |\lambda| = \|A\|_{\text{op}} and the two regimes are identical: ect2/λ2e^{-ct^2/\lambda^2} transitions directly to ect/λe^{-ct/|\lambda|} at tλt \approx |\lambda|.
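A numeric check of the rank-rr formulas, as a minimal sketch: build a rank-3 matrix with a chosen spectrum (an illustrative example) and confirm the norms and crossover point:

```python
import numpy as np

# Rank-r matrix with eigenvalues lam: ||A||_F = sqrt(sum lam_i^2), ||A||_op = max lam_i
lam = np.array([4.0, 2.0, 1.0])          # example spectrum, r = 3
rng = np.random.default_rng(5)
Q, _ = np.linalg.qr(rng.standard_normal((10, 3)))   # orthonormal columns
A = Q @ np.diag(lam) @ Q.T               # rank-3 PSD matrix in R^{10 x 10}

fro = np.linalg.norm(A, "fro")
op = np.linalg.norm(A, 2)
# Crossover t* = sum(lam^2) / max(lam) = 21 / 4
print("t* =", fro**2 / op)
```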

Common Confusions

Watch Out

Hanson-Wright is not just Hoeffding applied to products

A naive approach would be to note that each term AijXiXjA_{ij} X_i X_j is sub-exponential (product of two sub-Gaussians) and apply a sub-exponential concentration bound. This gives a bound in terms of ijAij2=AF2\sum_{ij} A_{ij}^2 = \|A\|_F^2, which captures the Frobenius regime but misses the tighter operator-norm regime for large deviations. Hanson-Wright is strictly stronger because it also captures the Aop\|A\|_{\text{op}} term through the decoupling argument.

Watch Out

The matrix A need not be symmetric or positive semidefinite

The Hanson-Wright inequality applies to any matrix AA, not just symmetric or PSD ones. For a general AA, XAX=X((A+A)/2)XX^\top A X = X^\top ((A + A^\top)/2) X because XBX=0X^\top B X = 0 for any antisymmetric BB (since XBXX^\top B X is a scalar that equals its own negative). So you can always reduce to the symmetric part.
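A one-line numerical check of this reduction: the antisymmetric part of AA contributes nothing to the quadratic form, so XAXX^\top A X matches the form of the symmetric part exactly:

```python
import numpy as np

rng = np.random.default_rng(6)
n = 8
A = rng.standard_normal((n, n))          # general, non-symmetric matrix
S = (A + A.T) / 2                        # symmetric part of A
x = rng.standard_normal(n)

# x^T A x = x^T S x: the antisymmetric part (A - A^T)/2 contributes zero
print(x @ A @ x, x @ S @ x)
```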

Watch Out

Sub-Gaussian entries, not sub-Gaussian vectors

Hanson-Wright requires the entries of XX to be independent sub-Gaussian, not just the vector XX to have sub-Gaussian norm. For random vectors with dependent entries (like uniform on the sphere), the standard Hanson-Wright does not apply directly. Modified versions exist but require different techniques (e.g., transportation-cost arguments).

Summary

  • Hanson-Wright controls XAXE[XAX]|X^\top A X - \mathbb{E}[X^\top A X]| for sub-Gaussian XX
  • Two-term bound: exp(cmin(t2/AF2,t/Aop))\exp(-c\min(t^2/\|A\|_F^2, t/\|A\|_{\text{op}}))
  • Frobenius norm AF\|A\|_F controls the Gaussian (small deviation) regime
  • Operator norm Aop\|A\|_{\text{op}} controls the extreme (large deviation) regime
  • Crossover at t=AF2/Aopt^* = \|A\|_F^2 / \|A\|_{\text{op}}
  • Decoupling is the key proof technique: replace XAXX^\top A X with XAYX^\top A Y using an independent copy YY
  • For A=IA = I: recovers chi-squared concentration
  • Applies to random kernel evaluations, covariance estimation, second-order chaos

Exercises

ExerciseCore

Problem

Let XRnX \in \mathbb{R}^n have i.i.d. N(0,1)\mathcal{N}(0, 1) entries and let A=1n(11I)A = \frac{1}{n}(11^\top - I) where 11 is the all-ones vector. The quadratic form XAX=1n(iXi)21niXi2X^\top A X = \frac{1}{n}(\sum_i X_i)^2 - \frac{1}{n}\sum_i X_i^2. Compute AF\|A\|_F and Aop\|A\|_{\text{op}}, and use Hanson-Wright to bound the deviation of XAXX^\top A X from its expectation.

ExerciseAdvanced

Problem

Let XRnX \in \mathbb{R}^n have i.i.d. sub-Gaussian entries with parameter KK and let PP be the orthogonal projection onto a kk-dimensional subspace. Use Hanson-Wright to show that PX2=XPX\|PX\|^2 = X^\top P X concentrates around kk with sub-Gaussian fluctuations of order k\sqrt{k}.

ExerciseResearch

Problem

The Hanson-Wright inequality gives a bound of order AFlog(1/δ)\|A\|_F\sqrt{\log(1/\delta)} in the Gaussian regime. Show that this is tight (up to constants) by computing the variance of XAXX^\top A X when XN(0,I)X \sim \mathcal{N}(0, I) and verifying that Var(XAX)=2AF2\text{Var}(X^\top A X) = 2\|A\|_F^2 for symmetric AA.

References

Canonical:

  • Hanson & Wright, "A Bound on Tail Probabilities for Quadratic Forms in Independent Random Variables" (1971)
  • Rudelson & Vershynin, "Hanson-Wright Inequality and Sub-Gaussian Concentration" (2013)
  • Vershynin, High-Dimensional Probability (2018), Chapter 6

Current:

  • Wainwright, High-Dimensional Statistics (2019), Chapter 6
  • Adamczak, "A Note on the Hanson-Wright Inequality for Random Vectors with Dependencies" (2015)
  • Boucheron, Lugosi, Massart, Concentration Inequalities (2013), Chapters 2-6

Last reviewed: April 2026