Foundations
Matrix Norms
Frobenius, spectral, and nuclear norms for matrices. Submultiplicativity. When and why each norm appears in ML theory.
Prerequisites
Basic matrix operations and the singular value decomposition (SVD).
Why This Matters
Bounding the norm of a weight matrix is central to generalization theory (spectral norm bounds), low-rank approximation (nuclear norm), and optimization analysis (operator norms control gradient magnitudes). Different norms capture different structural properties of matrices.
Core Definitions
Matrix Norm (General)
A matrix norm on $\mathbb{R}^{m \times n}$ is a function $\|\cdot\| \colon \mathbb{R}^{m \times n} \to \mathbb{R}$ satisfying: (1) $\|A\| \ge 0$, with $\|A\| = 0$ if and only if $A = 0$, (2) $\|\alpha A\| = |\alpha|\,\|A\|$ for all scalars $\alpha$, (3) $\|A + B\| \le \|A\| + \|B\|$ (triangle inequality).
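As a quick sanity check, here is a minimal NumPy sketch (illustrative, not part of the formal development) that verifies the three axioms for the Frobenius norm on random matrices; the same checks pass for the spectral and nuclear norms.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 3))
B = rng.standard_normal((4, 3))
alpha = -2.5

norm = lambda M: np.linalg.norm(M, "fro")

# (1) nonnegativity, zero only at the zero matrix
assert norm(A) > 0 and np.isclose(norm(np.zeros((4, 3))), 0.0)
# (2) absolute homogeneity
assert np.isclose(norm(alpha * A), abs(alpha) * norm(A))
# (3) triangle inequality
assert norm(A + B) <= norm(A) + norm(B) + 1e-12
```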
Frobenius Norm
The Frobenius norm is the entrywise $\ell_2$ norm:
$$\|A\|_F = \sqrt{\sum_{i,j} a_{ij}^2} = \sqrt{\operatorname{tr}(A^\top A)} = \sqrt{\sum_{i=1}^{r} \sigma_i^2},$$
where $\sigma_1 \ge \cdots \ge \sigma_r > 0$ are the singular values of $A$.
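The three formulas agree, and it is easy to check them against each other numerically; a small NumPy sketch:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((5, 3))

entrywise = np.sqrt((A ** 2).sum())                                  # entrywise l2
trace_form = np.sqrt(np.trace(A.T @ A))                              # trace formula
sv_form = np.sqrt((np.linalg.svd(A, compute_uv=False) ** 2).sum())   # singular values

assert np.allclose([entrywise, trace_form, sv_form], np.linalg.norm(A, "fro"))
```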
Spectral Norm (Operator Norm)
The spectral norm is the largest singular value:
$$\|A\|_2 = \max_{x \neq 0} \frac{\|Ax\|_2}{\|x\|_2} = \sigma_1(A).$$
This is the operator norm induced by the Euclidean vector norm. Here $\sigma_1(A)$ denotes the largest singular value of $A$.
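A sketch of the operator-norm characterization in NumPy: random unit vectors only lower-bound the maximum stretch, while the top singular value attains it (`np.linalg.norm(A, 2)` computes exactly $\sigma_1$).

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((4, 4))
sigma_1 = np.linalg.svd(A, compute_uv=False)[0]

# np.linalg.norm(A, 2) is exactly the top singular value.
assert np.isclose(np.linalg.norm(A, 2), sigma_1)

# Random directions only ever lower-bound the maximum stretch.
x = rng.standard_normal((1000, 4))
stretch = np.linalg.norm(x @ A.T, axis=1) / np.linalg.norm(x, axis=1)
assert stretch.max() <= sigma_1 + 1e-12
```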
Nuclear Norm (Trace Norm)
The nuclear norm is the sum of singular values:
$$\|A\|_* = \sum_{i=1}^{r} \sigma_i(A).$$
It is the convex envelope of the rank function on the unit spectral norm ball, making it the tightest convex relaxation of rank minimization.
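A numerical illustration of the envelope property: on the unit spectral ball, the nuclear norm never exceeds the rank (a sketch using NumPy's SVD).

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((5, 5))
A /= np.linalg.norm(A, 2)  # rescale so ||A||_2 = 1

nuclear = np.linalg.svd(A, compute_uv=False).sum()
# On the unit spectral ball the nuclear norm lower-bounds the rank.
assert nuclear <= np.linalg.matrix_rank(A) + 1e-10
```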
Norm Relationships
For $A \in \mathbb{R}^{m \times n}$ with rank $r$:
$$\|A\|_2 \le \|A\|_F \le \|A\|_* \le \sqrt{r}\,\|A\|_F \le r\,\|A\|_2.$$
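The whole chain can be spot-checked on a random matrix; a short NumPy sketch:

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.standard_normal((6, 4))
s = np.linalg.svd(A, compute_uv=False)
r = np.linalg.matrix_rank(A)

spec, frob, nuc = s[0], np.sqrt((s ** 2).sum()), s.sum()
chain = [spec, frob, nuc, np.sqrt(r) * frob, r * spec]
# Each consecutive pair respects the chain of inequalities above.
assert all(a <= b + 1e-10 for a, b in zip(chain, chain[1:]))
```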
Main Theorems
Submultiplicativity of Matrix Norms
Statement
The Frobenius and spectral norms are submultiplicative: for compatible matrices $A$ and $B$,
$$\|AB\|_2 \le \|A\|_2 \|B\|_2 \quad \text{and} \quad \|AB\|_F \le \|A\|_F \|B\|_F.$$
Intuition
Composing two linear maps cannot amplify more than the product of their individual amplification factors. For the spectral norm this is immediate: the maximum stretch of $AB$ cannot exceed the maximum stretch of $A$ times the maximum stretch of $B$.
Proof Sketch
For the spectral norm: $\|ABx\|_2 \le \|A\|_2 \|Bx\|_2 \le \|A\|_2 \|B\|_2 \|x\|_2$ for all $x$, so $\|AB\|_2 \le \|A\|_2 \|B\|_2$. For Frobenius: write $AB$ column by column, so the $j$-th column is $Ab_j$; Cauchy-Schwarz on each entry gives $\|Ab_j\|_2^2 \le \|A\|_F^2 \|b_j\|_2^2$, and summing over columns yields $\|AB\|_F^2 \le \|A\|_F^2 \|B\|_F^2$.
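Both inequalities are easy to spot-check numerically (a sanity check, not a proof):

```python
import numpy as np

rng = np.random.default_rng(5)
A = rng.standard_normal((4, 5))
B = rng.standard_normal((5, 3))

for ord_ in (2, "fro"):  # spectral and Frobenius norms
    lhs = np.linalg.norm(A @ B, ord_)
    rhs = np.linalg.norm(A, ord_) * np.linalg.norm(B, ord_)
    assert lhs <= rhs + 1e-12
```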
Why It Matters
Submultiplicativity is used constantly in deep learning theory. Bounding $\|W_L \cdots W_1\|_2$ by $\prod_{\ell=1}^{L} \|W_\ell\|_2$ controls how signals and gradients propagate through layers. Spectral norm regularization exploits this directly.
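In practice, spectral norm regularizers estimate $\|W\|_2$ with a few power-iteration steps rather than a full SVD. A minimal sketch; the function name and iteration count are illustrative choices, not a reference implementation:

```python
import numpy as np

def spectral_norm_estimate(W, n_iters=100):
    """Estimate ||W||_2 by power iteration on W (n_iters >= 1 assumed)."""
    rng = np.random.default_rng(0)
    v = rng.standard_normal(W.shape[1])
    v /= np.linalg.norm(v)
    for _ in range(n_iters):
        u = W @ v                # left-singular direction
        u /= np.linalg.norm(u)
        v = W.T @ u              # right-singular direction
        v /= np.linalg.norm(v)
    return float(u @ (W @ v))    # converges to sigma_1(W)

W = np.random.default_rng(6).standard_normal((64, 32))
assert np.isclose(spectral_norm_estimate(W), np.linalg.norm(W, 2), rtol=1e-4)
```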
Failure Mode
The nuclear norm is in fact submultiplicative: $\|AB\|_* \le \|A\|_* \|B\|_*$ holds, via the sharper mixed bound $\|AB\|_* \le \|A\|_2 \|B\|_*$ together with $\|A\|_2 \le \|A\|_*$. What fails is the purely spectral bound: $\|AB\|_* \le \|A\|_2 \|B\|_2$ does not hold in general. For $A = B = I_n$, the left side is $n$ while the right side is $1$.
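Both claims are easy to check numerically (a sketch; `nuc` is a hypothetical helper name):

```python
import numpy as np

nuc = lambda M: np.linalg.svd(M, compute_uv=False).sum()

rng = np.random.default_rng(7)
A, B = rng.standard_normal((4, 4)), rng.standard_normal((4, 4))
assert nuc(A @ B) <= nuc(A) * nuc(B) + 1e-10  # submultiplicativity holds

I = np.eye(5)
print(nuc(I @ I), np.linalg.norm(I, 2) ** 2)  # 5.0 vs 1.0: spectral bound fails
```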
When to Use Each Norm
| Norm | Use case | Why |
|---|---|---|
| Frobenius | Weight decay, matrix factorization | Differentiable, easy to compute, equals the $\ell_2$ penalty on parameters |
| Spectral | Generalization bounds, Lipschitz constraints | Controls worst-case amplification of a layer |
| Nuclear | Low-rank matrix completion, robust PCA | Convex relaxation of rank |
Common Confusions
Frobenius norm is not an operator norm
The Frobenius norm cannot be written as $\max_{x \neq 0} \|Ax\| / \|x\|$ for any vector norm: every induced norm satisfies $\|I_n\| = 1$, whereas $\|I_n\|_F = \sqrt{n}$. It treats the matrix as a vector in $\mathbb{R}^{mn}$ and takes the $\ell_2$ norm. The spectral norm is the operator norm induced by the $\ell_2$ vector norm.
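The identity matrix makes this concrete in two lines of NumPy:

```python
import numpy as np

n = 9
I = np.eye(n)
# An induced norm always gives ||I|| = 1; Frobenius gives sqrt(n).
print(np.linalg.norm(I, 2), np.linalg.norm(I, "fro"))  # 1.0 vs 3.0
```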
Spectral norm vs spectral radius
The spectral norm $\|A\|_2 = \sigma_1(A)$ uses singular values. The spectral radius $\rho(A) = \max_i |\lambda_i(A)|$ uses eigenvalues. For symmetric matrices they coincide, but in general $\rho(A) \le \|A\|_2$, with possible strict inequality: the nilpotent matrix $\begin{pmatrix} 0 & 1 \\ 0 & 0 \end{pmatrix}$ has spectral radius $0$ but spectral norm $1$.
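The standard $2 \times 2$ nilpotent example, checked numerically:

```python
import numpy as np

A = np.array([[0.0, 1.0],
              [0.0, 0.0]])  # nilpotent: both eigenvalues are 0
rho = np.abs(np.linalg.eigvals(A)).max()
print(rho, np.linalg.norm(A, 2))  # 0.0 vs 1.0
```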
Exercises
Problem
Compute $\|A\|_F$, $\|A\|_2$, and $\|A\|_*$ for the matrix $A$.
Problem
Let $W_1, \dots, W_L$ be weight matrices in a neural network. Show that $\|W_L W_{L-1} \cdots W_1\|_2 \le \prod_{\ell=1}^{L} \|W_\ell\|_2$. What does this imply about gradient norms during backpropagation?
References
Canonical:
- Horn & Johnson, Matrix Analysis (2013), Chapter 5
- Golub & Van Loan, Matrix Computations (2013), Chapter 2
- Trefethen & Bau, Numerical Linear Algebra (1997), Lectures 3-5 (norms and SVD)
For ML context:
- Neyshabur, Bhojanapalli, Srebro, "A PAC-Bayesian Approach to Spectrally-Normalized Margin Bounds for Neural Networks" (2018), ICLR
- Vershynin, High-Dimensional Probability (2018), Section 4.4 (operator norm of random matrices)
- Recht, Fazel, Parrilo, "Guaranteed Minimum-Rank Solutions of Linear Matrix Equations via Nuclear Norm Minimization" (2010), SIAM Review
Last reviewed: April 2026