
Foundations

Matrix Norms

Frobenius, spectral, and nuclear norms for matrices. Submultiplicativity. When and why each norm appears in ML theory.


Why This Matters

Bounding the norm of a weight matrix is central to generalization theory (spectral norm bounds), low-rank approximation (nuclear norm), and optimization analysis (operator norms control gradient magnitudes). Different norms capture different structural properties of matrices. Understanding norms requires familiarity with basic matrix operations.

Core Definitions

Definition

Matrix Norm (General)

A matrix norm on $\mathbb{R}^{m \times n}$ is a function $\|\cdot\|: \mathbb{R}^{m \times n} \to [0, \infty)$ satisfying: (1) $\|A\| = 0 \iff A = 0$ (definiteness), (2) $\|\alpha A\| = |\alpha| \, \|A\|$ (absolute homogeneity), (3) $\|A + B\| \leq \|A\| + \|B\|$ (triangle inequality).
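As a quick sanity check (a sketch, not a proof), the three axioms can be verified numerically for the Frobenius norm, which is NumPy's default matrix norm:

```python
import numpy as np

# Spot-check the three norm axioms for the Frobenius norm
# (np.linalg.norm's default for 2-D arrays) on random matrices.
rng = np.random.default_rng(0)
A = rng.standard_normal((4, 3))
B = rng.standard_normal((4, 3))
alpha = -2.5

# (1) definiteness: zero matrix has norm 0, a nonzero matrix does not
zero_iff = np.linalg.norm(np.zeros((4, 3))) == 0.0 and np.linalg.norm(A) > 0
# (2) absolute homogeneity
homogeneous = np.isclose(np.linalg.norm(alpha * A), abs(alpha) * np.linalg.norm(A))
# (3) triangle inequality (tolerance for floating point)
triangle = np.linalg.norm(A + B) <= np.linalg.norm(A) + np.linalg.norm(B) + 1e-12
```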

Definition

Frobenius Norm

The Frobenius norm is the entrywise $\ell_2$ norm:

$$\|A\|_F = \sqrt{\sum_{i,j} a_{ij}^2} = \sqrt{\operatorname{tr}(A^T A)} = \sqrt{\sum_{i=1}^{\min(m,n)} \sigma_i^2}$$

where $\sigma_i$ are the singular values of $A$.
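The three expressions above are equivalent, which a minimal NumPy sketch can confirm:

```python
import numpy as np

# The three Frobenius-norm formulas agree on a random matrix.
rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))

entrywise = np.sqrt((A ** 2).sum())            # sqrt of sum of squared entries
via_trace = np.sqrt(np.trace(A.T @ A))         # sqrt(tr(A^T A))
via_svd = np.sqrt((np.linalg.svd(A, compute_uv=False) ** 2).sum())  # sqrt(sum of sigma_i^2)
```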

Definition

Spectral Norm (Operator Norm)

The spectral norm is the largest singular value:

$$\|A\|_2 = \sigma_{\max}(A) = \sup_{x \neq 0} \frac{\|Ax\|_2}{\|x\|_2}$$

This is the operator norm induced by the vector $\ell_2$ norm: the largest factor by which $A$ can stretch any vector.
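A short NumPy sketch illustrating both characterizations: the spectral norm equals the top singular value, and it upper-bounds the stretch $\|Ax\|_2 / \|x\|_2$ for any particular $x$:

```python
import numpy as np

# Spectral norm = largest singular value = worst-case stretch factor.
rng = np.random.default_rng(1)
A = rng.standard_normal((4, 3))

sigma_max = np.linalg.svd(A, compute_uv=False)[0]  # singular values are sorted descending
spec = np.linalg.norm(A, 2)                        # ord=2 gives the spectral norm

# The stretch of any one vector never exceeds the spectral norm.
x = rng.standard_normal(3)
stretch = np.linalg.norm(A @ x) / np.linalg.norm(x)
```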

Definition

Nuclear Norm (Trace Norm)

The nuclear norm is the sum of singular values:

$$\|A\|_* = \sum_{i=1}^{\min(m,n)} \sigma_i$$

On the unit spectral-norm ball, it is the convex envelope of the rank function, i.e., the tightest convex underestimator of rank; this is why it serves as the standard convex surrogate in rank-minimization problems.
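NumPy exposes the nuclear norm directly via `ord='nuc'`; a minimal sketch checking it against the sum of singular values:

```python
import numpy as np

# Nuclear norm = sum of singular values.
rng = np.random.default_rng(2)
A = rng.standard_normal((4, 3))

sigmas = np.linalg.svd(A, compute_uv=False)
nuclear = np.linalg.norm(A, 'nuc')
```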

Norm Relationships

For $A \in \mathbb{R}^{m \times n}$ with rank $r$:

$$\|A\|_2 \leq \|A\|_F \leq \sqrt{r}\,\|A\|_2$$

$$\|A\|_F \leq \|A\|_* \leq \sqrt{r}\,\|A\|_F$$

$$\|A\|_2 \leq \|A\|_* \leq r\,\|A\|_2$$
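These chains can be spot-checked numerically (a sanity check on one random matrix, not a proof):

```python
import numpy as np

# Spot-check the three inequality chains on a random matrix.
rng = np.random.default_rng(3)
A = rng.standard_normal((6, 4))
r = np.linalg.matrix_rank(A)   # a Gaussian matrix has full rank almost surely

spec = np.linalg.norm(A, 2)
fro = np.linalg.norm(A, 'fro')
nuc = np.linalg.norm(A, 'nuc')
```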

Main Theorems

Theorem

Submultiplicativity of Matrix Norms

Statement

The Frobenius and spectral norms are submultiplicative: for compatible matrices $A$ and $B$,

$$\|AB\|_F \leq \|A\|_F \|B\|_F, \qquad \|AB\|_2 \leq \|A\|_2 \|B\|_2$$

Intuition

Composing two linear maps cannot amplify a vector by more than the product of their individual amplification factors. For the spectral norm this is immediate: the maximum stretch of $AB$ cannot exceed the maximum stretch of $A$ times the maximum stretch of $B$.

Proof Sketch

For the spectral norm: $\|ABx\|_2 \leq \|A\|_2 \|Bx\|_2 \leq \|A\|_2 \|B\|_2 \|x\|_2$ for all $x$, so $\|AB\|_2 \leq \|A\|_2 \|B\|_2$. For the Frobenius norm: the $j$-th column of $AB$ is $Ab_j$, and row-wise Cauchy-Schwarz gives $\|Ab_j\|_2 \leq \|A\|_F \|b_j\|_2$; summing the squares over columns yields $\|AB\|_F^2 \leq \|A\|_F^2 \|B\|_F^2$.

Why It Matters

Submultiplicativity is used constantly in deep learning theory. Bounding $\|W_L \cdots W_1\|$ by $\prod_i \|W_i\|$ controls how signals and gradients propagate through layers. Spectral norm regularization exploits this directly.
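A minimal sketch of the layer-product bound, using random square matrices as stand-in "weight matrices":

```python
import numpy as np

# The spectral norm of a composition of layers is bounded by the
# product of the per-layer spectral norms.
rng = np.random.default_rng(4)
layers = [rng.standard_normal((8, 8)) / np.sqrt(8) for _ in range(5)]

# Form W_L @ ... @ W_1.
product = layers[0]
for W in layers[1:]:
    product = W @ product

composed = np.linalg.norm(product, 2)
layerwise = np.prod([np.linalg.norm(W, 2) for W in layers])
```

In practice `composed` is usually much smaller than `layerwise`, which is why bounds built from per-layer spectral norms can be pessimistic.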

Failure Mode

The nuclear norm is not an induced operator norm, and naive bounds with it are loose. The sharp mixed bound is $\|AB\|_* \leq \|A\|_2 \|B\|_*$; since $\|A\|_2 \leq \|A\|_*$, this implies the product bound $\|AB\|_* \leq \|A\|_* \|B\|_*$ as well, but the product bound can overshoot the mixed one by a factor of up to $\operatorname{rank}(A)$.
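A sketch comparing the mixed bound $\|AB\|_* \leq \|A\|_2 \|B\|_*$ with the looser product of nuclear norms:

```python
import numpy as np

# Mixed bound vs. product bound for the nuclear norm.
rng = np.random.default_rng(5)
A = rng.standard_normal((5, 5))
B = rng.standard_normal((5, 5))

lhs = np.linalg.norm(A @ B, 'nuc')
mixed = np.linalg.norm(A, 2) * np.linalg.norm(B, 'nuc')      # sharp bound
product = np.linalg.norm(A, 'nuc') * np.linalg.norm(B, 'nuc')  # looser bound
```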

When to Use Each Norm

| Norm | Use case | Why |
| --- | --- | --- |
| Frobenius $\lVert\cdot\rVert_F$ | Weight decay, matrix factorization | Differentiable, easy to compute, equals an $\ell_2$ penalty on the parameters |
| Spectral $\lVert\cdot\rVert_2$ | Generalization bounds, Lipschitz constraints | Controls worst-case amplification of a layer |
| Nuclear $\lVert\cdot\rVert_*$ | Low-rank matrix completion, robust PCA | Convex relaxation of rank |

Common Confusions

Watch Out

Frobenius norm is not an operator norm

The Frobenius norm cannot be written as $\sup_{x \neq 0} \|Ax\| / \|x\|$ for any vector norm: every induced operator norm satisfies $\|I\| = 1$, but $\|I_n\|_F = \sqrt{n}$. The Frobenius norm treats the matrix as a vector in $\mathbb{R}^{mn}$ and takes its $\ell_2$ norm; the spectral norm is the operator norm induced by the vector $\ell_2$ norm.
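A two-line sketch of the identity-matrix test that separates the two norms:

```python
import numpy as np

# Any induced operator norm sends the identity to 1,
# but the Frobenius norm of I_n is sqrt(n).
n = 9
I = np.eye(n)

fro_of_identity = np.linalg.norm(I, 'fro')   # sqrt(9) = 3.0
spec_of_identity = np.linalg.norm(I, 2)      # 1.0
```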

Watch Out

Spectral norm vs spectral radius

The spectral norm $\|A\|_2 = \sigma_{\max}(A)$ uses singular values. The spectral radius $\rho(A) = \max_i |\lambda_i(A)|$ uses eigenvalues. For symmetric (more generally, normal) matrices they coincide, but in general $\rho(A) \leq \|A\|_2$ with possible strict inequality: a nonzero nilpotent matrix has $\rho(A) = 0$ but $\|A\|_2 > 0$.
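The nilpotent case makes the gap concrete; a minimal sketch:

```python
import numpy as np

# A nilpotent matrix: both eigenvalues are 0, so the spectral radius
# is 0, yet the spectral norm (top singular value) is 1.
A = np.array([[0.0, 1.0],
              [0.0, 0.0]])

rho = np.max(np.abs(np.linalg.eigvals(A)))   # spectral radius: 0.0
spec = np.linalg.norm(A, 2)                  # spectral norm: 1.0
```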

Exercises

ExerciseCore

Problem

Compute $\|A\|_F$, $\|A\|_2$, and $\|A\|_*$ for $A = \begin{pmatrix} 3 & 0 \\ 0 & 4 \end{pmatrix}$.

ExerciseAdvanced

Problem

Let $W_1, \ldots, W_L$ be weight matrices in a neural network. Show that $\|W_L \cdots W_1\|_2 \leq \prod_{i=1}^{L} \|W_i\|_2$. What does this imply about gradient norms during backpropagation?

References

Canonical:

  • Horn & Johnson, Matrix Analysis (2013), Chapter 5
  • Golub & Van Loan, Matrix Computations (2013), Chapter 2
  • Trefethen & Bau, Numerical Linear Algebra (1997), Lectures 3-5 (norms and SVD)

For ML context:

  • Neyshabur, Bhojanapalli, Srebro, "A PAC-Bayesian Approach to Spectrally-Normalized Margin Bounds" (2018)
  • Vershynin, High-Dimensional Probability (2018), Section 4.4 (operator norm of random matrices)
  • Recht, Fazel, Parrilo, "Guaranteed Minimum-Rank Solutions of Linear Matrix Equations via Nuclear Norm Minimization" (2010), SIAM Review

Last reviewed: April 2026
