
Foundations

Gram Matrices and Kernel Matrices

The Gram matrix G_{ij} = \langle x_i, x_j \rangle encodes the pairwise inner products of a dataset. It is always PSD and appears in kernel methods, PCA, SVD, and attention. Understanding it connects linear algebra to ML.


Why This Matters

The Gram matrix appears everywhere in ML, often without being named. When you compute QK^T in attention, that is a Gram matrix. When you do kernel PCA, you operate on a kernel matrix (which is a Gram matrix in feature space). When you compute X^TX for PCA or linear regression, that is a Gram matrix of the columns of X.

Understanding the Gram matrix and its properties (especially positive semidefiniteness) is the key link between linear algebra and kernel methods, and between kernel methods and attention.

Definitions

Definition

Gram Matrix

Given vectors x_1, \ldots, x_n in an inner product space, the Gram matrix G \in \mathbb{R}^{n \times n} has entries:

G_{ij} = \langle x_i, x_j \rangle

If X \in \mathbb{R}^{n \times d} is the matrix with x_i as its i-th row, then G = XX^T.
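A minimal NumPy sketch of this definition (the data values here are illustrative):

```python
import numpy as np

# Rows of X are the data points x_1, ..., x_n (here n = 3, d = 2).
X = np.array([[1.0, 2.0],
              [3.0, 0.0],
              [0.0, 1.0]])

G = X @ X.T  # Gram matrix: G[i, j] = <x_i, x_j>

# Each entry is the inner product of a pair of rows,
# and the matrix is symmetric since <x_i, x_j> = <x_j, x_i>.
assert np.allclose(G[0, 1], X[0] @ X[1])
assert np.allclose(G, G.T)
```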

Definition

Kernel Matrix

Given a kernel function k: \mathcal{X} \times \mathcal{X} \to \mathbb{R} and data points x_1, \ldots, x_n \in \mathcal{X}, the kernel matrix (or Gram matrix of the kernel) has entries:

K_{ij} = k(x_i, x_j)

If k is a valid (positive definite) kernel, then by Mercer's theorem there exists a feature map \phi such that k(x_i, x_j) = \langle \phi(x_i), \phi(x_j) \rangle. The kernel matrix is the Gram matrix of the mapped data.

The standard inner product \langle x, y \rangle = x^T y gives the simplest case: the Gram matrix is just G = XX^T. A kernel matrix with kernel k generalizes this to K_{ij} = k(x_i, x_j), which computes inner products in a (possibly infinite-dimensional) feature space.
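As a concrete instance, the (Gaussian) RBF kernel k(x, y) = \exp(-\gamma \|x - y\|^2) is a valid positive definite kernel. A sketch in plain NumPy, without relying on any kernel library:

```python
import numpy as np

def rbf_kernel_matrix(X, gamma=1.0):
    """K[i, j] = exp(-gamma * ||x_i - x_j||^2), entrywise over all pairs."""
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-gamma * sq_dists)

X = np.random.default_rng(0).normal(size=(5, 3))
K = rbf_kernel_matrix(X)

# A valid kernel matrix is symmetric PSD: eigenvalues >= 0 (up to roundoff).
assert np.allclose(K, K.T)
assert np.linalg.eigvalsh(K).min() >= -1e-10
```

Because k(x, x) = 1 for the RBF kernel, the diagonal of K is all ones, regardless of the data.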

Positive Semidefiniteness

Theorem

Gram Matrices are Positive Semidefinite

Statement

For any vectors x_1, \ldots, x_n in a real inner product space, the Gram matrix G with G_{ij} = \langle x_i, x_j \rangle is positive semidefinite. That is, for all c \in \mathbb{R}^n:

c^T G c = \left\| \sum_{i=1}^{n} c_i x_i \right\|^2 \geq 0

Intuition

The quadratic form c^T G c computes the squared norm of a linear combination of the data vectors. Squared norms are always non-negative. This is why Gram matrices are always PSD: they represent inner product structure, and inner products induce non-negative norms.

Proof Sketch

c^T G c = \sum_{i,j} c_i c_j \langle x_i, x_j \rangle = \left\langle \sum_i c_i x_i, \sum_j c_j x_j \right\rangle = \left\| \sum_i c_i x_i \right\|^2 \geq 0

The second equality uses bilinearity of the inner product. The inequality follows from the definition of a norm.
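The identity in the proof sketch can be checked numerically; this sketch uses random data, so the specific values are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(4, 6))   # 4 points in R^6, rows are x_i
G = X @ X.T

c = rng.normal(size=4)
quad_form = c @ G @ c                     # c^T G c

combo = np.sum(c[:, None] * X, axis=0)    # sum_i c_i x_i
sq_norm = combo @ combo                   # its squared norm

# The quadratic form equals the squared norm, hence is non-negative.
assert np.isclose(quad_form, sq_norm)
assert quad_form >= 0
```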

Why It Matters

PSD is the central property of Gram matrices. It guarantees that eigenvalues are non-negative, that the matrix defines a valid (semi-)metric on the data, and that kernel methods are well-posed. Any matrix that arises as pairwise inner products must be PSD.

Failure Mode

The Gram matrix is PSD but may not be positive definite. It is singular (has zero eigenvalues) when the vectors x_1, \ldots, x_n are linearly dependent. For n > d (more points than dimensions), the Gram matrix XX^T has rank at most d, so at least n - d eigenvalues are zero.

Eigenvalues and Data Geometry

Proposition

Gram Matrix Eigenvalues Reflect Data Geometry

Statement

The non-zero eigenvalues of G = XX^T are the same as the non-zero eigenvalues of X^TX. If the SVD of X is X = U\Sigma V^T, then:

G = XX^T = U\Sigma^2 U^T

The eigenvalues of G are \sigma_1^2, \ldots, \sigma_r^2, 0, \ldots, 0, where r = \text{rank}(X) and the \sigma_i are the singular values of X.

Intuition

The eigenvalues of the Gram matrix tell you how the data is spread in different directions. Large eigenvalues correspond to directions of high variance in the data. This is exactly the information PCA extracts. Computing PCA from G = XX^T (dual PCA) is equivalent to computing it from X^TX (primal PCA), but the matrix sizes differ: G is n \times n while X^TX is d \times d.

Proof Sketch

If X = U\Sigma V^T, then XX^T = U\Sigma V^T V\Sigma^T U^T = U\Sigma\Sigma^T U^T = U\Sigma^2 U^T. Similarly, X^TX = V\Sigma^2 V^T. Both have eigenvalues \sigma_i^2.
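A numerical check of this duality, using random data for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(5, 3))          # n = 5 points, d = 3 features

eig_G = np.linalg.eigvalsh(X @ X.T)  # n eigenvalues of XX^T (ascending)
eig_C = np.linalg.eigvalsh(X.T @ X)  # d eigenvalues of X^TX (ascending)
sv = np.linalg.svd(X, compute_uv=False)  # singular values (descending)

# Non-zero eigenvalues of both matrices equal the squared singular values;
# XX^T additionally has n - d = 2 (numerically tiny) zero eigenvalues.
assert np.allclose(np.sort(eig_G)[::-1][:3], sv**2)
assert np.allclose(np.sort(eig_C)[::-1], sv**2)
assert np.allclose(np.sort(eig_G)[:2], 0.0, atol=1e-10)
```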

Why It Matters

This duality between XX^T and X^TX is the foundation of kernel PCA. When d \gg n (more features than samples), working with the n \times n Gram matrix is cheaper. When you use a kernel, you only have access to K (the Gram matrix in feature space), and the eigendecomposition of K gives you PCA in feature space.

Failure Mode

Computing the full eigendecomposition of G costs O(n^3). For large n, this is prohibitive, and you need approximate methods (randomized SVD, Nyström approximation). Also, the Gram matrix requires O(n^2) storage, which limits applicability to datasets with many observations.

Connections to ML

Kernel Methods

In kernel methods, you work with K_{ij} = k(x_i, x_j) and never compute the feature map \phi explicitly. The kernel matrix K is all you need for:

  • Kernel SVM: the dual formulation involves K directly
  • Kernel PCA: eigendecompose K to get principal components in feature space
  • Gaussian processes: the kernel matrix (plus noise) is the covariance matrix of the function values

Attention

In a transformer, the attention matrix is:

A = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)

The matrix QK^T has entries (QK^T)_{ij} = q_i^T k_j, which is a Gram matrix of queries against keys. Before the softmax, this is a linear kernel matrix. The softmax converts it into a row-stochastic matrix, but the underlying structure is a Gram matrix.
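An illustrative sketch with random queries and keys (the shapes and values are assumptions, not from any real model). Note that because queries and keys are two different sets of vectors, the score matrix need not be symmetric, whereas the self-similarity matrix QQ^T is a true Gram matrix and hence PSD:

```python
import numpy as np

rng = np.random.default_rng(3)
n, d_k = 4, 8
Q = rng.normal(size=(n, d_k))
K = rng.normal(size=(n, d_k))

scores = Q @ K.T / np.sqrt(d_k)   # (i, j) entry: q_i . k_j / sqrt(d_k)

# Row-wise softmax (with the usual max-subtraction for numerical stability)
# turns the scores into a row-stochastic attention matrix.
A = np.exp(scores - scores.max(axis=1, keepdims=True))
A /= A.sum(axis=1, keepdims=True)
assert np.allclose(A.sum(axis=1), 1.0)

# Q @ Q.T is a genuine Gram matrix: symmetric with non-negative eigenvalues.
assert np.linalg.eigvalsh(Q @ Q.T).min() >= -1e-10
```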

PCA and SVD

PCA on n data points in \mathbb{R}^d can be computed from either X^TX (d \times d, primal) or XX^T (n \times n, dual). The dual formulation uses the Gram matrix and is the starting point for kernel PCA.
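The primal/dual equivalence can be sketched as follows: if X = U\Sigma V^T, the projected coordinates (scores) are X V = U\Sigma, so the primal route (eigenvectors of X^TX) and the dual route (eigenvectors of XX^T, scaled by the singular values) agree up to the sign of each component. An illustrative check on random data:

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(6, 4))
Xc = X - X.mean(axis=0)                 # center the features first

# Primal PCA: eigenvectors of X^TX are the principal directions in R^d.
_, V = np.linalg.eigh(Xc.T @ Xc)        # ascending eigenvalue order
# Dual PCA: eigenvectors of the Gram matrix XX^T, scaled by the
# singular values, give the same scores.
evals, U = np.linalg.eigh(Xc @ Xc.T)

scores_primal = Xc @ V[:, ::-1]                                # X V = U Sigma
scores_dual = U[:, ::-1] * np.sqrt(np.maximum(evals[::-1], 0.0))

# Eigenvectors are only defined up to sign, so compare absolute values.
assert np.allclose(np.abs(scores_primal),
                   np.abs(scores_dual[:, :4]), atol=1e-8)
```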

Common Confusions

Watch Out

Gram matrix vs covariance matrix

The Gram matrix is XX^T (size n \times n, pairwise similarities between data points). The covariance matrix is X^TX / n (size d \times d, pairwise relationships between features), assuming centered data. They encode complementary views of the same dataset: point similarities vs feature correlations.
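The size difference makes the distinction easy to check in code; a small sketch with random data:

```python
import numpy as np

rng = np.random.default_rng(4)
n, d = 6, 3
X = rng.normal(size=(n, d))
Xc = X - X.mean(axis=0)       # center the features, as the definition assumes

G = Xc @ Xc.T                 # Gram matrix: point-by-point similarities
C = Xc.T @ Xc / n             # covariance matrix: feature-by-feature

assert G.shape == (n, n)
assert C.shape == (d, d)
# Both are symmetric PSD, inheriting their spectra from the same X.
assert np.linalg.eigvalsh(G).min() >= -1e-10
assert np.linalg.eigvalsh(C).min() >= -1e-10
```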

Watch Out

PSD does not mean all entries are non-negative

A Gram matrix can have negative entries. For example, if x_1 = (1, 0) and x_2 = (-1, 0), then G_{12} = -1. PSD refers to the quadratic form c^T G c \geq 0 for all c, not to individual entries.

Key Takeaways

  • Gram matrix: G_{ij} = \langle x_i, x_j \rangle. For a data matrix X: G = XX^T
  • Always PSD. Eigenvalues are non-negative.
  • Kernel matrix: same structure with k(x_i, x_j) replacing \langle x_i, x_j \rangle
  • Eigenvalues of G equal the squared singular values of X
  • Appears in kernel methods, PCA (dual formulation), SVD, and attention (QK^T)

Exercises

ExerciseCore

Problem

Given x_1 = (1, 2), x_2 = (3, 0), x_3 = (0, 1), compute the Gram matrix G = XX^T where the rows of X are x_1, x_2, x_3. Verify that G is PSD by checking that all eigenvalues are non-negative.

ExerciseAdvanced

Problem

Prove that the attention score matrix QK^T in a transformer is a Gram matrix. What does this imply about the eigenvalues of QK^T before the softmax is applied? Is QK^T always PSD?

References

Canonical:

  • Horn & Johnson, Matrix Analysis (2013), Chapter 7 (Positive Definite Matrices)
  • Schölkopf & Smola, Learning with Kernels (2002), Chapter 2

Current:

  • Vaswani et al., "Attention Is All You Need" (2017), for the attention connection
  • Murphy, Probabilistic Machine Learning: An Introduction (2022), Chapter 16

Last reviewed: April 2026
