Foundations
Gram Matrices and Kernel Matrices
The Gram matrix G_{ij} = <x_i, x_j> encodes pairwise inner products of a dataset. Always PSD. Appears in kernel methods, PCA, SVD, and attention. Understanding it connects linear algebra to ML.
Why This Matters
The Gram matrix appears everywhere in ML, often without being named. When you compute QK^T in attention, that is a Gram matrix of queries against keys. When you do kernel PCA, you operate on a kernel matrix (which is a Gram matrix in feature space). When you compute X^T X for PCA or linear regression, that is a Gram matrix of the columns of X.
Understanding the Gram matrix and its properties (especially positive semidefiniteness) is the key link between linear algebra and kernel methods, and between kernel methods and attention.
Definitions
Gram Matrix
Given vectors x_1, ..., x_n in an inner product space, the Gram matrix G has entries:
G_{ij} = <x_i, x_j>
If X is the n x d matrix with x_i as its i-th row, then G = XX^T.
Kernel Matrix
Given a kernel function k(x, y) and data points x_1, ..., x_n, the kernel matrix K (or Gram matrix of the kernel) has entries:
K_{ij} = k(x_i, x_j)
If k is a valid (positive definite) kernel, then by Mercer's theorem there exists a feature map φ such that k(x, y) = <φ(x), φ(y)>. The kernel matrix is the Gram matrix of the mapped data φ(x_1), ..., φ(x_n).
The standard inner product <x, y> = x^T y gives the simplest case: the Gram matrix with the standard inner product is just G = XX^T. A kernel matrix with kernel k generalizes this to K_{ij} = k(x_i, x_j), which computes inner products in a (possibly infinite-dimensional) feature space.
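The two definitions can be made concrete in a few lines of NumPy. This is a minimal sketch on a small made-up dataset; the RBF bandwidth sigma = 1 is an arbitrary choice for illustration.

```python
import numpy as np

# Toy dataset: 4 points in R^2, one per row of X (hypothetical data).
X = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0],
              [-1.0, 0.5]])

# Gram matrix with the standard inner product: G = X X^T.
G = X @ X.T
# Entry (i, j) is the inner product <x_i, x_j>.
assert np.isclose(G[0, 2], X[0] @ X[2])

# A kernel matrix generalizes this. Here: an RBF (Gaussian) kernel
# k(x, y) = exp(-||x - y||^2 / (2 sigma^2)), computed pairwise.
sigma = 1.0
sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K = np.exp(-sq_dists / (2 * sigma**2))

# Both matrices are symmetric PSD: eigenvalues >= 0 up to round-off.
assert np.all(np.linalg.eigvalsh(G) >= -1e-10)
assert np.all(np.linalg.eigvalsh(K) >= -1e-10)
```

Note that K is the Gram matrix of the data after the (implicit) feature map φ, even though φ is never computed.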
Positive Semidefiniteness
Gram Matrices are Positive Semidefinite
Statement
For any vectors x_1, ..., x_n in a real inner product space, the Gram matrix G with G_{ij} = <x_i, x_j> is positive semidefinite. That is, for all c in R^n:
c^T G c >= 0
Intuition
The quadratic form computes the squared norm of a linear combination of the data vectors. Squared norms are always non-negative. This is why Gram matrices are always PSD: they represent inner product structure, and inner products induce non-negative norms.
Proof Sketch
c^T G c = sum_{i,j} c_i c_j <x_i, x_j> = <sum_i c_i x_i, sum_j c_j x_j> = ||sum_i c_i x_i||^2 >= 0
The second equality uses bilinearity of the inner product. The inequality follows from the definition of a norm.
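The identity in the proof sketch can be checked numerically: the quadratic form c^T G c equals the squared norm of the combination sum_i c_i x_i. A small sketch with random (hypothetical) data:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))   # 5 points in R^3 (arbitrary test data)
G = X @ X.T                    # Gram matrix

c = rng.normal(size=5)         # arbitrary coefficient vector
quad = c @ G @ c                       # the quadratic form c^T G c
norm_sq = np.sum((c @ X) ** 2)         # ||sum_i c_i x_i||^2

# The two agree, and are therefore non-negative.
assert np.isclose(quad, norm_sq)
assert quad >= 0
```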
Why It Matters
PSD is the central property of Gram matrices. It guarantees that eigenvalues are non-negative, that the matrix defines a valid (semi-)metric on the data, and that kernel methods are well-posed. Any matrix that arises as pairwise inner products must be PSD.
Failure Mode
The Gram matrix is PSD but may not be positive definite. It is singular (has zero eigenvalues) exactly when the vectors are linearly dependent. For n > d (more points than dimensions), the Gram matrix has rank at most d, so at least n - d eigenvalues are zero.
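The rank deficiency is easy to observe directly. A sketch with n = 6 random points in d = 2 dimensions (generic data, so the rank bound is tight with probability one):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 6, 2
X = rng.normal(size=(n, d))   # more points than dimensions
G = X @ X.T                    # n x n Gram matrix

eigvals = np.linalg.eigvalsh(G)
# rank(G) <= d, so at least n - d eigenvalues are (numerically) zero.
num_zero = np.sum(np.abs(eigvals) < 1e-10)
assert num_zero >= n - d
assert np.linalg.matrix_rank(G) <= d
```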
Eigenvalues and Data Geometry
Gram Matrix Eigenvalues Reflect Data Geometry
Statement
The non-zero eigenvalues of G = XX^T are the same as the non-zero eigenvalues of X^T X. If the SVD of X is X = U Σ V^T, then:
XX^T = U Σ Σ^T U^T and X^T X = V Σ^T Σ V^T
The eigenvalues of G are λ_i = σ_i^2, where the σ_i are the singular values of X.
Intuition
The eigenvalues of the Gram matrix tell you how the data is spread in different directions. Large eigenvalues correspond to directions of high variance in the data. This is exactly the information PCA extracts. Computing PCA from XX^T (dual PCA) is equivalent to computing it from X^T X (primal PCA), but the matrix sizes differ: XX^T is n x n while X^T X is d x d.
Proof Sketch
If X = U Σ V^T, then XX^T = U Σ Σ^T U^T. Similarly, X^T X = V Σ^T Σ V^T. Both have eigenvalues σ_i^2, padded with zeros in whichever matrix is larger.
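The shared spectrum can be verified numerically. A sketch with random (hypothetical) data where n = 4 < d = 7, so XX^T is the smaller matrix:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(4, 7))    # n = 4 samples, d = 7 features
G = X @ X.T                     # 4 x 4 Gram matrix
C = X.T @ X                     # 7 x 7 scatter matrix

eig_G = np.sort(np.linalg.eigvalsh(G))[::-1]
eig_C = np.sort(np.linalg.eigvalsh(C))[::-1]
s = np.linalg.svd(X, compute_uv=False)   # singular values of X

# Non-zero eigenvalues agree and equal the squared singular values;
# the larger matrix just pads the spectrum with zeros.
assert np.allclose(eig_G, s**2)
assert np.allclose(eig_C[:4], s**2)
assert np.allclose(eig_C[4:], 0, atol=1e-8)
```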
Why It Matters
This duality between XX^T and X^T X is the foundation of kernel PCA. When d > n (more features than samples), working with the n x n Gram matrix is cheaper. When you use a kernel, you only have access to K (the Gram matrix in feature space), and the eigendecomposition of K gives you PCA in feature space.
Failure Mode
Computing the full eigendecomposition of G costs O(n^3). For large n this is prohibitive, and you need approximate methods (randomized SVD, Nyström approximation). Also, the Gram matrix requires O(n^2) storage, which limits applicability to datasets with many observations.
Connections to ML
Kernel Methods
In kernel methods, you work with K and never compute the feature map φ explicitly. The kernel matrix is all you need for:
- Kernel SVM: the dual formulation involves K directly
- Kernel PCA: eigendecompose the (centered) K to get principal components in feature space
- Gaussian processes: the kernel matrix (plus noise) is the covariance matrix of the function values
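The kernel PCA bullet can be sketched end to end: build K, double-center it (which is equivalent to centering the data in feature space), and eigendecompose. A minimal sketch, assuming an RBF kernel with an arbitrarily chosen bandwidth:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(20, 2))   # hypothetical dataset

# RBF kernel matrix K_{ij} = exp(-||x_i - x_j||^2 / 2).
sq = np.sum((X[:, None] - X[None, :]) ** 2, axis=-1)
K = np.exp(-sq / 2.0)

# Double-center K with J = I - (1/n) 1 1^T: centers the mapped data.
n = len(X)
J = np.eye(n) - np.ones((n, n)) / n
Kc = J @ K @ J

# Eigendecomposition of the centered kernel matrix.
eigvals, eigvecs = np.linalg.eigh(Kc)
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]   # descending order

# Projections of the training points onto the top-2 kernel
# principal components: sqrt(lambda_i) times the i-th eigenvector.
top = 2
proj = eigvecs[:, :top] * np.sqrt(np.maximum(eigvals[:top], 0))
```

The feature map φ never appears: everything is expressed through K, which is the point of the "kernel trick".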
Attention
In a transformer, the attention matrix is:
A = softmax(QK^T / sqrt(d_k))
The matrix QK^T has entries <q_i, k_j>, which is a Gram matrix of queries against keys. Before the softmax, this is a (scaled) linear kernel matrix. The softmax converts it into a row-stochastic matrix, but the underlying structure is a Gram matrix.
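This structure can be sketched directly. The toy dimensions, random projection matrices W_q and W_k, and token representations H below are all hypothetical, chosen only to show the shapes:

```python
import numpy as np

rng = np.random.default_rng(4)
n, d_k = 5, 8
H = rng.normal(size=(n, d_k))        # token representations
W_q = rng.normal(size=(d_k, d_k))    # query projection
W_k = rng.normal(size=(d_k, d_k))    # key projection

Q, K = H @ W_q, H @ W_k
scores = Q @ K.T / np.sqrt(d_k)      # (i, j) entry: <q_i, k_j> / sqrt(d_k)

# Row-wise softmax turns the scores into a row-stochastic matrix.
A = np.exp(scores - scores.max(axis=1, keepdims=True))
A /= A.sum(axis=1, keepdims=True)
assert np.allclose(A.sum(axis=1), 1.0)

# With Q against Q itself, the score matrix is a true Gram matrix
# and hence PSD; with Q != K it need not even be symmetric.
self_scores = Q @ Q.T
assert np.all(np.linalg.eigvalsh(self_scores) >= -1e-8)
```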
PCA and SVD
PCA on n data points in R^d can be computed from either X^T X (d x d, primal) or XX^T (n x n, dual). The dual formulation uses the Gram matrix and is the starting point for kernel PCA.
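The equivalence of the two routes can be checked on random (hypothetical) data: both yield the same component scores, up to the usual sign ambiguity of eigenvectors.

```python
import numpy as np

rng = np.random.default_rng(5)
n, d = 8, 3
X = rng.normal(size=(n, d))
X = X - X.mean(axis=0)             # center the data

# Primal PCA: right singular vectors of X (eigenvectors of X^T X).
_, s, Vt = np.linalg.svd(X, full_matrices=False)
primal_proj = X @ Vt.T              # component scores, n x d

# Dual PCA: eigenvectors of the n x n Gram matrix X X^T.
G = X @ X.T
eigvals, U = np.linalg.eigh(G)
order = np.argsort(eigvals)[::-1][:d]
dual_proj = U[:, order] * np.sqrt(eigvals[order])

# Same scores up to the sign of each component.
assert np.allclose(np.abs(primal_proj), np.abs(dual_proj))
```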
Common Confusions
Gram matrix vs covariance matrix
The Gram matrix is XX^T (size n x n, pairwise similarities between data points). The covariance matrix is (1/n) X^T X (size d x d, pairwise relationships between features), assuming centered data. They encode complementary views of the same dataset: point similarities vs feature correlations.
PSD does not mean all entries are non-negative
A Gram matrix can have negative entries. For example, if x_1 = (1, 0) and x_2 = (-1, 0), then G_{12} = <x_1, x_2> = -1. PSD refers to the quadratic form c^T G c >= 0 for all c, not to individual entries.
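A quick check of this point, using a pair of antipodal unit vectors:

```python
import numpy as np

# Two antipodal unit vectors: their inner product is negative,
# yet the Gram matrix is still PSD.
X = np.array([[1.0, 0.0],
              [-1.0, 0.0]])
G = X @ X.T
# G = [[1, -1], [-1, 1]]: a negative off-diagonal entry...
assert G[0, 1] == -1.0
# ...but the eigenvalues are {0, 2}, both non-negative, so G is PSD.
assert np.all(np.linalg.eigvalsh(G) >= -1e-12)
```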
Key Takeaways
- Gram matrix: G_{ij} = <x_i, x_j>. For data matrix X: G = XX^T
- Always PSD. Eigenvalues are non-negative.
- Kernel matrix: same structure with k(x_i, x_j) replacing <x_i, x_j>
- Eigenvalues of XX^T equal the squared singular values of X
- Appears in kernel methods, PCA (dual formulation), SVD, and attention (QK^T)
Exercises
Problem
Given three vectors x_1, x_2, x_3, compute the Gram matrix G = XX^T, where the rows of X are the x_i. Verify that G is PSD by checking that all eigenvalues are non-negative.
Problem
Prove that the pre-softmax attention score matrix QK^T in a transformer is a Gram matrix of queries against keys. What does this imply about the eigenvalues of QK^T before the softmax is applied? Is QK^T always PSD?
References
Canonical:
- Horn & Johnson, Matrix Analysis (2013), Chapter 7 (Positive Definite Matrices)
- Scholkopf & Smola, Learning with Kernels (2002), Chapter 2
Current:
- Vaswani et al., "Attention Is All You Need" (2017), for the attention connection
- Murphy, Probabilistic Machine Learning: An Introduction (2022), Chapter 16
Next Topics
- Kernels and RKHS: the theoretical foundation for kernel matrices
- Principal component analysis: PCA from the dual (Gram matrix) perspective
- Attention mechanism theory: where Gram matrices meet modern deep learning
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.