
ML Methods

The Kernel Trick

Any algorithm whose computation accesses data only through inner products can be lifted to a feature space implicitly defined by a kernel, without ever computing the features. This page works through the polynomial-kernel inner-product equivalence, the infinite-feature interpretation of the Gaussian RBF, Mercer's condition, and four canonical applications: kernel SVM, kernel ridge regression, kernel PCA, and kernel k-means.


Why This Matters

The kernel trick is the move that turns linear algorithms into nonlinear ones at no algorithmic cost. Take any method that computes only with inner products $\langle x_i, x_j\rangle$ (SVM, ridge regression, PCA, k-means, perceptron) and replace those inner products with a kernel function $k(x_i, x_j)$ that secretly computes inner products in some higher-dimensional feature space. The algorithm runs unchanged; it just thinks in a richer geometry.

The trick has two practical wins and one theoretical payoff:

  • Practical: a polynomial kernel of degree $p$ in $\mathbb R^d$ computes in $O(d)$ time, but operates as if the data had been lifted to $\binom{d+p}{p}$ features. The Gaussian RBF lifts to an infinite-dimensional space and still computes in $O(d)$ time per pair.
  • Theoretical: Mercer's condition tells you exactly which $k$'s are inner products in some space (the positive semi-definite ones), and the RKHS theory gives a precise account of the function space those kernels span.

This page focuses on the trick itself with the algebra worked out end-to-end. For the full theory of reproducing kernel Hilbert spaces, see kernels and RKHS.

Mental Model

A linear classifier in $\mathbb R^d$ separates points with a hyperplane $w^\top x = 0$. If the data is not linearly separable, you can lift it: define $\phi: \mathbb R^d \to \mathbb R^D$ for some larger $D$, classify in feature space with $w^\top \phi(x) = 0$, and project back. The new decision boundary in $\mathbb R^d$ is nonlinear.

Two problems with the explicit lift:

  1. $D$ can be huge (or infinite for some useful $\phi$).
  2. Computing $\phi(x)$ costs $O(D)$ per training point, so the algorithm does at least $O(nD)$ work before it even starts forming inner products.

The kernel trick observes: many algorithms never need $\phi(x)$ itself. They only need pairwise inner products $\phi(x_i)^\top \phi(x_j)$. If we can compute $k(x_i, x_j) = \phi(x_i)^\top \phi(x_j)$ directly from $x_i$ and $x_j$, without constructing $\phi(x_i)$ or $\phi(x_j)$, we get the nonlinear lift for free.

The polynomial kernel $k(x, x') = (x^\top x' + c)^p$ is the canonical illustration. Expanding $(x^\top x')^2$ in $\mathbb R^2$ gives back exactly the inner product of two specific 3-dimensional vectors, at $O(d)$ cost instead of $O(d^2)$.

Formal Setup

Definition

The Kernel Trick

A learning algorithm is kernelizable if every operation involving the inputs $x_i \in \mathcal X$ can be written using only inner products $\langle x_i, x_j \rangle$ (or, equivalently, pairwise squared distances $\|x_i - x_j\|^2$). Given a kernel $k: \mathcal X \times \mathcal X \to \mathbb R$, a symmetric positive semi-definite function, the kernel trick replaces every $\langle x_i, x_j\rangle$ in the algorithm with $k(x_i, x_j)$.

The result is an algorithm that operates as if it had access to the lifted features $\phi(x)$, where $\phi$ is the (possibly infinite-dimensional) feature map associated with $k$ by Mercer's theorem, without ever computing $\phi(x)$ explicitly.

The kernel trick is a meta-pattern, not an algorithm. You can kernelize SVM, ridge regression, PCA, k-means, perceptron, k-nearest-neighbors, and more, each producing a "kernel version" of the original method.
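As a concrete instance of the meta-pattern, here is a minimal sketch (assuming NumPy; not part of the original page) of the kernel perceptron: the weight vector $w = \sum_i \alpha_i y_i \phi(x_i)$ is never formed, and every decision value touches the data only through kernel evaluations.

```python
import numpy as np

def kernel_perceptron(X, y, kernel, n_epochs=10):
    """Kernel perceptron: w = sum_i alpha_i y_i phi(x_i) is kept implicit."""
    n = len(X)
    alpha = np.zeros(n)                                   # per-point mistake counts
    K = np.array([[kernel(a, b) for b in X] for a in X])  # Gram matrix
    for _ in range(n_epochs):
        for i in range(n):
            f_i = np.sum(alpha * y * K[:, i])             # decision value via kernels only
            if y[i] * f_i <= 0:                           # mistake: bump alpha_i
                alpha[i] += 1
    return alpha

def perceptron_predict(x_new, X, y, alpha, kernel):
    k_new = np.array([kernel(xi, x_new) for xi in X])
    return np.sign(np.sum(alpha * y * k_new))
```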

The Polynomial Kernel: Worked Inner-Product Equivalence

This is the worked algebra that makes the trick concrete. Take $\mathcal X = \mathbb R^2$ and the degree-2 polynomial kernel

$$k(x, x') = (x^\top x')^2.$$

Claim. $k(x, x') = \phi(x)^\top \phi(x')$ where $\phi: \mathbb R^2 \to \mathbb R^3$ is

$$\phi(x) = \begin{pmatrix} x_1^2 \\ \sqrt 2 \, x_1 x_2 \\ x_2^2 \end{pmatrix}.$$

Proof by direct expansion.

$$k(x, x') = (x^\top x')^2 = (x_1 x'_1 + x_2 x'_2)^2 = (x_1 x'_1)^2 + 2 (x_1 x'_1)(x_2 x'_2) + (x_2 x'_2)^2.$$

Recombine:

$$= x_1^2 (x'_1)^2 + 2 x_1 x_2 x'_1 x'_2 + x_2^2 (x'_2)^2.$$

Now match this against the inner product of the proposed feature vectors:

$$\phi(x)^\top \phi(x') = x_1^2 (x'_1)^2 + \sqrt 2 \cdot \sqrt 2 \, x_1 x_2 \, x'_1 x'_2 + x_2^2 (x'_2)^2 = x_1^2 (x'_1)^2 + 2 x_1 x_2 x'_1 x'_2 + x_2^2 (x'_2)^2.$$

These are identical. The $\sqrt 2$ factor in $\phi$ is exactly what produces the $2$ in the cross-term of the kernel expansion. The trick is computing $k(x, x') = (x_1 x'_1 + x_2 x'_2)^2$ in $O(d) = O(2)$ operations (one inner product and one squaring), while the explicit feature inner product $\phi(x)^\top \phi(x')$ would cost $O(D) = O(3)$. The asymptotics matter when $d$ is larger.
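A quick numerical check of the identity above (a minimal sketch assuming NumPy); the last line also evaluates the implicit feature count $\binom{d+p}{p}$ quoted in the theorem below.

```python
import numpy as np
from math import comb

def poly2_kernel(x, xp):
    """k(x, x') = (x^T x')^2, computed in O(d)."""
    return float(x @ xp) ** 2

def phi(x):
    """Explicit degree-2 feature map for d = 2: (x1^2, sqrt(2) x1 x2, x2^2)."""
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

x, xp = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(poly2_kernel(x, xp))   # (1*3 + 2*(-1))^2 = 1.0
print(phi(x) @ phi(xp))      # identical, via the explicit 3-D lift

# Implicit feature count for d = 100, p = 4 (theorem below): about 4.6 million
print(comb(100 + 4, 4))
```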

Theorem

Polynomial Kernel Equals Inner Product in a Polynomial Feature Space

Statement

For $x, x' \in \mathbb R^d$, the polynomial kernel $k(x, x') = (x^\top x' + c)^p$ with $c \ge 0$ and integer $p \ge 1$ satisfies $k(x, x') = \phi_p(x)^\top \phi_p(x')$, where $\phi_p: \mathbb R^d \to \mathbb R^D$ with $D = \binom{d + p}{p}$ is the feature map containing all monomials in $x_1, \dots, x_d, \sqrt c$ of total degree exactly $p$, each weighted by the square root of its multinomial coefficient.

So computing $k(x, x')$ costs $O(d)$ (one inner product plus a power), while explicit features cost $O(D)$. For $d = 100$ and $p = 4$, $D = \binom{104}{4} \approx 4.6 \times 10^6$, more than four orders of magnitude larger than $d$.

Intuition

A polynomial of degree $\le p$ in $d$ variables has $\binom{d+p}{p}$ monomials. The polynomial kernel implicitly represents the full degree-$p$ polynomial space, so a linear classifier in feature space becomes a degree-$p$ polynomial classifier in $\mathbb R^d$. The $\sqrt c$ trick (interpreting the constant $c$ as a fictitious feature) is what lets all degrees up to $p$ appear simultaneously, not just degree $p$.

Proof Sketch

Use the multinomial theorem on $(x_1 x'_1 + x_2 x'_2 + \cdots + x_d x'_d + c)^p$. Each term is of the form $\binom{p}{\alpha_1, \alpha_2, \dots, \alpha_{d+1}} (x_1 x'_1)^{\alpha_1} \cdots (x_d x'_d)^{\alpha_d} c^{\alpha_{d+1}}$ with $\sum_i \alpha_i = p$. Rewrite the multinomial coefficient as $\binom{p}{\alpha} = \left(\sqrt{\binom{p}{\alpha}}\right)^2$ and absorb a $\sqrt{\binom{p}{\alpha}}$ into each of the two product factors. Then the term equals $\left(\sqrt{\binom{p}{\alpha}}\, x_1^{\alpha_1} \cdots x_d^{\alpha_d}\, c^{\alpha_{d+1}/2}\right) \cdot \left(\sqrt{\binom{p}{\alpha}}\, (x'_1)^{\alpha_1} \cdots (x'_d)^{\alpha_d}\, c^{\alpha_{d+1}/2}\right)$, which is the product of two coordinates of $\phi_p(x)$ and $\phi_p(x')$. Summing over all multi-indices reconstructs $(x^\top x' + c)^p$.

Why It Matters

This is the concrete demonstration of the kernel trick. Every other kernel-trick claim ultimately reduces to "a kernel computes an inner product in a feature space implicitly," and the polynomial kernel is where the algebra is fully visible. Once you have worked through it, the Gaussian RBF (below, infinite-dimensional features) and the abstract Mercer theory feel like extensions rather than mysteries.

Failure Mode

The polynomial kernel of degree $p$ produces features that grow combinatorially with $d$ and $p$. For very large $p$, the kernel matrix can become ill-conditioned (high-degree monomials have wildly different scales), and numerical inversion of $K + \lambda I$ in kernel ridge regression suffers. Practical use: $p \le 4$ or so for most tasks, with feature normalization before applying $k$. For higher nonlinearity, the Gaussian RBF is usually a better choice.
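A hedged illustration of the conditioning issue on synthetic data (the exact numbers depend on the data and on $\lambda$): the condition number of $K + \lambda I$ grows rapidly with the polynomial degree when the inputs are not normalized.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))            # toy inputs, deliberately not normalized
lam = 1e-6

for p in (2, 4, 8, 12):
    K = (X @ X.T + 1.0) ** p             # polynomial-kernel Gram matrix
    cond = np.linalg.cond(K + lam * np.eye(len(X)))
    print(f"degree {p:2d}: cond(K + lambda*I) ~ {cond:.2e}")
```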

The Gaussian RBF Kernel: Infinite Feature Space

The Gaussian kernel

$$k(x, x') = \exp\!\left(-\frac{\|x - x'\|^2}{2\sigma^2}\right)$$

also corresponds to an inner product in a feature space, except that the feature space is infinite-dimensional. The cleanest way to see this is via the Taylor expansion of the exponential. Expand $-\|x - x'\|^2/(2\sigma^2)$ as $-\|x\|^2/(2\sigma^2) - \|x'\|^2/(2\sigma^2) + x^\top x'/\sigma^2$ and factor:

$$k(x, x') = e^{-\|x\|^2/(2\sigma^2)}\, e^{-\|x'\|^2/(2\sigma^2)} \cdot e^{x^\top x'/\sigma^2}.$$

Now Taylor-expand the second exponential:

$$e^{x^\top x'/\sigma^2} = \sum_{m=0}^\infty \frac{(x^\top x')^m}{m! \, \sigma^{2m}} = \sum_{m=0}^\infty \frac{1}{m! \, \sigma^{2m}} \langle \phi_m(x), \phi_m(x') \rangle$$

where $\phi_m$ is the degree-$m$ polynomial feature map from the previous section. So

$$k(x, x') = \left\langle e^{-\|x\|^2/(2\sigma^2)} \begin{pmatrix} \phi_0(x)/(\sigma^0\sqrt{0!}) \\ \phi_1(x)/(\sigma^1\sqrt{1!}) \\ \phi_2(x)/(\sigma^2\sqrt{2!}) \\ \vdots \end{pmatrix}, \;\; e^{-\|x'\|^2/(2\sigma^2)} \begin{pmatrix} \phi_0(x')/(\sigma^0\sqrt{0!}) \\ \phi_1(x')/(\sigma^1\sqrt{1!}) \\ \phi_2(x')/(\sigma^2\sqrt{2!}) \\ \vdots \end{pmatrix} \right\rangle.$$

The feature vector is infinite-dimensional, with one block per polynomial degree: the degree-$m$ block is scaled by $1/(\sigma^m\sqrt{m!})$ and the whole vector is modulated by the Gaussian envelope. Computing this feature vector explicitly is impossible; computing the kernel takes $O(d)$ operations.
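The expansion can be checked numerically by truncating the sum after $M$ degree blocks; this minimal sketch (assuming NumPy) shows the truncated feature-space inner product converging to the exact kernel value as $M$ grows.

```python
import math
import numpy as np

def rbf(x, xp, sigma=1.0):
    return np.exp(-np.sum((x - xp) ** 2) / (2 * sigma ** 2))

def rbf_truncated(x, xp, sigma=1.0, M=10):
    """Keep only the first M polynomial-degree blocks of the feature expansion."""
    envelope = np.exp(-(x @ x + xp @ xp) / (2 * sigma ** 2))
    series = sum((x @ xp) ** m / (math.factorial(m) * sigma ** (2 * m)) for m in range(M))
    return envelope * series

x = np.array([0.3, -0.7, 1.1])
xp = np.array([0.5, 0.2, -0.4])
print("exact:", rbf(x, xp))
for M in (2, 4, 8, 16):
    print(f"M = {M:2d}:", rbf_truncated(x, xp, M=M))   # converges to the exact value
```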

The Gaussian RBF is called universal (Steinwart 2001): its associated RKHS is dense in $C(\mathcal X)$ for any compact $\mathcal X$. So linear methods in the Gaussian RBF feature space can approximate any continuous function uniformly. This is the theoretical reason RBF SVMs are the default choice when you don't know much about your data.

Mercer's Condition

For a function $k(x, x')$ to be a valid kernel (an inner product in some feature space), it must be symmetric and positive semi-definite.

Theorem

Mercer's Condition for Validity of a Kernel

Statement

A symmetric function $k: \mathcal X \times \mathcal X \to \mathbb R$ is a kernel (i.e. there exists a feature map $\phi$ with $k(x, x') = \langle \phi(x), \phi(x')\rangle$) if and only if it is positive semi-definite: for every finite collection of points $x_1, \dots, x_n$ and coefficients $\alpha_1, \dots, \alpha_n \in \mathbb R$,

$$\sum_{i,j=1}^n \alpha_i \alpha_j k(x_i, x_j) \;\ge\; 0.$$

Equivalently, the Gram matrix $K \in \mathbb R^{n \times n}$ with entries $K_{ij} = k(x_i, x_j)$ is positive semi-definite for every finite sample.

Intuition

If $k(x, x') = \phi(x)^\top \phi(x')$, then for any $\alpha$, $\sum_{ij} \alpha_i \alpha_j k(x_i, x_j) = \sum_{ij} \alpha_i \alpha_j \phi(x_i)^\top \phi(x_j) = \big\|\sum_i \alpha_i \phi(x_i)\big\|^2 \ge 0$. The PSD property is automatic for inner products. Mercer's theorem says the converse: every symmetric PSD function arises as an inner product in some Hilbert space (the RKHS).

Proof Sketch

The forward direction (kernel implies PSD) follows from the inner-product factorization. The reverse direction is Mercer's theorem: a continuous PSD kernel $k$ on a compact domain has a spectral decomposition $k(x, x') = \sum_j \lambda_j \phi_j(x) \phi_j(x')$ with $\lambda_j \ge 0$ and orthonormal $\phi_j$. The feature map is $\phi(x) = (\sqrt{\lambda_1}\, \phi_1(x), \sqrt{\lambda_2}\, \phi_2(x), \dots)$, and the Hilbert space of square-summable sequences gives the feature space. The full proof uses the spectral theorem for compact self-adjoint operators; see kernels and RKHS for details.

Why It Matters

Mercer's condition is the practical test for whether a similarity function you've defined can be plugged into a kernel SVM, kernel ridge regression, or any other kernel method. Many domain-specific similarities (string kernels, graph kernels, sequence-alignment kernels) fail PSD-ness, at which point the algorithm becomes ill-defined: the kernel SVM dual may have no solution, or kernel PCA may have negative eigenvalues that the method can't interpret. PSD-ness is the floor below which kernel methods break.

Failure Mode

The hyperbolic tangent kernel $k(x, x') = \tanh(\kappa\, x^\top x' + c)$ is not PSD for all $\kappa, c$. Many SVM libraries accept it anyway, because the resulting optimization is still well-defined (the SVM is solved via a QP that does not strictly require PSD-ness of $K$). The downside is that the optimization can fail or return solutions that don't generalize. The "sigmoid kernel" works in practice for some problems but loses the theoretical guarantees that PSD kernels carry.
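A small empirical check (the parameter choices $\kappa = 2$, $c = -1$ are illustrative, and the exact outcome depends on the sample): the smallest eigenvalue of a sigmoid-kernel Gram matrix is typically negative on random data, while an RBF Gram matrix stays PSD up to round-off.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))

def min_eigenvalue(kernel, X):
    K = np.array([[kernel(a, b) for b in X] for a in X])
    return np.linalg.eigvalsh(K).min()

sigmoid = lambda a, b: np.tanh(2.0 * (a @ b) - 1.0)          # illustrative kappa, c
gauss = lambda a, b: np.exp(-np.sum((a - b) ** 2) / 2.0)

print("sigmoid kernel, smallest eigenvalue:", min_eigenvalue(sigmoid, X))  # typically < 0
print("RBF kernel,     smallest eigenvalue:", min_eigenvalue(gauss, X))    # >= 0 up to round-off
```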

Four Canonical Applications

The same trick, four different host algorithms:

Kernel SVM

The SVM dual depends on data only through dot products $x_i^\top x_j$:

$$\max_\alpha \sum_i \alpha_i - \tfrac12 \sum_{ij} \alpha_i \alpha_j y_i y_j \, x_i^\top x_j \quad \text{s.t.} \quad \alpha_i \ge 0, \;\; \sum_i \alpha_i y_i = 0.$$

Replace $x_i^\top x_j$ with $k(x_i, x_j)$, solve the QP, and predict via $\hat f(x) = \sum_i \alpha_i y_i k(x_i, x) + b$. The result is a nonlinear classifier: a polynomial decision boundary with the polynomial kernel, Gaussian "bumps" with the RBF kernel, and so on.
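A minimal sketch using scikit-learn (assuming it is installed; the `gamma` and `C` values are illustrative): the built-in RBF kernel, and the equivalent route through an explicitly precomputed Gram matrix, which is how a custom Mercer kernel would be plugged in.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.1, random_state=0)

# Built-in RBF kernel
clf = SVC(kernel="rbf", gamma=2.0, C=1.0).fit(X, y)
print("RBF SVM train accuracy:", clf.score(X, y))

# Same model via an explicit Gram matrix: kernel="precomputed" accepts any Mercer kernel
def rbf_gram(A, B, gamma=2.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

clf_pre = SVC(kernel="precomputed", C=1.0).fit(rbf_gram(X, X), y)
print("precomputed train accuracy:", clf_pre.score(rbf_gram(X, X), y))
```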

Kernel Ridge Regression

Ridge regression solves $\min_w \|y - Xw\|^2 + \lambda \|w\|^2$, with closed-form solution $\hat w = (X^\top X + \lambda I)^{-1} X^\top y$ and prediction $\hat y_* = x_*^\top \hat w$. By the matrix inversion lemma, this can be rewritten using only kernel evaluations:

$$\hat y_* = k_*^\top (K + \lambda I)^{-1} y,$$

where $K_{ij} = k(x_i, x_j)$ and $(k_*)_i = k(x_i, x_*)$. This is equivalent to ridge regression in the feature space defined by $k$. The Bayesian view generalizes this to a full posterior, leading to Gaussian processes.
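Kernel ridge regression is short enough to write directly from the formula above; a minimal NumPy sketch on a toy 1-D regression problem (the RBF bandwidth and $\lambda$ are illustrative choices):

```python
import numpy as np

def rbf_gram(A, B, sigma=0.5):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(80, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=80)

lam = 1e-2
K = rbf_gram(X, X)
alpha = np.linalg.solve(K + lam * np.eye(len(X)), y)   # (K + lambda I)^{-1} y

X_test = np.linspace(-3, 3, 7).reshape(-1, 1)
y_pred = rbf_gram(X_test, X) @ alpha                   # k_*^T (K + lambda I)^{-1} y
print(np.column_stack([X_test[:, 0], y_pred, np.sin(X_test[:, 0])]))
```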

Kernel PCA

Standard PCA finds the top eigenvectors of $X^\top X / n$. Kernel PCA finds the top eigenvectors of the kernel matrix $K$ (after centering). The principal components are nonlinear functions of $x$, useful for nonlinear dimensionality reduction and as a preprocessing step for clustering or visualization. The trick is the same: kernel matrix replaces covariance matrix.
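A minimal sketch of kernel PCA from the Gram matrix alone, assuming NumPy: double-center $K$, eigendecompose, and scale the top eigenvectors by $\sqrt{\lambda}$ to get feature-space projections.

```python
import numpy as np

def kernel_pca(K, n_components=2):
    """Feature-space PCA given only the (uncentered) Gram matrix K."""
    n = K.shape[0]
    one = np.full((n, n), 1.0 / n)
    Kc = K - one @ K - K @ one + one @ K @ one          # double-center K
    eigval, eigvec = np.linalg.eigh(Kc)                 # ascending eigenvalues
    idx = np.argsort(eigval)[::-1][:n_components]
    # Projection of phi(x_i) onto component j is sqrt(lambda_j) * v_j[i]
    return eigvec[:, idx] * np.sqrt(np.maximum(eigval[idx], 0.0))

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
K = np.exp(-((X[:, None, :] - X[None, :, :]) ** 2).sum(-1) / 2.0)
Z = kernel_pca(K, n_components=2)
print(Z.shape)   # (100, 2): a nonlinear 2-D embedding
```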

Kernel k-Means

K-means clusters by minimizing the within-cluster sum of squared distances. In feature space, $\|\phi(x_i) - \phi(x_j)\|^2 = k(x_i, x_i) + k(x_j, x_j) - 2 k(x_i, x_j)$ depends only on kernel values, and point-to-centroid distances expand the same way, so k-means can be run entirely on the kernel matrix. This is closely related to spectral clustering, which clusters the eigenvectors of the (normalized) kernel matrix instead of the raw data. With an RBF kernel, this lets k-means discover non-convex cluster boundaries that ordinary k-means cannot.
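A minimal kernel k-means sketch (Lloyd iterations written entirely against the Gram matrix; it is initialization-sensitive, so restarts help in practice). The point-to-centroid distance used below expands just like the pairwise distance above.

```python
import numpy as np

def kernel_kmeans(K, n_clusters=2, n_iter=50, seed=0):
    """Lloyd-style k-means written entirely against the Gram matrix K."""
    n = K.shape[0]
    labels = np.random.default_rng(seed).integers(n_clusters, size=n)
    for _ in range(n_iter):
        dist = np.full((n, n_clusters), np.inf)
        for c in range(n_clusters):
            mask = labels == c
            m = mask.sum()
            if m == 0:
                continue
            # ||phi(x_i) - mu_c||^2 = K_ii - (2/m) sum_{j in c} K_ij + (1/m^2) sum_{j,l in c} K_jl
            dist[:, c] = (np.diag(K)
                          - 2.0 / m * K[:, mask].sum(axis=1)
                          + K[np.ix_(mask, mask)].sum() / m ** 2)
        new_labels = dist.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels

# Two concentric rings; with an RBF kernel the clusters can follow the rings
# (bandwidth and initialization matter, so restarts help in practice)
theta = np.linspace(0, 2 * np.pi, 100, endpoint=False)
X = np.vstack([np.c_[np.cos(theta), np.sin(theta)],
               3 * np.c_[np.cos(theta), np.sin(theta)]])
K = np.exp(-((X[:, None, :] - X[None, :, :]) ** 2).sum(-1) / (2 * 0.5 ** 2))
print(np.bincount(kernel_kmeans(K, n_clusters=2), minlength=2))
```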

When the Kernel Trick Stops Helping

The kernel trick converts the cost of working in feature space (linear in $D$) into the cost of working with the Gram matrix. For large datasets ($n \gtrsim 10^5$), the $O(n^2)$ memory to form $K$ and the $O(n^3)$ time to factor it become prohibitive, exactly the regime where deep learning's $O(n)$-per-epoch SGD wins. The trick is a great move when $d$ (or $D$) is large and $n$ is moderate; it becomes a burden when $n$ is large.

Practical fixes:

  • Nyström approximation: approximate $K$ by a low-rank matrix built from a subsample of $m \ll n$ "landmark" points. Cost drops to $O(n m^2)$.
  • Random Fourier features (Rahimi & Recht 2007): approximate $k(x, x') \approx z(x)^\top z(x')$ with an explicit $D'$-dimensional random feature map $z$ (see the sketch after this list). Cost drops back to $O(n D')$ but you give up exactness.
  • Inducing points and sparse GPs: the GP analog of Nyström.
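A minimal sketch of random Fourier features for the Gaussian RBF kernel (the Rahimi & Recht 2007 construction, written here with illustrative parameters): sample frequencies from the kernel's Fourier transform and random phases, then take cosines.

```python
import numpy as np

def random_fourier_features(X, n_features=2000, sigma=1.0, seed=0):
    """z(x) with E[z(x)^T z(x')] = exp(-||x - x'||^2 / (2 sigma^2))."""
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=1.0 / sigma, size=(X.shape[1], n_features))  # omega ~ N(0, sigma^-2 I)
    b = rng.uniform(0.0, 2 * np.pi, size=n_features)                  # random phases
    return np.sqrt(2.0 / n_features) * np.cos(X @ W + b)

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
Z = random_fourier_features(X, n_features=2000, sigma=1.0)

K_exact = np.exp(-((X[:, None, :] - X[None, :, :]) ** 2).sum(-1) / 2.0)
K_approx = Z @ Z.T
print("max abs error:", np.abs(K_exact - K_approx).max())   # shrinks roughly as 1/sqrt(D')
```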

The fact that finite-dimensional random feature approximations of the Gaussian RBF kernel work well in practice is what unifies kernel methods with two-layer neural networks (Rahimi-Recht 2008) and motivates the neural tangent kernel story for wide networks.

Common Confusions

Watch Out

The kernel trick is not the same as the RKHS theory

The kernel trick is a computational move: replace inner products with kernel evaluations. The RKHS theory is the theoretical framework that explains why this works: every PSD kernel defines a unique Hilbert space of functions, with the representer theorem guaranteeing that optimal predictors live in a finite-dimensional subspace spanned by the training points. You can use the trick without knowing the theory (most practitioners do), but the theory is what guarantees the trick generalizes, gives risk bounds, and connects kernels to Gaussian processes, neural tangent kernels, and approximation theory.

Watch Out

A bigger feature space is not always better

The Gaussian RBF lifts to infinite features and is "universal," but that does not make it the right choice for every problem. A kernel that lifts too aggressively can overfit by representing too many functions; a kernel that lifts too conservatively can underfit. The right kernel matches the geometry of the data, and is often something simple (linear for high-dimensional sparse data like text, polynomial for low-dimensional structured data, RBF as the default for everything else). The bandwidth $\sigma$ of the RBF kernel is a regularizer in disguise.

Watch Out

The kernel trick costs O(n^2) in space and O(n^3) in time

The implicit-feature gain is real, but it comes paired with a Gram-matrix cost. For $n = 10^5$, the Gram matrix has $10^{10}$ entries (tens of gigabytes in floats), and inverting it takes on the order of $10^{15}$ floating-point operations. This is why kernel methods lost ground to deep learning around $n = 10^6$: the kernel trick scales the wrong way in $n$, while SGD on a neural network scales linearly. Random Fourier features and Nyström approximation rescue moderately large kernel problems but never restore the linear-in-$n$ scaling of SGD-based methods.

Watch Out

Not every similarity function is a kernel

"Cosine similarity is a kernel" is true (k(x,x)=xx/xxk(x, x') = x^\top x' / \|x\|\|x'\| is symmetric PSD). "Manhattan-distance similarity is a kernel" is false (k(x,x)=xx1k(x, x') = -\|x - x'\|_1 violates PSD-ness in general). When designing custom similarities, check Mercer's condition: enumerate small Gram matrices and verify PSD-ness, or construct kk from known-PSD building blocks (sums, products, and positive scalar multiples of PSD kernels are PSD; pre-composition with any function is PSD). Verifying PSD-ness on a small test set before plugging into a kernel method saves debugging time downstream.

Summary

  • The kernel trick replaces $\langle x_i, x_j\rangle$ with $k(x_i, x_j)$ in any inner-product-based algorithm, implicitly lifting data to a (possibly infinite-dimensional) feature space.
  • Worked example: $k(x, x') = (x^\top x')^2 = \phi(x)^\top \phi(x')$ with $\phi(x) = (x_1^2, \sqrt 2 \, x_1 x_2, x_2^2)$. The $\sqrt 2$ factor compensates for the cross-term coefficient in the expansion.
  • The polynomial kernel of degree $p$ in $\mathbb R^d$ implicitly uses $\binom{d+p}{p}$ features.
  • The Gaussian RBF kernel implicitly uses infinitely many features (Taylor expansion of $\exp$).
  • Mercer's condition says $k$ is a kernel iff it is symmetric and positive semi-definite; equivalently, every Gram matrix is PSD.
  • Canonical applications: SVM, ridge regression, PCA, k-means.
  • The trick is great for moderate $n$ but scales as $O(n^3)$ in time; for large $n$, random feature maps or deep networks win.

Exercises

ExerciseCore

Problem

For the polynomial kernel $k(x, x') = (x^\top x' + 1)^2$ on $\mathbb R^2$, construct the feature map $\phi$ explicitly and verify $k(x, x') = \phi(x)^\top \phi(x')$ by direct expansion.

ExerciseCore

Problem

Show that the kernel sum $k(x, x') = k_1(x, x') + k_2(x, x')$ and the kernel product $k(x, x') = k_1(x, x') \cdot k_2(x, x')$ are both valid kernels whenever $k_1, k_2$ are valid kernels.

ExerciseAdvanced

Problem

Show that the Gaussian RBF kernel $k(x, x') = \exp(-\|x - x'\|^2 / (2\sigma^2))$ is positive semi-definite. (You may use the fact that the Fourier transform of a PSD function on $\mathbb R^d$ is a non-negative function, or construct a feature map directly via the Taylor expansion.)

ExerciseResearch

Problem

The random Fourier features method (Rahimi & Recht 2007) approximates the Gaussian RBF kernel by an explicit finite-dimensional feature map. Sketch the construction: how do you choose the random features so that $\mathbb E[z(x)^\top z(x')] = k(x, x')$? What is the convergence rate as a function of the number of features $D'$?

References

Canonical:

  • Schölkopf, B., & Smola, A.J. (2002). Learning with Kernels. MIT Press. Ch. 2 (the kernel trick and its computational consequences).
  • Aizerman, M.A., Braverman, E.M., & Rozonoer, L.I. (1964). "Theoretical foundations of the potential function method in pattern recognition learning." Automation and Remote Control, 25:821–837. (The original kernel-trick paper, decades before SVM.)
  • Boser, B.E., Guyon, I.M., & Vapnik, V.N. (1992). "A Training Algorithm for Optimal Margin Classifiers." COLT, 144–152. (The first kernel SVM.)
  • Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning, 2nd ed. Springer. Ch. 12.

Current:

  • Shalev-Shwartz, S., & Ben-David, S. (2014). Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press. Ch. 16 (kernel methods).
  • Murphy, K.P. (2022). Probabilistic Machine Learning: An Introduction. MIT Press. Ch. 17 (kernels and Gaussian processes).
  • Rahimi, A., & Recht, B. (2007). "Random Features for Large-Scale Kernel Machines." NeurIPS, 1177–1184. (Random Fourier features; the bridge from kernels to neural networks.)
  • Rasmussen, C.E., & Williams, C.K.I. (2006). Gaussian Processes for Machine Learning. MIT Press. Ch. 4 (covariance functions / kernels).

Next Topics

  • Kernels and RKHS: the theoretical framework that makes the kernel trick rigorous — covering Mercer's theorem, the reproducing property, and the representer theorem.
  • Gaussian processes for ML: the Bayesian counterpart of kernel ridge regression with full posterior over functions.
  • Bayesian linear regression: the finite-feature, fully derived posterior that the kernel trick lifts into infinite dimensions.

Last reviewed: May 10, 2026
