

Stochastic Processes for ML

Random processes as the unifying framework behind Gaussian processes, neural tangent kernels, empirical processes, and martingale concentration: definitions, sample-path regularity, and the bridges between learning theory's main probabilistic tools.

Advanced · Tier 2 · Stable · Supporting · ~75 min

Why This Matters

A surprising amount of modern learning theory is one branch of stochastic-process theory under different names. Gaussian processes are random functions indexed by inputs. Empirical processes are random functions indexed by hypotheses. The neural tangent kernel is the Gaussian process limit of an infinitely wide neural network at initialization. McDiarmid's inequality and SGD convergence both rely on martingale concentration.

These look like separate topics until you see the underlying object: a random function $X = (X_t)_{t \in T}$ on an index set $T$, with covariance $k(s, t)$ controlling the geometry of the process. Once you internalize that, results transfer between the views without re-deriving them.

Mental Model

A stochastic process is a probability distribution over the function space $\mathbb{R}^T$. You can take three different perspectives on the same object, and learning theory uses all three:

  • Pointwise (marginal): at each fixed $t \in T$, $X_t$ is a real-valued random variable. Most concentration tools (Hoeffding, Bernstein) apply here.
  • Path-wise (sample function): a single draw is a function $\omega \mapsto (t \mapsto X_t(\omega))$. Smoothness, continuity, and Hölder regularity apply here.
  • Spectral (covariance kernel): the second-order structure $k(s, t) = \mathbb{E}[X_s X_t]$ encodes everything for Gaussian processes and most of what matters for learning theory.

Switching between these views is the core skill. The empirical-process supremum $\sup_{f \in \mathcal{F}} G_n(f)$ is path-wise; bounding it via Dudley's integral uses the spectral view (covering numbers in the canonical metric); showing the bound is nearly tight uses Sudakov's minoration, a lower bound built from packing in the canonical metric.
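
To make the three views concrete, here is a minimal NumPy sketch; the RBF kernel, lengthscale, grid, and jitter are arbitrary illustrative choices. The kernel matrix is the spectral view, a Cholesky draw is one sample path, and the marginal variance at a fixed index is the pointwise view.

```python
# Three views of one Gaussian process: a minimal sketch.
# Assumed choices: RBF kernel, lengthscale 0.2, 200-point grid on [0, 1], small jitter.
import numpy as np

rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 200)                                 # discretized index set T
ell = 0.2
K = np.exp(-(t[:, None] - t[None, :]) ** 2 / (2 * ell**2))     # spectral view: kernel matrix

L = np.linalg.cholesky(K + 1e-6 * np.eye(len(t)))              # jitter for numerical PSD
path = L @ rng.standard_normal(len(t))                         # path-wise view: one draw t -> X_t

i, j = 50, 60
print("pointwise view: X_t at fixed t is N(0, k(t,t)) with k(t,t) =", K[i, i])
print("canonical metric d(t_i, t_j) =", np.sqrt(K[i, i] - 2 * K[i, j] + K[j, j]))
print("a few sample-path values:", path[i : i + 3])
```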

Formal Setup

Definition

Stochastic Process

A stochastic process indexed by a set $T$ on a probability space $(\Omega, \mathcal{F}, P)$ is a collection of random variables $\{X_t : t \in T\}$, each measurable into a state space (here always $\mathbb{R}$). When $T = \mathbb{R}_{\geq 0}$ the process is in continuous time; when $T \subset \mathbb{R}^d$ the process is a random field; when $T = \mathcal{F}$ is a function class, the process is an empirical process.

Definition

Covariance Kernel

For a centered process ($\mathbb{E}[X_t] = 0$), the covariance kernel is:

$$k(s, t) = \mathbb{E}[X_s X_t]$$

The kernel is symmetric ($k(s, t) = k(t, s)$) and positive semidefinite (any $n \times n$ Gram matrix $[k(t_i, t_j)]$ is PSD). Conversely, every PSD kernel arises as the covariance of some stochastic process: take the centered Gaussian process with that covariance, which exists by the Kolmogorov extension theorem (Mercer's and Bochner's theorems give the spectral side of the same story).

The canonical metric induced by the process is:

$$d(s, t) = \sqrt{\mathbb{E}[(X_s - X_t)^2]} = \sqrt{k(s,s) - 2k(s,t) + k(t,t)}$$

For a centered process, the canonical metric depends only on $k$.

Definition

Gaussian Process

A stochastic process $(X_t)_{t \in T}$ is a Gaussian process with mean zero and covariance $k$ if for any finite collection $t_1, \ldots, t_n \in T$, the random vector $(X_{t_1}, \ldots, X_{t_n})$ is jointly Gaussian with covariance matrix $K_{ij} = k(t_i, t_j)$.

By Kolmogorov's extension theorem, any PSD kernel $k$ on $T \times T$ defines a Gaussian process; the existence is automatic. The hard question is sample-path regularity: when is a draw from this process continuous, differentiable, or Hölder?

Sample-Path Regularity

A Gaussian process exists abstractly given a PSD kernel. But whether you can realize it with continuous sample paths is a separate question, controlled by how fast the canonical metric grows.

Theorem

Kolmogorov Continuity Criterion

Statement

Let $(X_t)_{t \in [0,1]^d}$ be a stochastic process such that for some $\alpha, \beta, C > 0$:

$$\mathbb{E}[|X_s - X_t|^\alpha] \leq C \|s - t\|^{d + \beta} \quad \text{for all } s, t \in [0,1]^d$$

Then $X$ admits a modification $\tilde{X}$ (i.e., $\tilde{X}_t = X_t$ a.s. for each $t$) with sample paths that are locally $\gamma$-Hölder continuous for every $\gamma < \beta/\alpha$.

Source: Karatzas-Shreve (1991), Theorem 2.2.8; Stroock-Varadhan (1979), Theorem 2.1.6.

Intuition

The criterion translates a moment bound on increments into a path-regularity guarantee. Brownian motion has Gaussian increments with $\mathbb{E}[|B_s - B_t|^\alpha] \asymp |s - t|^{\alpha/2}$ for every $\alpha$, so taking $\alpha$ large gives continuity for every $\gamma < 1/2$: Brownian paths are $(1/2 - \epsilon)$-Hölder for every $\epsilon > 0$, and not better (this exponent is sharp).

For a Gaussian process with kernel satisfying $k(s, s) - 2 k(s, t) + k(t,t) \leq C \|s - t\|^{2\beta}$, all moments $\alpha$ are controlled by the variance, and Kolmogorov gives continuity with Hölder exponent up to $\beta$.

Proof Sketch

Discretize $[0,1]^d$ into a dyadic grid at scale $2^{-n}$. The number of grid points is $2^{nd}$, and each adjacent pair has increment moment bounded by $C \cdot 2^{-n(d+\beta)}$. Markov plus a union bound gives a uniform increment bound at each scale, summable in $n$ for $\gamma < \beta/\alpha$. The resulting series is finite a.s., and the partial sums define a continuous modification.

Why It Matters

This is the bridge from spectral information (a kernel decay rate) to path-wise information (Hölder regularity of GP samples). For the squared-exponential kernel $k(s, t) = \exp(-\|s-t\|^2/(2\ell^2))$, the canonical metric satisfies $d(s, t) \asymp \|s - t\|/\ell$, so Kolmogorov certifies every Hölder exponent below 1; in fact samples are smooth. For the Matérn kernel with smoothness $\nu$, samples are exactly $(\nu - \epsilon)$-Hölder for every $\epsilon > 0$ (when $\nu \leq 1$; larger $\nu$ buys differentiability instead).

This determines whether GP regression with a given kernel can recover functions in a corresponding Hölder class, and at what rate.

Failure Mode

The criterion is sufficient, not necessary. A process can have continuous paths without satisfying the moment bound: $X_t = tZ$ with $Z$ Cauchy has Lipschitz sample paths but no finite increment moments of order $\alpha \geq 1$. Conversely, failing the bound does not show that paths are discontinuous; it only means this route to a continuous modification is unavailable. For non-Gaussian processes with heavy-tailed increments, the criterion can certify much weaker continuity than is actually true.

Brownian Motion: The Canonical Gaussian Process

Brownian motion (the Wiener process) is the prototypical Gaussian process: it has the simplest possible kernel and yet exhibits all the path-regularity phenomena that complicate general processes.

A standard Brownian motion $(B_t)_{t \geq 0}$ is the centered Gaussian process with covariance:

$$k_{\text{BM}}(s, t) = \min(s, t)$$

The canonical metric is $d(s, t) = \sqrt{|s - t|}$, growing as the square root of $|s - t|$. By Kolmogorov continuity, $B_t$ has a continuous modification with paths that are $(1/2 - \epsilon)$-Hölder.

Why this is canonical. Many stochastic processes used in ML can be written as functionals of Brownian motion:

  • Ornstein-Uhlenbeck process: $X_t = \int_{-\infty}^{t} e^{-\theta(t-s)}\, dB_s$ (equivalently, the solution of $dX_t = -\theta X_t\, dt + dB_t$ started from its stationary law) is a stationary GP with exponential kernel $k(s, t) = (1/(2\theta))\, e^{-\theta|s - t|}$. This is the Matérn-1/2 kernel and the noise process used in score-matching diffusion models.
  • Brownian bridge: $X_t = B_t - t B_1$ on $[0, 1]$ has kernel $k(s, t) = \min(s, t) - st$ and is the limit process for empirical CDFs (Donsker's theorem).
  • Fractional Brownian motion: the GP with $k_H(s, t) = \frac{1}{2}(|s|^{2H} + |t|^{2H} - |s - t|^{2H})$ for $H \in (0, 1)$. Reduces to Brownian motion at $H = 1/2$.

Quadratic variation. For Brownian motion, $\sum_i (B_{t_i} - B_{t_{i-1}})^2 \to t$ in $L^2$ as the partition mesh $\to 0$. This is why Itô calculus has a second-order correction term: ordinary Riemann integration fails because $dB_t$ has nontrivial variance even at infinitesimal scale.
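
A minimal simulation sketch of these two facts, assuming a uniform partition of $[0,1]$ (the step count and seed are arbitrary): the increments have standard deviation $\sqrt{dt}$, and the quadratic variation over $[0, t]$ is close to $t$.

```python
# Brownian motion on [0, 1]: increments scale like sqrt(dt), quadratic variation ~ t.
# Assumed choices: 100,000 uniform steps and a fixed seed.
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
dt = 1.0 / n
dB = np.sqrt(dt) * rng.standard_normal(n)      # independent N(0, dt) increments
B = np.cumsum(dB)                              # the sample path B_t on the grid

print("std of one increment:", dB.std(), "vs sqrt(dt) =", np.sqrt(dt))
print("quadratic variation over [0, 1]  :", np.sum(dB**2))              # close to 1
print("quadratic variation over [0, 1/2]:", np.sum(dB[: n // 2] ** 2))  # close to 0.5
```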

Connection to Empirical Processes

An empirical process $G_n(f) = \sqrt{n}(P_n f - P f)$ over a function class $\mathcal{F}$, where $P_n f = \frac{1}{n}\sum_{i=1}^n f(X_i)$ and $Pf = \mathbb{E}[f(X_1)]$, is a stochastic process indexed by $\mathcal{F}$. The covariance kernel is:

$$k(f, g) = \mathrm{Cov}(f(X), g(X)) = P(fg) - (Pf)(Pg)$$

The canonical metric is $d(f, g) = \sqrt{\mathrm{Var}(f(X) - g(X))}$, which is the $L_2(P)$ distance $\|f - g\|_{L_2(P)}$ up to centering.
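
A quick Monte Carlo check of this metric for two indicator functions under $P = \mathrm{Uniform}[0,1]$ (the functions and sample size are illustrative choices); it also shows the small gap between the centered and uncentered $L_2(P)$ distances.

```python
# Canonical metric of the empirical process for two indicators under P = Uniform[0, 1].
# Assumed choices: f = 1{x <= 0.3}, g = 1{x <= 0.5}, one million Monte Carlo samples.
import numpy as np

rng = np.random.default_rng(2)
X = rng.uniform(size=1_000_000)
f = (X <= 0.3).astype(float)
g = (X <= 0.5).astype(float)

print("canonical metric sqrt(Var(f - g)) ~", np.sqrt(np.var(f - g)))          # exact: 0.4
print("uncentered L2(P) distance         ~", np.sqrt(np.mean((f - g) ** 2)))  # exact: sqrt(0.2)
```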

Under regularity conditions on $\mathcal{F}$, the empirical process converges to a Gaussian process indexed by $\mathcal{F}$ with this covariance:

Theorem

Donsker's Theorem (Functional CLT)

Statement

Let $\mathcal{F}$ be a Donsker class for $P$. Then the empirical process $G_n$ converges in distribution to a centered Gaussian process $G_P$ indexed by $\mathcal{F}$ with covariance $\mathrm{Cov}(G_P(f), G_P(g)) = P(fg) - (Pf)(Pg)$:

$$G_n \rightsquigarrow G_P \quad \text{in } \ell^\infty(\mathcal{F})$$

Source: van der Vaart-Wellner (1996), Theorem 2.5.2; the classical case (empirical CDF, $\mathcal{F} = \{\mathbf{1}_{(-\infty, t]}\}$) is the original Donsker theorem (1952), with $G_P$ the Brownian bridge.

Intuition

This is the function-space analog of the central limit theorem. For the empirical CDF on the real line, the limit process is the (transformed) Brownian bridge, with covariance $k(s, t) = \min(F(s), F(t)) - F(s) F(t)$. For richer classes (VC classes, $L_2$-bounded RKHS balls), Donsker's theorem requires the class to be suitably measurable and to have a finite uniform entropy integral.

The practical consequence: bounding the finite-sample supremum of $G_n$ via chaining is essentially the same as bounding the supremum of the limit Gaussian process $G_P$. Dudley's entropy integral works for both.

Proof Sketch

The proof has three pieces: (1) finite-dimensional convergence by the classical multivariate CLT applied to any finite collection $(G_n(f_1), \ldots, G_n(f_k))$; (2) uniform asymptotic equicontinuity, which follows from finite uniform entropy (the Donsker class condition); (3) tightness in $\ell^\infty(\mathcal{F})$ via the equicontinuity, by an Arzelà-Ascoli style argument. Putting the pieces together gives weak convergence in the function-space sense.

Why It Matters

Donsker's theorem is the reason that learning theory results derived for Gaussian processes (Sudakov-Fernique, generic chaining, Borell-TIS concentration) transfer to empirical processes. The two are the same object in the limit, modulo finite-sample corrections.

Concretely, this is what licenses using $\sqrt{n}$-rate Gaussian-process heuristics in finite-sample analysis. Bracketing entropy (van der Vaart-Wellner, Sec. 2.7) gives a complementary sufficient condition that covers parametric and many smooth classes.

Failure Mode

Donsker's theorem requires the class to be Donsker, which excludes sufficiently rich classes (e.g., the class of all measurable indicator functions, or VC classes of dimension growing with $n$). For non-Donsker classes, you fall back to direct chaining bounds without an asymptotic GP limit.

Neural Networks as Gaussian Processes

The neural-network-Gaussian-process correspondence is one of the cleanest applications of stochastic-process theory to deep learning. At initialization, an infinitely wide neural network with i.i.d. Gaussian weights is a Gaussian process, with a kernel computable in closed form.

Theorem

Neural Network Gaussian Process Correspondence

Statement

Let $f^{(L)}(x; \theta) = W^{(L)} \phi(W^{(L-1)} \phi(\cdots \phi(W^{(1)} x + b^{(1)}) \cdots) + b^{(L-1)}) + b^{(L)}$ be a fully-connected network with independent Gaussian initialization $W^{(l)}_{ij} \sim \mathcal{N}(0, \sigma_w^2/m_{l-1})$ and $b^{(l)}_i \sim \mathcal{N}(0, \sigma_b^2)$, where $m_{l-1}$ is the width of layer $l-1$ and $m_0 = d_0$ is the input dimension. As all hidden widths $m_l \to \infty$ jointly, the network output converges in distribution (over the random initialization) to a centered Gaussian process $f^{(L)} \sim \mathcal{GP}(0, k^{(L)})$, where the NNGP kernel $k^{(L)}$ is defined recursively by:

$$k^{(0)}(x, x') = \sigma_w^2 \cdot \frac{x^\top x'}{d_0} + \sigma_b^2$$

$$k^{(l)}(x, x') = \sigma_w^2 \cdot \mathbb{E}_{(u, v) \sim \mathcal{N}(0, K^{(l-1)})}\!\bigl[\phi(u) \phi(v)\bigr] + \sigma_b^2$$

where $K^{(l-1)}$ is the $2 \times 2$ covariance matrix $\begin{pmatrix} k^{(l-1)}(x,x) & k^{(l-1)}(x, x') \\ k^{(l-1)}(x',x) & k^{(l-1)}(x',x') \end{pmatrix}$.

Source: Neal (1996) for one hidden layer; Lee, Bahri, Novak, Schoenholz, Pennington, Sohl-Dickstein (2018), "Deep Neural Networks as Gaussian Processes," ICLR; Matthews et al. (2018), "Gaussian Process Behaviour in Wide Deep Neural Networks," ICLR.

Intuition

At each hidden layer, the activations are sums of $m$ i.i.d. terms (one per neuron). The classical CLT says the sum, scaled by $1/\sqrt{m}$, converges to a Gaussian whose variance equals the per-term variance. This is the key step: at infinite width, every layer's preactivations are Gaussian, and the covariance is determined by the previous layer's covariance through the recursion above.

The recursion has a closed form for ReLU (the Cho-Saul 2009 arc-cosine kernel) and can be evaluated analytically or numerically for many other activations; a sketch follows.
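
The sketch below implements the recursion for ReLU using that closed form. The depth, $\sigma_w^2$, $\sigma_b^2$, and the example inputs are illustrative assumptions, not values from the text.

```python
# NNGP kernel recursion for a fully-connected ReLU network, using the Cho-Saul closed form
# E[relu(u) relu(v)] = sqrt(k11 k22) (sin(th) + (pi - th) cos(th)) / (2 pi),  cos(th) = k12 / sqrt(k11 k22).
# Assumed choices: sigma_w^2 = 2, sigma_b^2 = 0.1, depth 3, toy 2-d inputs.
import numpy as np

def relu_expectation(k11, k12, k22):
    """E[relu(u) relu(v)] for (u, v) ~ N(0, [[k11, k12], [k12, k22]])."""
    c = np.clip(k12 / np.sqrt(k11 * k22), -1.0, 1.0)
    theta = np.arccos(c)
    return np.sqrt(k11 * k22) * (np.sin(theta) + (np.pi - theta) * np.cos(theta)) / (2 * np.pi)

def nngp_kernel(x, xp, depth=3, sw2=2.0, sb2=0.1):
    d0 = len(x)
    kxx  = sw2 * np.dot(x,  x)  / d0 + sb2    # k^(0)(x, x)
    kxxp = sw2 * np.dot(x,  xp) / d0 + sb2    # k^(0)(x, x')
    kpp  = sw2 * np.dot(xp, xp) / d0 + sb2    # k^(0)(x', x')
    for _ in range(depth):                    # apply the layer recursion
        kxx, kxxp, kpp = (sw2 * relu_expectation(kxx, kxx,  kxx) + sb2,
                          sw2 * relu_expectation(kxx, kxxp, kpp) + sb2,
                          sw2 * relu_expectation(kpp, kpp,  kpp) + sb2)
    return kxxp

x, xp = np.array([1.0, 0.0]), np.array([0.6, 0.8])
print("k^(3)(x, x') =", nngp_kernel(x, xp))
```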

Proof Sketch

Iterate the central limit theorem layer by layer. At layer $l$, conditional on the layer-$(l-1)$ activations being Gaussian with covariance $K^{(l-1)}$, the preactivations at layer $l$ are sums of $m_{l-1}$ i.i.d. terms. By the multivariate CLT (with vanishing higher moments under the Lipschitz condition), the limit is Gaussian with covariance $\sigma_w^2 \mathbb{E}[\phi(u)\phi(v)] + \sigma_b^2$. The technical core is showing the convergence is uniform over inputs $x, x'$, which requires controlling the $L^2$ continuity of $\phi$ and the boundedness of $K^{(l-1)}$. See Lee et al. (2018) Theorem 1 for the formal statement.

Why It Matters

The NNGP correspondence enables exact Bayesian inference with infinite-width networks: you compute the NNGP kernel for your architecture, then do GP regression with that kernel. The posterior is in closed form.

The closely related neural tangent kernel captures the training dynamics of infinite-width networks: under gradient flow with infinitesimal learning rate, the NTK is constant during training, and the trained network's predictions equal NTK kernel ridge regression. NNGP is for inference at init; NTK is for training trajectories.

This gives a complete theory of infinite-width networks: they are Gaussian processes both before and (in a different sense) after training. The hard open question is what finite width changes, since real networks have features that infinite-width theory cannot model.

Failure Mode

The correspondence holds in the infinite-width limit. At finite width, correction terms appear; the leading correction is $O(1/m)$ for kernel quantities. For modern transformers (where the relevant scaling involves sequence length and depth, not just width), the NNGP correspondence is more delicate; see Yang's "Tensor Programs" series for the modern formulation.

The correspondence does not capture feature learning: in the NNGP/NTK limits, the network's intermediate representations are frozen at their initialization values. Real networks learn features, which is why finite-width theory matters for explaining modern deep learning.

Martingales: Concentration for Adapted Sequences

Empirical processes assume i.i.d. sampling. Many ML problems involve adapted data: the next sample depends on the past (online learning, RL, sequential experiments). Martingales are the right generalization.

Definition

Martingale

A sequence $(M_n)_{n \geq 0}$ adapted to a filtration $(\mathcal{F}_n)_{n \geq 0}$ is a martingale if for all $n \geq 0$:

$$\mathbb{E}[|M_n|] < \infty \quad \text{and} \quad \mathbb{E}[M_{n+1} \mid \mathcal{F}_n] = M_n$$

The increments $D_n = M_n - M_{n-1}$ form a martingale difference sequence (MDS) with $\mathbb{E}[D_n \mid \mathcal{F}_{n-1}] = 0$.

Theorem

Azuma-Hoeffding for Bounded MDS

Statement

Let $(D_n)_{n \geq 1}$ be a martingale difference sequence with $|D_n| \leq b_n$ a.s. for all $n$. Let $S_n = \sum_{i=1}^n D_i$. Then for all $t > 0$:

$$\mathbb{P}(|S_n| \geq t) \leq 2 \exp\!\Bigl(-\frac{t^2}{2 \sum_{i=1}^n b_i^2}\Bigr)$$

Source: Azuma (1967), "Weighted sums of certain dependent random variables," Tohoku Math. J. 19; Hoeffding (1963), Theorem 2.

Intuition

This is Hoeffding's inequality without the independence assumption: the martingale difference structure ($\mathbb{E}[D_i \mid \mathcal{F}_{i-1}] = 0$) replaces independence and gives the same sub-Gaussian tail. The constant $\sum_i b_i^2$ is the worst-case "effective variance bound"; Bernstein-type tightening (Freedman's inequality) replaces it with the conditional variance $\sum_i \mathbb{E}[D_i^2 \mid \mathcal{F}_{i-1}]$.

This is what powers concentration for adaptive ML algorithms: in online learning, the deviation of the cumulative loss from its conditionally expected value is a martingale by construction (each new loss is centered conditional on the algorithm's prediction), so Azuma-Hoeffding gives high-probability regret bounds. A minimal simulation check follows.
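
The simulation below builds a bounded MDS whose increments depend on the past through a predictable sign and compares the empirical tail against the Azuma-Hoeffding bound. The horizon, threshold, and trial count are arbitrary illustrative choices.

```python
# Azuma-Hoeffding check for a bounded, dependent MDS: D_i = R_i * eps_i with R_i = sign(S_{i-1})
# (predictable) and eps_i independent Rademacher, so E[D_i | F_{i-1}] = 0 and |D_i| <= 1.
# Assumed choices: n = 200 steps, threshold 30, 50,000 Monte Carlo trials.
import numpy as np

rng = np.random.default_rng(3)
n, trials, thresh = 200, 50_000, 30.0

eps = rng.choice([-1.0, 1.0], size=(trials, n))
S = np.zeros(trials)
for i in range(n):
    R = np.where(S >= 0, 1.0, -1.0)   # F_{i-1}-measurable: depends only on the past partial sum
    S += R * eps[:, i]                # martingale difference D_i = R_i * eps_i

empirical = np.mean(np.abs(S) >= thresh)
bound = 2 * np.exp(-thresh**2 / (2 * n))      # 2 exp(-t^2 / (2 sum b_i^2)) with b_i = 1
print(f"P(|S_n| >= {thresh}): empirical {empirical:.5f}   Azuma-Hoeffding bound {bound:.5f}")
```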

Proof Sketch

The proof mirrors scalar Hoeffding. For each $i$, the conditional MGF satisfies $\mathbb{E}[e^{\lambda D_i} \mid \mathcal{F}_{i-1}] \leq e^{\lambda^2 b_i^2 / 2}$ by Hoeffding's lemma. Then $\mathbb{E}[e^{\lambda S_n}] = \mathbb{E}\bigl[e^{\lambda S_{n-1}} \mathbb{E}[e^{\lambda D_n} \mid \mathcal{F}_{n-1}]\bigr] \leq e^{\lambda^2 b_n^2/2}\, \mathbb{E}[e^{\lambda S_{n-1}}]$; iterate over $n$ and optimize $\lambda$. The inner conditioning step is exactly where independence is replaced by the MDS property.

Why It Matters

  • McDiarmid's inequality (the workhorse for generalization bounds) is proved by applying Azuma-Hoeffding to the Doob martingale of any bounded-differences function.
  • Online learning regret bounds: Hedge, EXP3, UCB all rely on Azuma-Hoeffding to control cumulative deviations.
  • Stochastic approximation / SGD convergence: noisy gradient terms form an MDS, and Azuma-Hoeffding gives anytime convergence rates.
  • Doob martingales for general functions of independent coordinates: when $f(X_1, \ldots, X_n)$ has bounded differences in each coordinate (and the $X_i$ are independent), the process $M_k = \mathbb{E}[f \mid X_1, \ldots, X_k]$ is a martingale with bounded differences, giving McDiarmid's $\exp(-2t^2 / \sum_i c_i^2)$ tail.

Failure Mode

Azuma-Hoeffding as stated can be loose by a factor of 4 in the exponent compared to McDiarmid's classical form for bounded-differences functions, because the symmetric bound $|D_i| \leq b_i$ yields the variance proxy $\sum_i b_i^2$ where the actual increment ranges give $\tfrac{1}{4}\sum_i (b_i^+ - b_i^-)^2$. For tighter rates, use Freedman's inequality, which replaces $\sum_i b_i^2$ by the predictable quadratic variation $\sum_i \mathbb{E}[D_i^2 \mid \mathcal{F}_{i-1}]$, whenever a bound on the conditional variances is available.

Reproducing Kernel Hilbert Space Connection

Every PSD kernel $k$ defines both a Gaussian process and a reproducing kernel Hilbert space (RKHS) $\mathcal{H}_k$. These are dual representations of the same kernel:

  • GP view: $k$ generates random functions $f \sim \mathcal{GP}(0, k)$. Sample paths typically lie outside $\mathcal{H}_k$ with probability one.
  • RKHS view: $k$ generates a deterministic Hilbert space of functions. Each $f \in \mathcal{H}_k$ admits the reproducing property $f(x) = \langle f, k(\cdot, x) \rangle_{\mathcal{H}_k}$.

The connection: the posterior mean of GP regression with kernel $k$ on data $\{(x_i, y_i)\}$ is exactly the kernel ridge regression solution in $\mathcal{H}_k$:

$$\hat{f}_{\text{GP}}(x) = \mathbf{k}_*^\top (K + \sigma_n^2 I)^{-1} \mathbf{y}$$

where $K_{ij} = k(x_i, x_j)$ and $(\mathbf{k}_*)_i = k(x_i, x)$. This is also the minimizer of $\sum_i (y_i - f(x_i))^2 + \sigma_n^2 \|f\|^2_{\mathcal{H}_k}$ over $f \in \mathcal{H}_k$.
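
A small numerical check of this identity, under an assumed squared-exponential kernel and synthetic data: the GP posterior mean is computed from the formula above, and the same coefficient vector is verified to minimize the kernel-ridge objective written through the representer theorem.

```python
# GP posterior mean and kernel ridge regression: the same coefficients alpha solve both.
# Assumed choices: squared-exponential kernel (lengthscale 0.3), noise sigma_n^2 = 0.05, toy data.
import numpy as np

rng = np.random.default_rng(4)
ell, sigma_n2 = 0.3, 0.05

def k(a, b):                                              # squared-exponential kernel matrix
    return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * ell**2))

x = rng.uniform(size=15)
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(15)
x_test = np.linspace(0.0, 1.0, 5)

K = k(x, x)
alpha_star = np.linalg.solve(K + sigma_n2 * np.eye(len(x)), y)   # (K + sigma_n^2 I)^{-1} y
gp_mean = k(x, x_test).T @ alpha_star                            # GP posterior mean at x_test
print("GP posterior mean at test points:", gp_mean)

# Representer theorem: over f = sum_i alpha_i k(., x_i), the regularized objective is
# J(alpha) = ||y - K alpha||^2 + sigma_n^2 alpha^T K alpha, and alpha_star is its minimizer.
def J(alpha):
    r = y - K @ alpha
    return r @ r + sigma_n2 * alpha @ (K @ alpha)

print("J at alpha_star      :", J(alpha_star))
print("J at perturbed alphas:", [round(J(alpha_star + 0.1 * rng.standard_normal(15)), 4)
                                 for _ in range(3)])   # all larger than J(alpha_star)
```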

Why the GP samples are outside the RKHS. The RKHS norm penalizes high-frequency components heavily, while GP samples have RKHS norm $\infty$ almost surely (they are "rougher" than RKHS functions). Driscoll's zero-one law states that a GP sample lies in a given RKHS with probability either $0$ or $1$. For $\mathcal{GP}(0, k)$ and the RKHS $\mathcal{H}_k$ of the same kernel, the Karhunen-Loève expansion makes this concrete: a sample's squared RKHS norm is $\sum_j \xi_j^2$ with $\xi_j$ i.i.d. standard Gaussians, which diverges whenever $\mathcal{H}_k$ is infinite-dimensional, so samples lie outside $\mathcal{H}_k$ almost surely. (Lukić-Beder, 2001, give the cleanest modern statement, including membership in the RKHS of a different, rougher kernel.)

This is the resolution to a common confusion: the GP defines a prior distribution over functions, but the posterior mean is the RKHS-norm-minimizing fit, and the prior support is not the RKHS itself.

Common Confusions

Watch Out

GP samples are not in the RKHS

Driscoll's theorem: with probability one, samples from $\mathcal{GP}(0, k)$ do not lie in the RKHS $\mathcal{H}_k$ (whenever $\mathcal{H}_k$ is infinite-dimensional, which is the typical case). Yet the posterior mean of GP regression is the kernel ridge solution in $\mathcal{H}_k$. The GP prior is a distribution on a larger function space than $\mathcal{H}_k$; the posterior concentrates within an RKHS-shaped tube around the data.

Watch Out

NNGP and NTK are different limits

The NNGP kernel describes the prior distribution at random initialization: the network output is $\mathcal{GP}(0, k_{\text{NNGP}})$ before any training. The NTK describes the gradient flow dynamics: under infinitesimal learning rate, the trained network's predictions equal a specific NTK kernel ridge regression. Both apply at infinite width but to different objects (init distribution vs. trained predictor).

Watch Out

Stationary vs. non-stationary kernels

A kernel is stationary if $k(s, t) = \kappa(s - t)$ for some function $\kappa$. Brownian motion's kernel $k(s, t) = \min(s, t)$ is not stationary; the variance $k(t, t) = t$ grows with $t$. Brownian bridge's kernel $k(s, t) = \min(s, t) - st$ is also not stationary. RBF and Matérn are stationary. Stationarity matters for spectral analysis (Bochner's theorem) but not for sample-path regularity.

Watch Out

Continuity of paths vs. continuity of moments

A process can have a continuous covariance kernel but discontinuous sample paths; kernel continuity is far from sufficient for path continuity. Kolmogorov's continuity criterion is a moment-bound sufficient condition for the existence of a continuous modification. The canonical example: $X_t = \mathbf{1}\{t = U\}$ with $U$ uniform on $[0,1]$ is zero almost surely at each fixed $t$, so its covariance kernel is identically zero (trivially continuous), yet every sample path has a jump.

Canonical Examples

Example

Squared-exponential kernel: smooth GP

The kernel $k(s, t) = \exp(-\|s - t\|^2 / (2\ell^2))$ has canonical metric $d(s, t) \asymp \|s - t\|/\ell$. Kolmogorov continuity (using all moments $\alpha$) already certifies every Hölder exponent below 1; in fact, samples are $C^\infty$ and real-analytic on any compact set, almost surely.

The corresponding RKHS consists of functions whose Fourier transforms decay at a Gaussian rate, a very small Hilbert space. GP samples are nonetheless too rough to have finite RKHS norm, yet the GP posterior mean lies in this RKHS.

Example

Matérn kernel: tunable smoothness

The Matérn kernel $k_\nu(s, t) = \frac{2^{1-\nu}}{\Gamma(\nu)} \bigl(\sqrt{2\nu}\,\|s - t\|/\ell\bigr)^\nu K_\nu\!\bigl(\sqrt{2\nu}\,\|s - t\|/\ell\bigr)$ (where $K_\nu$ is the modified Bessel function of the second kind) yields a process that is $k$-times mean-square differentiable exactly when $\nu > k$, so samples have $\lceil \nu \rceil - 1$ derivatives. At $\nu = 1/2$, samples are Brownian-motion-rough; as $\nu \to \infty$, samples become smooth like the squared-exponential.

This makes Matérn the standard kernel in spatial statistics: you choose ν\nu to match the regularity of the underlying signal.
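
A minimal sampling sketch of this smoothness difference (the grid, lengthscale, and seed are arbitrary assumptions): draws from a squared-exponential GP and a Matérn-1/2 GP on the same grid have small increments of order $h$ and $\sqrt{h}$, respectively.

```python
# Roughness of GP samples: squared-exponential vs Matern-1/2 (exponential kernel).
# Assumed choices: 400-point grid on [0, 1], lengthscale 0.2, eigendecomposition sampling.
import numpy as np

rng = np.random.default_rng(5)
t = np.linspace(0.0, 1.0, 400)
h = t[1] - t[0]
D = np.abs(t[:, None] - t[None, :])
ell = 0.2

K_se  = np.exp(-D**2 / (2 * ell**2))    # squared-exponential kernel
K_m12 = np.exp(-D / ell)                # Matern-1/2 kernel (Ornstein-Uhlenbeck covariance)

def sample(K):
    w, V = np.linalg.eigh(K)
    w = np.clip(w, 0.0, None)           # clip tiny negative eigenvalues from round-off
    return V @ (np.sqrt(w) * rng.standard_normal(len(w)))

f_se, f_m12 = sample(K_se), sample(K_m12)
print("mean |increment|, SE kernel :", np.mean(np.abs(np.diff(f_se))),
      " (order h =", round(h, 4), ")")
print("mean |increment|, Matern-1/2:", np.mean(np.abs(np.diff(f_m12))),
      " (order sqrt(h) =", round(np.sqrt(h), 4), ")")
```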

Example

Empirical CDF as a Brownian bridge

For i.i.d. uniform $X_i \sim U[0,1]$, the empirical process $G_n(t) = \sqrt{n}(\hat{F}_n(t) - t)$ converges weakly to the Brownian bridge $B^0_t$ with kernel $k(s, t) = \min(s, t) - st$. This is the original Donsker theorem (1952), and explains why Kolmogorov-Smirnov test statistics are distribution-free.
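
A Monte Carlo check of the limiting covariance, with illustrative sample sizes: the empirical covariance of $(G_n(s), G_n(t))$ should be close to $\min(s,t) - st$.

```python
# Covariance of the uniform empirical process at two points vs the Brownian bridge kernel.
# Assumed choices: n = 500 samples per replicate, 20,000 replicates, s = 0.3, t = 0.7.
import numpy as np

rng = np.random.default_rng(6)
n, reps = 500, 20_000
s, t = 0.3, 0.7

X = rng.uniform(size=(reps, n))
G_s = np.sqrt(n) * ((X <= s).mean(axis=1) - s)   # G_n(s) for each replicate
G_t = np.sqrt(n) * ((X <= t).mean(axis=1) - t)   # G_n(t) for each replicate

print("empirical Cov(G_n(s), G_n(t)) :", np.cov(G_s, G_t)[0, 1])
print("Brownian bridge min(s,t) - st :", min(s, t) - s * t)      # = 0.09
```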

Exercises

ExerciseCore

Problem

Verify that the Brownian bridge kernel $k(s, t) = \min(s, t) - st$ on $[0,1]$ is positive semidefinite. Compute the canonical metric $d(s, t)$ and explain why a Brownian bridge sample is continuous but not differentiable.

ExerciseAdvanced

Problem

Compute the NNGP kernel for a one-hidden-layer ReLU network with input $x \in \mathbb{R}^d$ and weights $W^{(1)}_{ij} \sim \mathcal{N}(0, \sigma_w^2/d)$, no biases. Verify it equals the arc-cosine kernel of order 1 (Cho-Saul 2009).

ExerciseAdvanced

Problem

A Doob martingale for $f(X_1, \ldots, X_n)$ is defined by $M_k = \mathbb{E}[f \mid X_1, \ldots, X_k]$. If $f$ has bounded differences $|f(\ldots, x_i, \ldots) - f(\ldots, x_i', \ldots)| \leq c_i$, show that the MDS $D_k = M_k - M_{k-1}$ satisfies $|D_k| \leq c_k$ a.s. Apply Azuma-Hoeffding to recover McDiarmid's inequality.

ExerciseResearch

Problem

State Borell-TIS (Tsirelson-Ibragimov-Sudakov) concentration for Gaussian processes: a sub-Gaussian tail for $\sup_t X_t$ around its expectation. Compare to Azuma-Hoeffding for empirical processes; explain why both give the same rate but Borell-TIS has a sharper constant.

References

Canonical:

  • Adler & Taylor (2007), Random Fields and Geometry, Chapters 1-2 (covariance kernels, sample-path regularity, Borell-TIS)
  • Karatzas & Shreve (1991), Brownian Motion and Stochastic Calculus, Chapter 2 (Kolmogorov continuity, Brownian motion construction)
  • Rasmussen & Williams (2006), Gaussian Processes for Machine Learning, Chapters 2 and 4 (GP regression, kernel zoo)
  • Durrett (2019), Probability: Theory and Examples (5th ed.), Chapters 4-7 (martingales, Brownian motion)
  • van Handel (2016), Probability in High Dimension, Chapters 5-7 (Borell-TIS, Sudakov-Fernique, Gaussian processes for empirical processes)

Current:

  • Lee, Bahri, Novak, Schoenholz, Pennington, Sohl-Dickstein (2018), "Deep Neural Networks as Gaussian Processes," ICLR, arXiv:1711.00165
  • Matthews, Hron, Rowland, Turner, Ghahramani (2018), "Gaussian Process Behaviour in Wide Deep Neural Networks," ICLR, arXiv:1804.11271
  • Jacot, Gabriel, Hongler (2018), "Neural Tangent Kernel: Convergence and Generalization in Neural Networks," NeurIPS, arXiv:1806.07572
  • Cho & Saul (2009), "Kernel Methods for Deep Learning," NeurIPS (arc-cosine kernels)

Frontier:

  • Yang (2019-2021), "Tensor Programs" series (I-V), arXiv:1910.12478, 2006.14548, 2009.10685 — modern formulation of NNGP/NTK for arbitrary architectures including transformers
  • Roberts, Yaida, Hanin (2022), The Principles of Deep Learning Theory — finite-width $1/m$ corrections to NNGP

Last reviewed: May 5, 2026
