Stochastic Processes for ML
Random processes as the unifying framework behind Gaussian processes, neural tangent kernels, empirical processes, and martingale concentration: definitions, sample-path regularity, and the bridges between learning theory's main probabilistic tools.
Prerequisites
- Expectation, Variance, Covariance, and Moments
- Measure-Theoretic Probability
- Concentration Inequalities
Why This Matters
A surprising amount of modern learning theory is one branch of stochastic-process theory under different names. Gaussian processes are random functions indexed by inputs. Empirical processes are random functions indexed by hypotheses. An infinitely wide neural network at initialization is a Gaussian process, and the closely related neural tangent kernel governs its training dynamics. McDiarmid's inequality and SGD convergence both rely on martingale concentration.
These look like separate topics until you see the underlying object: a random function $(X_t)_{t \in T}$ on an index set $T$, with the covariance kernel $K(s,t)$ controlling the geometry of the process. Once you internalize that, results transfer between the views without re-deriving them.
Mental Model
A stochastic process is a probability distribution over the function space $\mathbb{R}^T$ (real-valued functions on the index set $T$). You can take three different perspectives on the same object, and learning theory uses all three:
- Pointwise (marginal): at each fixed $t \in T$, $X_t$ is a real-valued random variable. Most concentration tools (Hoeffding, Bernstein) apply here.
- Path-wise (sample function): a single draw is a function $t \mapsto X_t(\omega)$ from $T$ to $\mathbb{R}$. Smoothness, continuity, and Hölder regularity apply here.
- Spectral (covariance kernel): the second-order structure $K(s,t) = \operatorname{Cov}(X_s, X_t)$ encodes everything for Gaussian processes and most of what matters for learning theory.
Switching between these views is the core skill. The empirical-process supremum $\sup_{f \in \mathcal{F}} G_n(f)$ is a path-wise quantity; bounding it via Dudley's entropy integral uses the spectral view (covering numbers in the canonical metric); showing the bound is nearly tight uses Sudakov's minoration in the same metric.
Formal Setup
Stochastic Process
A stochastic process indexed by a set $T$ on a probability space $(\Omega, \mathcal{A}, \mathbb{P})$ is a collection of random variables $(X_t)_{t \in T}$, each measurable into a state space (here always $\mathbb{R}$ with its Borel $\sigma$-algebra). When $T = [0, \infty)$ the process is in continuous time; when $T \subseteq \mathbb{R}^d$ the process is a random field; when $T = \mathcal{F}$ is a function class, the process is an empirical process.
Covariance Kernel
For a centered process ($\mathbb{E}[X_t] = 0$ for all $t$), the covariance kernel is:

$K(s, t) = \mathbb{E}[X_s X_t].$
The kernel is symmetric ($K(s,t) = K(t,s)$) and positive semidefinite (every $n \times n$ Gram matrix $[K(t_i, t_j)]_{i,j}$ is PSD). Conversely, every PSD kernel arises as the covariance of some stochastic process, via the Gaussian-process construction below; Mercer's and Bochner's theorems give spectral representations of such kernels.
The canonical metric induced by the process is:

$d(s, t) = \sqrt{\mathbb{E}\big[(X_s - X_t)^2\big]}.$

For a centered process, the canonical metric depends only on the kernel: $d(s,t)^2 = K(s,s) - 2K(s,t) + K(t,t)$.
Gaussian Process
A stochastic process $(X_t)_{t \in T}$ is a Gaussian process with mean zero and covariance $K$ if for any finite collection $t_1, \dots, t_n \in T$, the random vector $(X_{t_1}, \dots, X_{t_n})$ is jointly Gaussian with covariance matrix $[K(t_i, t_j)]_{i,j=1}^n$.
By Kolmogorov's extension theorem, any PSD kernel on $T \times T$ defines a Gaussian process; the existence is automatic. The hard question is sample-path regularity: when is a draw from this process continuous, differentiable, or Hölder?
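The finite-dimensional marginals are all you need to simulate such a process on a grid. Below is a minimal NumPy sketch (the squared-exponential kernel, the grid, and the jitter constant are illustrative choices, not part of the definition): it draws sample paths from the finite marginal $\mathcal{N}(0, K)$ via a Cholesky factor and evaluates the canonical metric directly from the kernel.

```python
import numpy as np

def rbf_kernel(s, t, lengthscale=0.5):
    """Squared-exponential kernel K(s, t) = exp(-(s - t)^2 / (2 * lengthscale^2))."""
    return np.exp(-((s[:, None] - t[None, :]) ** 2) / (2 * lengthscale ** 2))

def sample_gp(ts, kernel, n_draws=3, jitter=1e-10, seed=0):
    """Draw sample paths of the centered GP with the given kernel, using the
    finite-dimensional marginal N(0, K) on the grid `ts`."""
    rng = np.random.default_rng(seed)
    K = kernel(ts, ts) + jitter * np.eye(len(ts))   # Gram matrix of the finite marginal
    L = np.linalg.cholesky(K)                       # K = L @ L.T
    return L @ rng.standard_normal((len(ts), n_draws))

ts = np.linspace(0.0, 1.0, 200)
paths = sample_gp(ts, rbf_kernel)                   # shape (200, 3): three sample paths

# Canonical metric from the kernel: d(s, t)^2 = K(s, s) - 2 K(s, t) + K(t, t)
K = rbf_kernel(ts, ts)
d = np.sqrt(np.clip(np.diag(K)[:, None] - 2 * K + np.diag(K)[None, :], 0.0, None))
```

The three views from the mental model show up directly: one column of `paths` is a path-wise draw, one row is the pointwise marginal across draws, and `K` with `d` carries the spectral view.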
Sample-Path Regularity
A Gaussian process exists abstractly given a PSD kernel. But whether you can realize it with continuous sample paths is a separate question, controlled by how fast the canonical metric grows.
Kolmogorov Continuity Criterion
Statement
Let $(X_t)_{t \in [0, T]}$ be a stochastic process such that for some constants $\alpha, \beta, C > 0$ and all $s, t$:

$\mathbb{E}\big[|X_t - X_s|^{\alpha}\big] \le C\, |t - s|^{1 + \beta}.$

Then $X$ admits a modification $\tilde{X}$ (i.e., $\mathbb{P}(\tilde{X}_t = X_t) = 1$ for each $t$) with sample paths that are locally $\gamma$-Hölder continuous for every $\gamma \in (0, \beta/\alpha)$.
Source: Karatzas-Shreve (1991), Theorem 2.2.8; Stroock-Varadhan (1979), Theorem 2.1.6.
Intuition
The criterion translates a moment bound on increments into a path-regularity guarantee. Brownian motion has Gaussian increments with $\mathbb{E}|B_t - B_s|^{2p} = C_p\, |t - s|^{p}$ for every $p \ge 1$, so taking $\alpha = 2p$ large gives continuity with $\gamma < (p - 1)/(2p)$: Brownian paths are $\gamma$-Hölder for every $\gamma < 1/2$, and not better (this is sharp).
For a Gaussian process with kernel satisfying $\mathbb{E}[(X_s - X_t)^2] \le L^2 |s - t|^{2H}$, all increment moments are controlled by the variance ($\mathbb{E}|X_s - X_t|^{2p} = C_p\, d(s,t)^{2p}$ for Gaussian increments), and Kolmogorov gives continuity with Hölder exponent up to $H$.
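A quick Monte-Carlo sanity check of the Brownian case (the moment order and sample sizes below are arbitrary illustrative choices): the $2p$-th increment moment scales like $|t - s|^p$, which enters the criterion with $\alpha = 2p$ and $\beta = p - 1$.

```python
import numpy as np

# Monte-Carlo check of the increment-moment bound for Brownian motion:
# E|B_{t+h} - B_t|^{2p} = C_p * h^p with C_p = (2p - 1)!! for a standard normal.
rng = np.random.default_rng(0)
p = 3                                   # use alpha = 2p = 6 in Kolmogorov's criterion
for h in [0.01, 0.02, 0.05, 0.1]:
    increments = np.sqrt(h) * rng.standard_normal(200_000)   # B_{t+h} - B_t ~ N(0, h)
    emp = np.mean(np.abs(increments) ** (2 * p))
    theory = 15.0 * h ** p                                    # (2*3 - 1)!! = 15
    print(f"h={h:5.2f}  empirical={emp:.3e}  theory={theory:.3e}")
# With alpha = 6 the bound is ~ h^3 = h^{1+2}, so beta = 2 and gamma < beta/alpha = 1/3;
# letting p grow pushes the guaranteed Hölder exponent toward 1/2.
```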
Proof Sketch
Discretize the index set into dyadic grids at scales $2^{-n}$. At scale $n$ there are $\sim 2^{n}$ grid points, and each adjacent pair has increment moment bounded by $C\, 2^{-n(1+\beta)}$. Markov plus a union bound gives a uniform increment bound at each scale, and the bounds are summable in $n$ whenever $\gamma < \beta/\alpha$. The resulting series is finite a.s., and chaining across the dyadic scales defines a continuous modification.
Why It Matters
This is the bridge from spectral information (a kernel decay rate) to path-wise information (Hölder regularity of GP samples). For the squared-exponential kernel $K(s,t) = \exp(-(s-t)^2/(2\ell^2))$, the canonical metric satisfies $d(s,t) \le |s-t|/\ell$, so Kolmogorov gives every Hölder exponent $\gamma < 1$ (and the samples are in fact infinitely differentiable). For the Matérn kernel with smoothness $\nu \in (0,1)$, $d(s,t) \asymp |s-t|^{\nu}$, and samples are $\gamma$-Hölder exactly for $\gamma < \nu$.
This determines whether GP regression with a given kernel can recover functions in a corresponding Hölder class, and at what rate.
Failure Mode
The criterion is sufficient, not necessary. A process can have a continuous modification without satisfying the moment bound for any useful exponents (e.g., the deterministic process $X_t = f(t)$ with $f$ continuous but not Hölder). Conversely, failing the criterion tells you nothing: the process may still admit a continuous modification. For non-Gaussian processes with heavy-tailed increments, only small moment orders $\alpha$ are available, so the criterion can certify much weaker continuity than the paths actually have.
Brownian Motion: The Canonical Gaussian Process
Brownian motion (the Wiener process) is the prototypical Gaussian process: it has the simplest possible kernel and yet exhibits all the path-regularity phenomena that complicate general processes.
A standard Brownian motion $(B_t)_{t \ge 0}$ is the centered Gaussian process with covariance:

$K(s, t) = \min(s, t).$

The canonical metric is $d(s,t) = \sqrt{\mathbb{E}[(B_s - B_t)^2]} = \sqrt{|s - t|}$, growing as the square root of the time separation. By Kolmogorov continuity, $B$ admits a continuous modification with paths that are $\gamma$-Hölder for every $\gamma < 1/2$.
Why this is canonical. Many stochastic processes used in ML can be written as functionals of Brownian motion:
- Ornstein-Uhlenbeck process: $X_t = e^{-t} B_{e^{2t}}$ is a stationary GP with exponential kernel $K(s,t) = e^{-|s-t|}$. This is the Matérn-1/2 kernel and the noise process used in score-matching diffusion models.
- Brownian bridge: $B^{\circ}_t = B_t - t B_1$ on $[0, 1]$ has kernel $K(s,t) = \min(s,t) - st$ and is the limit process for empirical CDFs (Donsker's theorem).
- Fractional Brownian motion: the GP with $K(s,t) = \tfrac{1}{2}\big(s^{2H} + t^{2H} - |s-t|^{2H}\big)$ for Hurst index $H \in (0, 1)$. Reduces to Brownian motion at $H = 1/2$.
Quadratic variation. For Brownian motion, $\sum_i (B_{t_{i+1}} - B_{t_i})^2 \to t$ in $L^2$ as the partition mesh $\to 0$. This is why Itô calculus has a second-order correction term: ordinary Riemann-Stieltjes integration fails because the squared increment $(B_{t+h} - B_t)^2$ is of order $h$, not $h^2$, even at infinitesimal scale.
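A minimal NumPy sketch of this limit (the grid sizes and seed are arbitrary choices): one Brownian path is simulated on a fine grid and its quadratic variation is evaluated over progressively coarser partitions of $[0, 1]$.

```python
import numpy as np

# Simulate one Brownian path on [0, 1] and check that the quadratic variation
# sum_i (B_{t_{i+1}} - B_{t_i})^2 concentrates around t = 1 as the mesh shrinks.
rng = np.random.default_rng(1)
n_fine = 2 ** 18
dB = np.sqrt(1.0 / n_fine) * rng.standard_normal(n_fine)   # increments on the finest grid
B = np.concatenate([[0.0], np.cumsum(dB)])                  # one Brownian path

for n in [2 ** k for k in (4, 8, 12, 16)]:
    idx = np.linspace(0, n_fine, n + 1, dtype=int)          # coarser partition of [0, 1]
    qv = np.sum(np.diff(B[idx]) ** 2)
    print(f"mesh = 1/{n:6d}   quadratic variation = {qv:.4f}")   # -> 1.0
```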
Connection to Empirical Processes
An empirical process over a function class $\mathcal{F}$ is the stochastic process $G_n(f) = \sqrt{n}\,(P_n - P)f = \frac{1}{\sqrt{n}} \sum_{i=1}^{n} \big(f(Z_i) - \mathbb{E}_P[f]\big)$, indexed by $f \in \mathcal{F}$, where $Z_1, \dots, Z_n \sim P$ i.i.d. The covariance kernel is:

$K(f, g) = \operatorname{Cov}_P\big(f(Z), g(Z)\big) = \mathbb{E}_P[fg] - \mathbb{E}_P[f]\,\mathbb{E}_P[g].$

The canonical metric is $d(f, g) = \sqrt{\operatorname{Var}_P\big(f(Z) - g(Z)\big)}$, which is dominated by the $L^2(P)$ distance $\|f - g\|_{L^2(P)}$.
Under regularity conditions on $\mathcal{F}$, the empirical process converges in distribution to a centered Gaussian process indexed by $\mathcal{F}$ with this covariance:
Donsker's Theorem (Functional CLT)
Statement
Let $\mathcal{F}$ be a Donsker class for $P$. Then the empirical process $G_n = \sqrt{n}\,(P_n - P)$ converges in distribution, in $\ell^{\infty}(\mathcal{F})$, to a centered Gaussian process $G_P$ indexed by $\mathcal{F}$ with covariance $K(f,g) = \operatorname{Cov}_P(f, g)$:

$G_n \rightsquigarrow G_P \quad \text{in } \ell^{\infty}(\mathcal{F}).$
Source: van der Vaart-Wellner (1996), Theorem 2.5.2; the classical case (empirical CDF, $\mathcal{F} = \{\mathbf{1}_{(-\infty, t]} : t \in \mathbb{R}\}$) is the original Donsker theorem (1952), with the Brownian bridge as the limit.
Intuition
This is the function-space analog of the central limit theorem. For the empirical CDF on the real line, the limit process is the Brownian bridge composed with $F$, which has covariance $F(s \wedge t) - F(s)F(t)$. For more complex classes (VC classes, norm-bounded RKHS balls), Donsker's theorem requires the class to be suitably measurable and to have a finite uniform entropy integral.
The practical consequence: bounding the finite-sample supremum $\sup_{f \in \mathcal{F}} |G_n(f)|$ via chaining is essentially the same as bounding the supremum of the limit Gaussian process $G_P$. Dudley's entropy integral works for both.
Proof Sketch
The proof has three pieces: (1) finite-dimensional convergence by the classical multivariate CLT applied to any finite collection $f_1, \dots, f_k \in \mathcal{F}$; (2) uniform asymptotic equicontinuity, which follows from the finite uniform entropy integral (the Donsker class condition); (3) tightness in $\ell^{\infty}(\mathcal{F})$ via the equicontinuity, by an Arzelà-Ascoli style argument. Putting the pieces together gives weak convergence in the function-space sense.
Why It Matters
Donsker's theorem is the reason that learning theory results derived for Gaussian processes (Sudakov-Fernique, generic chaining, Borell-TIS concentration) transfer to empirical processes. The two are the same object in the limit, modulo finite-sample corrections.
Concretely, this is what licenses using $1/\sqrt{n}$-rate Gaussian-process heuristics in finite-sample analysis. Bracketing-entropy conditions (van der Vaart-Wellner, Sec. 2.7) give a complementary route to the Donsker property that also covers parametric and smoothly parameterized classes.
Failure Mode
Donsker's theorem requires the class to be Donsker, which excludes sufficiently rich classes (e.g., the class of all measurable indicator functions, or sequences of VC classes whose dimension grows with $n$). For non-Donsker classes, you fall back on direct chaining bounds without an asymptotic GP limit.
Neural Networks as Gaussian Processes
The neural-network-Gaussian-process correspondence is one of the cleanest applications of stochastic-process theory to deep learning. At initialization, an infinitely wide neural network with i.i.d. Gaussian weights is a Gaussian process, with a kernel computable in closed form.
Neural Network Gaussian Process Correspondence
Statement
Let $f(x) \in \mathbb{R}$ be the output of a fully-connected network with activation $\phi$ and i.i.d. Gaussian initialization: weights $W^{(\ell)}_{ij} \sim \mathcal{N}(0, \sigma_w^2 / n_{\ell-1})$ and biases $b^{(\ell)}_i \sim \mathcal{N}(0, \sigma_b^2)$. As all hidden widths $n_1, \dots, n_{L-1} \to \infty$ jointly, the output converges in distribution (over the random init) to a centered Gaussian process whose NNGP kernel is defined recursively by:

$K^{(1)}(x, x') = \sigma_b^2 + \frac{\sigma_w^2}{n_0}\, x^{\top} x', \qquad K^{(\ell+1)}(x, x') = \sigma_b^2 + \sigma_w^2\, \mathbb{E}_{(u, v) \sim \mathcal{N}(0,\, \Lambda^{(\ell)})}\big[\phi(u)\,\phi(v)\big],$

where $\Lambda^{(\ell)}$ is the 2×2 covariance matrix $\begin{pmatrix} K^{(\ell)}(x, x) & K^{(\ell)}(x, x') \\ K^{(\ell)}(x, x') & K^{(\ell)}(x', x') \end{pmatrix}$.
Source: Neal (1996) for one hidden layer; Lee, Bahri, Novak, Schoenholz, Pennington, Sohl-Dickstein (2018), "Deep Neural Networks as Gaussian Processes," ICLR; Matthews et al. (2018), "Gaussian Process Behaviour in Wide Deep Neural Networks," ICLR.
Intuition
At each hidden layer, the preactivations are sums over the previous layer's neurons of (weight) × (activation), i.e., sums of i.i.d. terms. The classical CLT says the sum, scaled by $1/\sqrt{n_{\ell}}$ (the scaling supplied by the $\sigma_w^2 / n_{\ell}$ weight variance), converges to a Gaussian whose variance equals the per-term variance. This is the key step: at infinite width, every layer's preactivations are Gaussian, and the covariance is determined by the previous layer's covariance through the recursion above.
The recursion is computable in closed form for ReLU (the Cho-Saul 2009 arc-cosine kernel) and for a few other activations such as erf; otherwise it is evaluated by one-dimensional Gaussian quadrature. A sketch of the ReLU case follows.
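The snippet below is a hypothetical NumPy implementation of the ReLU recursion under simplifying assumptions not fixed by the statement above: no biases ($\sigma_b^2 = 0$), weight variance $\sigma_w^2 = 2$, and depth counted in weight layers; the inputs, width, and sample counts are arbitrary. It also estimates the finite-width output covariance by Monte Carlo over random initializations, which should match the closed form up to $O(1/\text{width})$ and sampling error.

```python
import numpy as np

def relu_nngp_layer(K, sw2=2.0):
    """One NNGP recursion step for ReLU via the Cho-Saul arc-cosine formula.
    K is the 2x2 covariance of the previous layer's preactivations at (x, x')."""
    k11, k22, k12 = K[0, 0], K[1, 1], K[0, 1]
    cos_t = np.clip(k12 / np.sqrt(k11 * k22), -1.0, 1.0)
    theta = np.arccos(cos_t)
    cross = np.sqrt(k11 * k22) / (2 * np.pi) * (np.sin(theta) + (np.pi - theta) * cos_t)
    return sw2 * np.array([[k11 / 2.0, cross], [cross, k22 / 2.0]])  # E[relu(u)^2] = k/2

def nngp_kernel(x, xp, depth=3, sw2=2.0):
    """Closed-form NNGP 2x2 kernel for a ReLU net with `depth` weight layers, no biases."""
    d = x.shape[0]
    K = sw2 / d * np.array([[x @ x, x @ xp], [x @ xp, xp @ xp]])  # layer-1 preactivation covariance
    for _ in range(depth - 1):
        K = relu_nngp_layer(K, sw2)
    return K

def finite_width_cov(x, xp, width=512, depth=3, sw2=2.0, n_nets=1000, seed=0):
    """Monte-Carlo output covariance over random initializations at finite width."""
    rng = np.random.default_rng(seed)
    X = np.stack([x, xp])                              # (2, d)
    outs = np.zeros((n_nets, 2))
    for i in range(n_nets):
        h = X
        for _ in range(depth - 1):
            W = rng.standard_normal((h.shape[1], width)) * np.sqrt(sw2 / h.shape[1])
            h = np.maximum(h @ W, 0.0)                 # hidden ReLU layer
        w_out = rng.standard_normal((h.shape[1], 1)) * np.sqrt(sw2 / h.shape[1])
        outs[i] = (h @ w_out).ravel()
    return outs.T @ outs / n_nets                      # empirical 2x2 output covariance

x, xp = np.array([1.0, 0.0, 1.0]), np.array([0.0, 1.0, 1.0])
print(nngp_kernel(x, xp))         # exact infinite-width kernel
print(finite_width_cov(x, xp))    # agrees up to O(1/width) and Monte-Carlo error
```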
Proof Sketch
Iterate the central limit theorem layer by layer. At layer $\ell + 1$, conditional on the layer-$\ell$ preactivations being Gaussian with covariance $K^{(\ell)}$, the preactivations at layer $\ell + 1$ are sums of $n_{\ell}$ i.i.d. terms. By the multivariate CLT (with vanishing higher moments under a Lipschitz condition on $\phi$), the limit is Gaussian with covariance $K^{(\ell+1)}$. The technical core is showing the convergence is uniform over finite collections of inputs, which requires controlling the continuity of $K^{(\ell)}$ and the growth of $\phi$. See Lee et al. (2018), Theorem 1, for the formal statement.
Why It Matters
The NNGP correspondence enables exact Bayesian inference with infinite-width networks: you compute the NNGP kernel for your architecture, then do GP regression with that kernel. The posterior is in closed form.
The closely related neural tangent kernel captures the training dynamics of infinite-width networks: under gradient flow with infinitesimal learning rate, the NTK is constant during training, and the trained network's predictions equal NTK kernel ridge regression. NNGP is for inference at init; NTK is for training trajectories.
This gives a complete theory of infinite-width networks: they are Gaussian processes both before and (in a different sense) after training. The hard open question is what finite width changes, since real networks have features that infinite-width theory cannot model.
Failure Mode
The correspondence holds in the infinite-width limit. At finite width, correction terms appear; the leading correction to kernel quantities is $O(1/\text{width})$. For modern transformers (where the relevant scaling involves sequence length and depth, not just width), the NNGP correspondence is more delicate; see Yang's "Tensor Programs" series for the modern formulation.
The correspondence does not capture feature learning: in the NNGP/NTK limits, the network's intermediate representations are frozen at their initialization values. Real networks learn features, which is why finite-width theory matters for explaining modern deep learning.
Martingales: Concentration for Adapted Sequences
Empirical processes assume i.i.d. sampling. Many ML problems involve adapted data: the next sample depends on the past (online learning, RL, sequential experiments). Martingales are the right generalization.
Martingale
A sequence $(M_n)_{n \ge 0}$ adapted to a filtration $(\mathcal{F}_n)_{n \ge 0}$ is a martingale if $\mathbb{E}|M_n| < \infty$ and for all $n$:

$\mathbb{E}[M_{n+1} \mid \mathcal{F}_n] = M_n.$

The increments $D_n = M_n - M_{n-1}$ form a martingale difference sequence (MDS) with $\mathbb{E}[D_n \mid \mathcal{F}_{n-1}] = 0$.
Azuma-Hoeffding for Bounded MDS
Statement
Let $(D_i)_{i=1}^{n}$ be a martingale difference sequence with $|D_i| \le c_i$ a.s. for all $i$. Let $M_n = \sum_{i=1}^{n} D_i$. Then for all $t > 0$:

$\mathbb{P}(M_n \ge t) \le \exp\!\left(-\frac{t^2}{2 \sum_{i=1}^{n} c_i^2}\right),$

and the same bound holds for $\mathbb{P}(M_n \le -t)$.
Source: Azuma (1967), "Weighted sums of certain dependent random variables," Tohoku Math. J. 19; Hoeffding (1963), Theorem 2.
Intuition
This is Hoeffding's inequality without the independence assumption: the martingale difference structure ($\mathbb{E}[D_i \mid \mathcal{F}_{i-1}] = 0$) replaces independence and gives the same sub-Gaussian tail. The constant $\sum_i c_i^2$ is the worst-case "effective variance bound"; Bernstein-type tightening (Freedman's inequality) replaces it with the predictable conditional variance $\sum_i \mathbb{E}[D_i^2 \mid \mathcal{F}_{i-1}]$.
This is what powers concentration for adaptive ML algorithms: in online learning, the deviation of the cumulative loss from its conditionally expected value is a martingale by construction (each new loss is centered given the algorithm's past), so Azuma-Hoeffding controls it at every horizon, and a union bound over horizons gives anytime regret bounds.
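A Monte-Carlo check of the bound (a sketch; the adaptive scale rule, horizon, and sample counts are arbitrary choices): the increments below are not independent, since their magnitude depends on the running sum, but they are conditionally centered, which is all Azuma-Hoeffding uses.

```python
import numpy as np

# Bounded martingale difference sequence with an adapted (past-dependent) scale.
rng = np.random.default_rng(2)
n, n_paths = 200, 100_000
M = np.zeros(n_paths)
for _ in range(n):
    scale = np.where(M > 0, 1.0, 0.5)                  # predictable: depends only on the past
    D = scale * rng.choice([-1.0, 1.0], size=n_paths)  # fair sign => conditional mean zero, |D| <= 1
    M += D

sum_c2 = n * 1.0 ** 2                                  # worst-case bound c_i = 1 at every step
for t in [10.0, 20.0, 30.0]:
    emp = np.mean(M >= t)
    bound = np.exp(-t ** 2 / (2 * sum_c2))
    print(f"t={t:5.1f}  P(M_n >= t) ~ {emp:.4f}   Azuma bound {bound:.4f}")
```

The empirical tail sits well below the bound here because the adaptive scale is often smaller than the worst case, which is exactly the slack that Freedman's conditional-variance refinement removes.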
Proof Sketch
The proof mirrors scalar Hoeffding. For each $i$, the conditional MGF satisfies $\mathbb{E}[e^{\lambda D_i} \mid \mathcal{F}_{i-1}] \le e^{\lambda^2 c_i^2 / 2}$ by Hoeffding's lemma. Then $\mathbb{E}[e^{\lambda M_n}] = \mathbb{E}\big[e^{\lambda M_{n-1}}\, \mathbb{E}[e^{\lambda D_n} \mid \mathcal{F}_{n-1}]\big] \le e^{\lambda^2 c_n^2 / 2}\, \mathbb{E}[e^{\lambda M_{n-1}}]$; iterate over $i$, apply Markov's inequality, and optimize over $\lambda$. The inner conditioning step is exactly where independence is replaced by the MDS property.
Why It Matters
- McDiarmid's inequality (the workhorse for generalization bounds) is proved by applying Azuma-Hoeffding to the Doob martingale of any bounded-differences function.
- Online learning regret bounds: Hedge, EXP3, UCB all rely on Azuma-Hoeffding to control cumulative deviations.
- Stochastic approximation / SGD convergence: noisy gradient terms form an MDS, and Azuma-Hoeffding gives anytime convergence rates.
- Doob martingales for functions of independent variables: when $f(X_1, \dots, X_n)$ has bounded differences in each coordinate, the Doob martingale $M_i = \mathbb{E}[f(X_1, \dots, X_n) \mid X_1, \dots, X_i]$ has increments bounded by those differences, i.e., it is a bounded MDS, giving McDiarmid's tail.
Failure Mode
Azuma-Hoeffding applied to the Doob martingale is loose by a factor of 4 in the exponent compared to McDiarmid's classical form for bounded-differences functions: conditionally on the past, the Doob increment lies in an interval of length $c_i$, which gives variance proxy $c_i^2/4$ via Hoeffding's lemma, whereas the symmetric bound $|D_i| \le c_i$ only gives $c_i^2$. For tighter rates, use Freedman's inequality, which replaces $\sum_i c_i^2$ by the predictable quadratic variation $\sum_i \mathbb{E}[D_i^2 \mid \mathcal{F}_{i-1}]$, when a conditional variance bound is available.
Reproducing Kernel Hilbert Space Connection
Every PSD kernel $K$ defines both a Gaussian process $\mathcal{GP}(0, K)$ and a reproducing kernel Hilbert space (RKHS) $\mathcal{H}_K$. These are dual representations of the same kernel:
- GP view: $K$ generates random functions $f \sim \mathcal{GP}(0, K)$. Sample paths typically lie outside $\mathcal{H}_K$ with probability one.
- RKHS view: $K$ generates a deterministic Hilbert space of functions. Each $f \in \mathcal{H}_K$ admits the reproducing property $f(x) = \langle f, K(\cdot, x) \rangle_{\mathcal{H}_K}$.
The connection: the posterior mean of GP regression with kernel $K$ and noise variance $\sigma^2$ on data $(x_i, y_i)_{i=1}^{n}$ is exactly the kernel ridge regression solution in $\mathcal{H}_K$:

$\hat{f}(x) = \sum_{i=1}^{n} \alpha_i\, K(x, x_i), \qquad \alpha = (\mathbf{K} + \sigma^2 I)^{-1} y,$

where $\mathbf{K} = [K(x_i, x_j)]_{i,j}$ and $y = (y_1, \dots, y_n)$. This is also the minimizer of $\sum_{i=1}^{n} (y_i - f(x_i))^2 + \sigma^2 \|f\|_{\mathcal{H}_K}^2$ over $f \in \mathcal{H}_K$.
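A minimal NumPy sketch of this identity (the squared-exponential kernel, the toy data, and the noise level are illustrative choices, not from the text): the GP posterior mean and the kernel ridge predictor come from the same linear system, with the ridge parameter playing the role of the observation-noise variance.

```python
import numpy as np

def rbf(A, B, ell=0.5):
    """Squared-exponential kernel matrix between 1-D input sets A and B."""
    return np.exp(-((A[:, None] - B[None, :]) ** 2) / (2 * ell ** 2))

rng = np.random.default_rng(3)
X = rng.uniform(0, 1, size=12)                             # training inputs
y = np.sin(2 * np.pi * X) + 0.1 * rng.standard_normal(12)  # noisy targets
X_test = np.linspace(0, 1, 100)
noise = 0.1 ** 2                                           # sigma^2 = ridge parameter

K = rbf(X, X)
K_star = rbf(X_test, X)

# GP regression posterior mean: k(x, X) (K + sigma^2 I)^{-1} y
gp_mean = K_star @ np.linalg.solve(K + noise * np.eye(len(X)), y)

# Kernel ridge regression: f(x) = sum_i alpha_i k(x, x_i) with alpha = (K + lambda I)^{-1} y
alpha = np.linalg.solve(K + noise * np.eye(len(X)), y)
krr_pred = K_star @ alpha

print(np.max(np.abs(gp_mean - krr_pred)))                  # 0.0: the two predictors coincide
```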
Why the GP samples are outside the RKHS. The RKHS norm penalizes high-frequency components heavily, and a GP sample has infinite RKHS norm almost surely (it is "rougher" than any RKHS function). Driscoll's zero-one law states that a GP sample lies in the RKHS with probability either $0$ or $1$. In the Karhunen-Loève expansion $f = \sum_k \sqrt{\lambda_k}\, \xi_k \phi_k$, where the $\lambda_k$ are the eigenvalues of $K$ as an integral operator and the $\xi_k$ are i.i.d. standard normals, the squared RKHS norm is $\|f\|_{\mathcal{H}_K}^2 = \sum_k \xi_k^2$, which is finite only if finitely many $\lambda_k$ are nonzero. For typical kernels infinitely many eigenvalues are nonzero, so samples lie outside $\mathcal{H}_K$ almost surely. (Lukić-Beder, 2001, gave the cleanest modern statement.)
This is the resolution to a common confusion: the GP defines a prior distribution over functions, but the posterior mean is the RKHS-norm-minimizing fit, and the prior support is not the RKHS itself.
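A finite-grid illustration of the norm blow-up (a hypothetical NumPy sketch; the kernel, lengthscale, grid sizes, and jitter are arbitrary): on an $n$-point grid, the quadratic form $f^{\top} \mathbf{K}^{-1} f$ is the discrete proxy for $\|f\|_{\mathcal{H}_K}^2$. For a GP sample it is $\chi^2_n$ in distribution and grows with the grid size, while for the genuine RKHS function $K(\cdot, x_0)$ it stays near $K(x_0, x_0) = 1$ no matter how fine the grid.

```python
import numpy as np

def rbf(A, B, ell=0.3):
    return np.exp(-((A[:, None] - B[None, :]) ** 2) / (2 * ell ** 2))

rng = np.random.default_rng(4)
for n in [50, 100, 200, 400]:
    x = np.linspace(0, 1, n)
    K = rbf(x, x) + 1e-6 * np.eye(n)                  # jitter for numerical stability
    L = np.linalg.cholesky(K)

    f_gp = L @ rng.standard_normal(n)                 # one GP sample on the grid
    gp_quad = f_gp @ np.linalg.solve(K, f_gp)         # grid proxy for ||f||_H^2, ~ chi^2_n

    f_rkhs = K[:, n // 2]                             # the RKHS function k(., x_0) on the grid
    rkhs_quad = f_rkhs @ np.linalg.solve(K, f_rkhs)   # stays ~ k(x_0, x_0) = 1 for every n

    print(f"n={n:4d}   GP sample: {gp_quad:8.1f}   RKHS function: {rkhs_quad:.3f}")
```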
Common Confusions
GP samples are not in the RKHS
Driscoll's theorem: with probability one, samples from $\mathcal{GP}(0, K)$ do not lie in the RKHS $\mathcal{H}_K$ (whenever $\mathcal{H}_K$ is infinite-dimensional, which is the typical case). Yet the posterior mean of GP regression is the kernel ridge solution in $\mathcal{H}_K$. The GP prior is a distribution on a larger function space than $\mathcal{H}_K$; the posterior concentrates within an RKHS-shaped tube around the data.
NNGP and NTK are different limits
The NNGP kernel describes the prior distribution at random initialization: the network output is $f \sim \mathcal{GP}(0, K^{\mathrm{NNGP}})$ before any training. The NTK describes the gradient flow dynamics: under infinitesimal learning rate, the trained network's predictions equal a specific NTK kernel ridge regression. Both apply at infinite width but to different objects (init distribution vs. trained predictor).
Stationary vs. non-stationary kernels
A kernel is stationary if $K(s, t) = \kappa(s - t)$ for some function $\kappa$. Brownian motion's kernel $\min(s, t)$ is not stationary; the variance $K(t, t) = t$ grows with $t$. The Brownian bridge's kernel $\min(s, t) - st$ is also not stationary. RBF and Matérn are stationary. Stationarity matters for spectral analysis (Bochner's theorem) but not for sample-path regularity.
Continuity of paths vs. continuity of moments
A process can have a continuous covariance kernel but discontinuous sample paths, and vice versa. Kolmogorov's continuity criterion is a moment-bound sufficient condition for the existence of a continuous modification. The canonical example: the process $X_t = \mathbf{1}\{t = U\}$ with $U$ uniform on $[0, 1]$ has discontinuous sample paths, yet it is a modification of the zero process, so its covariance and all its moments are trivially continuous.
Canonical Examples
Squared-exponential kernel: smooth GP
The kernel $K(s, t) = \exp(-(s - t)^2 / (2\ell^2))$ has canonical metric $d(s, t) \le |s - t| / \ell$. By Kolmogorov continuity (with all Gaussian moments controlled by $d$), samples are $\gamma$-Hölder for every $\gamma < 1$; in fact, samples are infinitely differentiable and real-analytic on any compact set, with probability one.
The corresponding RKHS is the space of functions whose Fourier transform decays at a Gaussian rate: a very small Hilbert space, all of whose members are real-analytic. GP samples are smooth but still fall just outside this RKHS (they are rougher in the RKHS-norm sense), while the GP posterior mean always lies in it.
Matérn kernel: tunable smoothness
The Matérn kernel $k_{\nu}(r) = \frac{2^{1-\nu}}{\Gamma(\nu)} \big(\sqrt{2\nu}\, r / \ell\big)^{\nu} K_{\nu}\big(\sqrt{2\nu}\, r / \ell\big)$ with $r = |s - t|$ (where $K_{\nu}$ is the modified Bessel function of the second kind) has samples that are $k$-times mean-square differentiable precisely when $\nu > k$. At $\nu = 1/2$ the kernel reduces to the exponential kernel and samples are Brownian-motion-rough; as $\nu \to \infty$, samples become smooth like the squared-exponential.
This makes Matérn the standard kernel in spatial statistics: you choose $\nu$ to match the regularity of the underlying signal.
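A small check of these scaling claims (a sketch; the lengthscales are set to 1 and the grid of gaps is chosen arbitrarily): the canonical metric is computed directly from each stationary kernel, and its local power-law exponent in $|s - t|$ is estimated numerically.

```python
import numpy as np

# Canonical metric of a stationary kernel: d(s, t)^2 = 2 (k(0) - k(|s - t|)).
def d(kernel, h):
    return np.sqrt(2.0 * (kernel(0.0) - kernel(h)))

se       = lambda r: np.exp(-r ** 2 / 2)                                             # squared-exponential
matern12 = lambda r: np.exp(-np.abs(r))                                              # Matern-1/2 (exponential / OU)
matern32 = lambda r: (1 + np.sqrt(3) * np.abs(r)) * np.exp(-np.sqrt(3) * np.abs(r))  # Matern-3/2

hs = np.array([1e-1, 1e-2, 1e-3, 1e-4])
for name, k in [("SE", se), ("Matern-1/2", matern12), ("Matern-3/2", matern32)]:
    slope = np.diff(np.log(d(k, hs))) / np.diff(np.log(hs))   # local exponent of d in |s - t|
    print(f"{name:11s}  d(s,t) ~ |s-t|^{slope[-1]:.2f}")
# SE and Matern-3/2 give exponent ~ 1 (Kolmogorov caps the Hölder guarantee at 1; their extra
# smoothness shows up as differentiability), while Matern-1/2 gives ~ 1/2: Brownian roughness.
```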
Empirical CDF as a Brownian bridge
For i.i.d. uniform samples $U_1, \dots, U_n \sim \mathrm{Unif}[0, 1]$, the empirical process $G_n(t) = \sqrt{n}\,\big(F_n(t) - t\big)$ converges weakly to the Brownian bridge with kernel $\min(s, t) - st$. This is the original Donsker theorem (1952), and it explains why Kolmogorov-Smirnov test statistics are distribution-free.
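A Monte-Carlo illustration (a sketch; the sample size, number of replications, and grid are arbitrary, and both suprema are taken over the same finite grid so the comparison is fair): the quantiles of $\sup_t |\sqrt{n}(F_n(t) - t)|$ for uniform samples are compared with the quantiles of $\sup_t |B^{\circ}_t|$ for simulated Brownian bridges.

```python
import numpy as np

rng = np.random.default_rng(5)
n, n_reps = 200, 5000
grid = np.linspace(0.0, 1.0, 101)

# Empirical-process suprema: sup_t |sqrt(n) (F_n(t) - t)| over the grid.
U = rng.uniform(size=(n_reps, n))
F_n = np.mean(U[:, :, None] <= grid[None, None, :], axis=1)
ks_emp = np.sqrt(n) * np.max(np.abs(F_n - grid), axis=1)

# Brownian bridge suprema on the same grid: B^o_t = B_t - t * B_1.
dB = rng.standard_normal((n_reps, len(grid) - 1)) * np.sqrt(np.diff(grid))
B = np.concatenate([np.zeros((n_reps, 1)), np.cumsum(dB, axis=1)], axis=1)
ks_lim = np.max(np.abs(B - grid[None, :] * B[:, -1:]), axis=1)

for q in [0.5, 0.9, 0.99]:
    print(f"q={q:.2f}   empirical process {np.quantile(ks_emp, q):.3f}"
          f"   Brownian bridge {np.quantile(ks_lim, q):.3f}")
```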
Exercises
Problem
Verify that the Brownian bridge kernel $K(s, t) = \min(s, t) - st$ on $[0, 1]$ is positive semidefinite. Compute the canonical metric $d(s, t)$ and explain why a Brownian bridge sample is continuous but not differentiable.
Problem
Compute the NNGP kernel for a one-hidden-layer ReLU network with input $x \in \mathbb{R}^d$ and i.i.d. Gaussian weights, no biases. Verify it equals the arc-cosine kernel of order 1 (Cho-Saul 2009).
Problem
A Doob martingale for $f(X_1, \dots, X_n)$ (with the $X_i$ independent) is defined by $M_i = \mathbb{E}\big[f(X_1, \dots, X_n) \mid X_1, \dots, X_i\big]$. If $f$ has bounded differences $c_1, \dots, c_n$, show that the MDS $D_i = M_i - M_{i-1}$ satisfies $|D_i| \le c_i$ a.s. Apply Azuma-Hoeffding to recover McDiarmid's inequality.
Problem
State Borell-TIS (Tsirelson-Ibragimov-Sudakov) concentration for Gaussian processes: a sub-Gaussian tail for $\sup_{t \in T} X_t$ around its expectation, with variance proxy $\sigma_T^2 = \sup_{t \in T} \operatorname{Var}(X_t)$. Compare to Azuma-Hoeffding for empirical processes; explain why both give the same $e^{-c t^2}$ rate but Borell-TIS has a sharper constant.
References
Canonical:
- Adler & Taylor (2007), Random Fields and Geometry, Chapters 1-2 (covariance kernels, sample-path regularity, Borell-TIS)
- Karatzas & Shreve (1991), Brownian Motion and Stochastic Calculus, Chapter 2 (Kolmogorov continuity, Brownian motion construction)
- Rasmussen & Williams (2006), Gaussian Processes for Machine Learning, Chapters 2 and 4 (GP regression, kernel zoo)
- Durrett (2019), Probability: Theory and Examples (5th ed.), Chapters 4-7 (martingales, Brownian motion)
- van Handel (2016), Probability in High Dimension, Chapters 5-7 (Borell-TIS, Sudakov-Fernique, Gaussian processes for empirical processes)
Current:
- Lee, Bahri, Novak, Schoenholz, Pennington, Sohl-Dickstein (2018), "Deep Neural Networks as Gaussian Processes," ICLR, arXiv:1711.00165
- Matthews, Hron, Rowland, Turner, Ghahramani (2018), "Gaussian Process Behaviour in Wide Deep Neural Networks," ICLR, arXiv:1804.11271
- Jacot, Gabriel, Hongler (2018), "Neural Tangent Kernel: Convergence and Generalization in Neural Networks," NeurIPS, arXiv:1806.07572
- Cho & Saul (2009), "Kernel Methods for Deep Learning," NeurIPS (arc-cosine kernels)
Frontier:
- Yang (2019-2021), "Tensor Programs" series (I-V), arXiv:1910.12478, 2006.14548, 2009.10685 — modern formulation of NNGP/NTK for arbitrary architectures including transformers
- Roberts, Yaida, Hanin (2022), The Principles of Deep Learning Theory — finite-width corrections to NNGP
Next Topics
- Empirical processes and chaining: bounding suprema via covering numbers
- Neural tangent kernel: training dynamics of infinite-width networks
- Martingale theory: Azuma-Hoeffding, Doob, optional stopping
- Brownian motion: the canonical Gaussian process and its path properties
Last reviewed: May 5, 2026