Stochastic Processes for ML
Random processes as the unifying framework behind Gaussian processes, neural tangent kernels, empirical processes, and martingale concentration: definitions, sample-path regularity, and the bridges between learning theory's main probabilistic tools.
Prerequisites
- Expectation, Variance, Covariance, and Moments
- Measure-Theoretic Probability
- Concentration Inequalities
Why This Matters
A surprising amount of modern learning theory is one branch of stochastic-process theory under different names. Gaussian processes are random functions indexed by inputs. Empirical processes are random functions indexed by hypotheses. An infinitely wide neural network at initialization is a Gaussian process, and the closely related neural tangent kernel governs its training dynamics. McDiarmid's inequality and SGD convergence both rely on martingale concentration.
These look like separate topics until you see the underlying object: a random function $(X_t)_{t \in T}$ on an index set $T$, with the covariance kernel $K(s,t)$ controlling the geometry of the process. Once you internalize that, results transfer between the views without re-deriving them.
Mental Model
A stochastic process is a probability distribution over the function space $\mathbb{R}^T$ (real-valued functions on the index set $T$). You can take three different perspectives on the same object, and learning theory uses all three:
- Pointwise (marginal): at each fixed $t \in T$, $X_t$ is a real-valued random variable. Most concentration tools (Hoeffding, Bernstein) apply here.
- Path-wise (sample function): a single draw is a function $t \mapsto X_t(\omega)$ from $T$ to $\mathbb{R}$. Smoothness, continuity, and Hölder regularity apply here.
- Spectral (covariance kernel): the second-order structure $K(s,t) = \operatorname{Cov}(X_s, X_t)$ encodes everything for Gaussian processes and most of what matters for learning theory.
Switching between these views is the core skill. The empirical-process supremum $\sup_{f \in \mathcal{F}} G_n(f)$ is a path-wise quantity; bounding it via Dudley's entropy integral uses the spectral view (covering numbers in the canonical metric); showing the bound is nearly tight uses Sudakov's minoration in the same metric.
Formal Setup
Stochastic Process
A stochastic process indexed by a set $T$ on a probability space $(\Omega, \mathcal{A}, \mathbb{P})$ is a collection of random variables $(X_t)_{t \in T}$, each measurable into a state space (here always $\mathbb{R}$ with its Borel $\sigma$-algebra). When $T = [0, \infty)$ the process is in continuous time; when $T \subseteq \mathbb{R}^d$ the process is a random field; when $T = \mathcal{F}$ is a function class, the process is an empirical process.
Covariance Kernel
For a centered process ($\mathbb{E}[X_t] = 0$ for all $t$), the covariance kernel is:

$K(s, t) = \mathbb{E}[X_s X_t].$
The kernel is symmetric ($K(s,t) = K(t,s)$) and positive semidefinite (every $n \times n$ Gram matrix $[K(t_i, t_j)]_{i,j}$ is PSD). Conversely, every PSD kernel arises as the covariance of some stochastic process, via the Gaussian-process construction below; Mercer's and Bochner's theorems give spectral representations of such kernels.
The canonical metric induced by the process is:

$d(s, t) = \sqrt{\mathbb{E}\big[(X_s - X_t)^2\big]}.$

For a centered process, the canonical metric depends only on the kernel: $d(s,t)^2 = K(s,s) - 2K(s,t) + K(t,t)$.
Gaussian Process
A stochastic process $(X_t)_{t \in T}$ is a Gaussian process with mean zero and covariance $K$ if for any finite collection $t_1, \dots, t_n \in T$, the random vector $(X_{t_1}, \dots, X_{t_n})$ is jointly Gaussian with covariance matrix $[K(t_i, t_j)]_{i,j=1}^n$.
By Kolmogorov's extension theorem, any PSD kernel on $T \times T$ defines a Gaussian process; the existence is automatic. The hard question is sample-path regularity: when is a draw from this process continuous, differentiable, or Hölder?
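The finite-dimensional marginals are all you need to simulate such a process on a grid. Below is a minimal NumPy sketch (the squared-exponential kernel, the grid, and the jitter constant are illustrative choices, not part of the definition): it draws sample paths from the finite marginal $\mathcal{N}(0, K)$ via a Cholesky factor and evaluates the canonical metric directly from the kernel.

```python
import numpy as np

def rbf_kernel(s, t, lengthscale=0.5):
    """Squared-exponential kernel K(s, t) = exp(-(s - t)^2 / (2 * lengthscale^2))."""
    return np.exp(-((s[:, None] - t[None, :]) ** 2) / (2 * lengthscale ** 2))

def sample_gp(ts, kernel, n_draws=3, jitter=1e-10, seed=0):
    """Draw sample paths of the centered GP with the given kernel, using the
    finite-dimensional marginal N(0, K) on the grid `ts`."""
    rng = np.random.default_rng(seed)
    K = kernel(ts, ts) + jitter * np.eye(len(ts))   # Gram matrix of the finite marginal
    L = np.linalg.cholesky(K)                       # K = L @ L.T
    return L @ rng.standard_normal((len(ts), n_draws))

ts = np.linspace(0.0, 1.0, 200)
paths = sample_gp(ts, rbf_kernel)                   # shape (200, 3): three sample paths

# Canonical metric from the kernel: d(s, t)^2 = K(s, s) - 2 K(s, t) + K(t, t)
K = rbf_kernel(ts, ts)
d = np.sqrt(np.clip(np.diag(K)[:, None] - 2 * K + np.diag(K)[None, :], 0.0, None))
```

The three views from the mental model show up directly: one column of `paths` is a path-wise draw, one row is the pointwise marginal across draws, and `K` with `d` carries the spectral view.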
Sample-Path Regularity
A Gaussian process exists abstractly given a PSD kernel. But whether you can realize it with continuous sample paths is a separate question, controlled by how fast the canonical metric grows.
Kolmogorov Continuity Criterion
Statement
Let $(X_t)_{t \in [0, T]}$ be a stochastic process such that for some constants $\alpha, \beta, C > 0$ and all $s, t$:

$\mathbb{E}\big[|X_t - X_s|^{\alpha}\big] \le C\, |t - s|^{1 + \beta}.$

Then $X$ admits a modification $\tilde{X}$ (i.e., $\mathbb{P}(\tilde{X}_t = X_t) = 1$ for each $t$) with sample paths that are locally $\gamma$-Hölder continuous for every $\gamma \in (0, \beta/\alpha)$.
Source: Karatzas-Shreve (1991), Theorem 2.2.8; Stroock-Varadhan (1979), Theorem 2.1.6.
Intuition
The criterion translates a moment bound on increments into a path-regularity guarantee. Brownian motion has Gaussian increments with $\mathbb{E}|B_t - B_s|^{2p} = C_p\, |t - s|^{p}$ for every $p \ge 1$, so taking $\alpha = 2p$ large gives continuity with $\gamma < (p - 1)/(2p)$: Brownian paths are $\gamma$-Hölder for every $\gamma < 1/2$, and not better (this is sharp).
For a Gaussian process with kernel satisfying $\mathbb{E}[(X_s - X_t)^2] \le L^2 |s - t|^{2H}$, all increment moments are controlled by the variance ($\mathbb{E}|X_s - X_t|^{2p} = C_p\, d(s,t)^{2p}$ for Gaussian increments), and Kolmogorov gives continuity with Hölder exponent up to $H$.
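A quick Monte-Carlo sanity check of the Brownian case (the moment order and sample sizes below are arbitrary illustrative choices): the $2p$-th increment moment scales like $|t - s|^p$, which enters the criterion with $\alpha = 2p$ and $\beta = p - 1$.

```python
import numpy as np

# Monte-Carlo check of the increment-moment bound for Brownian motion:
# E|B_{t+h} - B_t|^{2p} = C_p * h^p with C_p = (2p - 1)!! for a standard normal.
rng = np.random.default_rng(0)
p = 3                                   # use alpha = 2p = 6 in Kolmogorov's criterion
for h in [0.01, 0.02, 0.05, 0.1]:
    increments = np.sqrt(h) * rng.standard_normal(200_000)   # B_{t+h} - B_t ~ N(0, h)
    emp = np.mean(np.abs(increments) ** (2 * p))
    theory = 15.0 * h ** p                                    # (2*3 - 1)!! = 15
    print(f"h={h:5.2f}  empirical={emp:.3e}  theory={theory:.3e}")
# With alpha = 6 the bound is ~ h^3 = h^{1+2}, so beta = 2 and gamma < beta/alpha = 1/3;
# letting p grow pushes the guaranteed Hölder exponent toward 1/2.
```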
Proof Sketch
Discretize the index set into dyadic grids at scales $2^{-n}$. At scale $n$ there are $\sim 2^{n}$ grid points, and each adjacent pair has increment moment bounded by $C\, 2^{-n(1+\beta)}$. Markov plus a union bound gives a uniform increment bound at each scale, and the bounds are summable in $n$ whenever $\gamma < \beta/\alpha$. The resulting series is finite a.s., and chaining across the dyadic scales defines a continuous modification.
Why It Matters
This is the bridge from spectral information (a kernel decay rate) to path-wise information (Hölder regularity of GP samples). For the squared-exponential kernel $K(s,t) = \exp(-(s-t)^2/(2\ell^2))$, the canonical metric satisfies $d(s,t) \le |s-t|/\ell$, so Kolmogorov gives every Hölder exponent $\gamma < 1$ (and the samples are in fact infinitely differentiable). For the Matérn kernel with smoothness $\nu \in (0,1)$, $d(s,t) \asymp |s-t|^{\nu}$, and samples are $\gamma$-Hölder exactly for $\gamma < \nu$.
This determines whether GP regression with a given kernel can recover functions in a corresponding Hölder class, and at what rate.
Failure Mode
The criterion is sufficient, not necessary. A process can have a continuous modification without satisfying the moment bound for any useful exponents (e.g., the deterministic process $X_t = f(t)$ with $f$ continuous but not Hölder). Conversely, failing the criterion tells you nothing: the process may still admit a continuous modification. For non-Gaussian processes with heavy-tailed increments, only small moment orders $\alpha$ are available, so the criterion can certify much weaker continuity than the paths actually have.
Brownian Motion: The Canonical Gaussian Process
Brownian motion (the Wiener process) is the prototypical Gaussian process: it has the simplest possible kernel and yet exhibits all the path-regularity phenomena that complicate general processes.
A standard Brownian motion $(B_t)_{t \ge 0}$ is the centered Gaussian process with covariance:

$K(s, t) = \min(s, t).$

The canonical metric is $d(s,t) = \sqrt{\mathbb{E}[(B_s - B_t)^2]} = \sqrt{|s - t|}$, growing as the square root of the time separation. By Kolmogorov continuity, $B$ admits a continuous modification with paths that are $\gamma$-Hölder for every $\gamma < 1/2$.
Why this is canonical. Many stochastic processes used in ML can be written as functionals of Brownian motion:
- Ornstein-Uhlenbeck process: $X_t = e^{-t} B_{e^{2t}}$ is a stationary GP with exponential kernel $K(s,t) = e^{-|s-t|}$. This is the Matérn-1/2 kernel and the noise process used in score-matching diffusion models.
- Brownian bridge: $B^{\circ}_t = B_t - t B_1$ on $[0, 1]$ has kernel $K(s,t) = \min(s,t) - st$ and is the limit process for empirical CDFs (Donsker's theorem).
- Fractional Brownian motion: the GP with $K(s,t) = \tfrac{1}{2}\big(s^{2H} + t^{2H} - |s-t|^{2H}\big)$ for Hurst index $H \in (0, 1)$. Reduces to Brownian motion at $H = 1/2$.
Quadratic variation. For Brownian motion, $\sum_i (B_{t_{i+1}} - B_{t_i})^2 \to t$ in $L^2$ as the partition mesh $\to 0$. This is why Itô calculus has a second-order correction term: ordinary Riemann-Stieltjes integration fails because the squared increment $(B_{t+h} - B_t)^2$ is of order $h$, not $h^2$, even at infinitesimal scale.
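A minimal NumPy sketch of this limit (the grid sizes and seed are arbitrary choices): one Brownian path is simulated on a fine grid and its quadratic variation is evaluated over progressively coarser partitions of $[0, 1]$.

```python
import numpy as np

# Simulate one Brownian path on [0, 1] and check that the quadratic variation
# sum_i (B_{t_{i+1}} - B_{t_i})^2 concentrates around t = 1 as the mesh shrinks.
rng = np.random.default_rng(1)
n_fine = 2 ** 18
dB = np.sqrt(1.0 / n_fine) * rng.standard_normal(n_fine)   # increments on the finest grid
B = np.concatenate([[0.0], np.cumsum(dB)])                  # one Brownian path

for n in [2 ** k for k in (4, 8, 12, 16)]:
    idx = np.linspace(0, n_fine, n + 1, dtype=int)          # coarser partition of [0, 1]
    qv = np.sum(np.diff(B[idx]) ** 2)
    print(f"mesh = 1/{n:6d}   quadratic variation = {qv:.4f}")   # -> 1.0
```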
Connection to Empirical Processes
An empirical process over a function class $\mathcal{F}$ is the stochastic process $G_n(f) = \sqrt{n}\,(P_n - P)f = \frac{1}{\sqrt{n}} \sum_{i=1}^{n} \big(f(Z_i) - \mathbb{E}_P[f]\big)$, indexed by $f \in \mathcal{F}$, where $Z_1, \dots, Z_n \sim P$ i.i.d. The covariance kernel is:

$K(f, g) = \operatorname{Cov}_P\big(f(Z), g(Z)\big) = \mathbb{E}_P[fg] - \mathbb{E}_P[f]\,\mathbb{E}_P[g].$

The canonical metric is $d(f, g) = \sqrt{\operatorname{Var}_P\big(f(Z) - g(Z)\big)}$, which is dominated by the $L^2(P)$ distance $\|f - g\|_{L^2(P)}$.
Under regularity conditions on $\mathcal{F}$, the empirical process converges in distribution to a centered Gaussian process indexed by $\mathcal{F}$ with this covariance:
Donsker's Theorem (Functional CLT)
Statement
Let $\mathcal{F}$ be a Donsker class for $P$. Then the empirical process $G_n = \sqrt{n}\,(P_n - P)$ converges in distribution, in $\ell^{\infty}(\mathcal{F})$, to a centered Gaussian process $G_P$ indexed by $\mathcal{F}$ with covariance $K(f,g) = \operatorname{Cov}_P(f, g)$:

$G_n \rightsquigarrow G_P \quad \text{in } \ell^{\infty}(\mathcal{F}).$
Source: van der Vaart-Wellner (1996), Theorem 2.5.2; the classical case (empirical CDF, $\mathcal{F} = \{\mathbf{1}_{(-\infty, t]} : t \in \mathbb{R}\}$) is the original Donsker theorem (1952), with the Brownian bridge as the limit.
Intuition
This is the function-space analog of the central limit theorem. For the empirical CDF on the real line, the limit process is the Brownian bridge composed with $F$, which has covariance $F(s \wedge t) - F(s)F(t)$. For more complex classes (VC classes, norm-bounded RKHS balls), Donsker's theorem requires the class to be suitably measurable and to have a finite uniform entropy integral.
The practical consequence: bounding the finite-sample supremum $\sup_{f \in \mathcal{F}} |G_n(f)|$ via chaining is essentially the same as bounding the supremum of the limit Gaussian process $G_P$. Dudley's entropy integral works for both.
Proof Sketch
The proof has three pieces: (1) finite-dimensional convergence by the classical multivariate CLT applied to any finite collection $f_1, \dots, f_k \in \mathcal{F}$; (2) uniform asymptotic equicontinuity, which follows from the finite uniform entropy integral (the Donsker class condition); (3) tightness in $\ell^{\infty}(\mathcal{F})$ via the equicontinuity, by an Arzelà-Ascoli style argument. Putting the pieces together gives weak convergence in the function-space sense.
Why It Matters
Donsker's theorem is the reason that learning theory results derived for Gaussian processes (Sudakov-Fernique, generic chaining, Borell-TIS concentration) transfer to empirical processes. The two are the same object in the limit, modulo finite-sample corrections.
Concretely, this is what licenses using $1/\sqrt{n}$-rate Gaussian-process heuristics in finite-sample analysis. Bracketing-entropy conditions (van der Vaart-Wellner, Sec. 2.7) give a complementary route to the Donsker property that also covers parametric and smoothly parameterized classes.
Failure Mode
Donsker's theorem requires the class to be Donsker, which excludes sufficiently rich classes (e.g., the class of all measurable indicator functions, or sequences of VC classes whose dimension grows with $n$). For non-Donsker classes, you fall back on direct chaining bounds without an asymptotic GP limit.
Neural Networks as Gaussian Processes
The neural-network-Gaussian-process correspondence is one of the cleanest applications of stochastic-process theory to deep learning. At initialization, an infinitely wide neural network with i.i.d. Gaussian weights is a Gaussian process, with a kernel computable in closed form.
Neural Network Gaussian Process Correspondence
Statement
Let $f(x) \in \mathbb{R}$ be the output of a fully-connected network with activation $\phi$ and i.i.d. Gaussian initialization: weights $W^{(\ell)}_{ij} \sim \mathcal{N}(0, \sigma_w^2 / n_{\ell-1})$ and biases $b^{(\ell)}_i \sim \mathcal{N}(0, \sigma_b^2)$. As all hidden widths $n_1, \dots, n_{L-1} \to \infty$ jointly, the output converges in distribution (over the random init) to a centered Gaussian process whose NNGP kernel is defined recursively by:

$K^{(1)}(x, x') = \sigma_b^2 + \frac{\sigma_w^2}{n_0}\, x^{\top} x', \qquad K^{(\ell+1)}(x, x') = \sigma_b^2 + \sigma_w^2\, \mathbb{E}_{(u, v) \sim \mathcal{N}(0,\, \Lambda^{(\ell)})}\big[\phi(u)\,\phi(v)\big],$

where $\Lambda^{(\ell)}$ is the 2×2 covariance matrix $\begin{pmatrix} K^{(\ell)}(x, x) & K^{(\ell)}(x, x') \\ K^{(\ell)}(x, x') & K^{(\ell)}(x', x') \end{pmatrix}$.
Source: Neal (1996) for one hidden layer; Lee, Bahri, Novak, Schoenholz, Pennington, Sohl-Dickstein (2018), "Deep Neural Networks as Gaussian Processes," ICLR; Matthews et al. (2018), "Gaussian Process Behaviour in Wide Deep Neural Networks," ICLR.
Intuition
At each hidden layer, the preactivations are sums over the previous layer's neurons of (weight) × (activation), i.e., sums of i.i.d. terms. The classical CLT says the sum, scaled by $1/\sqrt{n_{\ell}}$ (the scaling supplied by the $\sigma_w^2 / n_{\ell}$ weight variance), converges to a Gaussian whose variance equals the per-term variance. This is the key step: at infinite width, every layer's preactivations are Gaussian, and the covariance is determined by the previous layer's covariance through the recursion above.
The recursion is computable in closed form for ReLU (the Cho-Saul 2009 arc-cosine kernel) and for a few other activations such as erf; otherwise it is evaluated by one-dimensional Gaussian quadrature. A sketch of the ReLU case follows.
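The snippet below is a hypothetical NumPy implementation of the ReLU recursion under simplifying assumptions not fixed by the statement above: no biases ($\sigma_b^2 = 0$), weight variance $\sigma_w^2 = 2$, and depth counted in weight layers; the inputs, width, and sample counts are arbitrary. It also estimates the finite-width output covariance by Monte Carlo over random initializations, which should match the closed form up to $O(1/\text{width})$ and sampling error.

```python
import numpy as np

def relu_nngp_layer(K, sw2=2.0):
    """One NNGP recursion step for ReLU via the Cho-Saul arc-cosine formula.
    K is the 2x2 covariance of the previous layer's preactivations at (x, x')."""
    k11, k22, k12 = K[0, 0], K[1, 1], K[0, 1]
    cos_t = np.clip(k12 / np.sqrt(k11 * k22), -1.0, 1.0)
    theta = np.arccos(cos_t)
    cross = np.sqrt(k11 * k22) / (2 * np.pi) * (np.sin(theta) + (np.pi - theta) * cos_t)
    return sw2 * np.array([[k11 / 2.0, cross], [cross, k22 / 2.0]])  # E[relu(u)^2] = k/2

def nngp_kernel(x, xp, depth=3, sw2=2.0):
    """Closed-form NNGP 2x2 kernel for a ReLU net with `depth` weight layers, no biases."""
    d = x.shape[0]
    K = sw2 / d * np.array([[x @ x, x @ xp], [x @ xp, xp @ xp]])  # layer-1 preactivation covariance
    for _ in range(depth - 1):
        K = relu_nngp_layer(K, sw2)
    return K

def finite_width_cov(x, xp, width=512, depth=3, sw2=2.0, n_nets=1000, seed=0):
    """Monte-Carlo output covariance over random initializations at finite width."""
    rng = np.random.default_rng(seed)
    X = np.stack([x, xp])                              # (2, d)
    outs = np.zeros((n_nets, 2))
    for i in range(n_nets):
        h = X
        for _ in range(depth - 1):
            W = rng.standard_normal((h.shape[1], width)) * np.sqrt(sw2 / h.shape[1])
            h = np.maximum(h @ W, 0.0)                 # hidden ReLU layer
        w_out = rng.standard_normal((h.shape[1], 1)) * np.sqrt(sw2 / h.shape[1])
        outs[i] = (h @ w_out).ravel()
    return outs.T @ outs / n_nets                      # empirical 2x2 output covariance

x, xp = np.array([1.0, 0.0, 1.0]), np.array([0.0, 1.0, 1.0])
print(nngp_kernel(x, xp))         # exact infinite-width kernel
print(finite_width_cov(x, xp))    # agrees up to O(1/width) and Monte-Carlo error
```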
Proof Sketch
Iterate the central limit theorem layer by layer. At layer $\ell + 1$, conditional on the layer-$\ell$ preactivations being Gaussian with covariance $K^{(\ell)}$, the preactivations at layer $\ell + 1$ are sums of $n_{\ell}$ i.i.d. terms. By the multivariate CLT (with vanishing higher moments under a Lipschitz condition on $\phi$), the limit is Gaussian with covariance $K^{(\ell+1)}$. The technical core is showing the convergence is uniform over finite collections of inputs, which requires controlling the continuity of $K^{(\ell)}$ and the growth of $\phi$. See Lee et al. (2018), Theorem 1, for the formal statement.
Why It Matters
The NNGP correspondence enables exact Bayesian inference with infinite-width networks: you compute the NNGP kernel for your architecture, then do GP regression with that kernel. The posterior is in closed form.
The closely related neural tangent kernel captures the training dynamics of infinite-width networks: under gradient flow with infinitesimal learning rate, the NTK is constant during training, and the trained network's predictions equal NTK kernel ridge regression. NNGP is for inference at init; NTK is for training trajectories.
This gives a complete theory of infinite-width networks: they are Gaussian processes both before and (in a different sense) after training. The hard open question is what finite width changes, since real networks have features that infinite-width theory cannot model.
Failure Mode
The correspondence holds in the infinite-width limit. At finite width, correction terms appear; the leading correction to kernel quantities is $O(1/\text{width})$. For modern transformers (where the relevant scaling involves sequence length and depth, not just width), the NNGP correspondence is more delicate; see Yang's "Tensor Programs" series for the modern formulation.
The correspondence does not capture feature learning: in the NNGP/NTK limits, the network's intermediate representations are frozen at their initialization values. Real networks learn features, which is why finite-width theory matters for explaining modern deep learning.
Martingales: Concentration for Adapted Sequences
Empirical processes assume i.i.d. sampling. Many ML problems involve adapted data: the next sample depends on the past (online learning, RL, sequential experiments). Martingales are the right generalization.
Martingale
A sequence $(M_n)_{n \ge 0}$ adapted to a filtration $(\mathcal{F}_n)_{n \ge 0}$ is a martingale if $\mathbb{E}|M_n| < \infty$ and for all $n$:

$\mathbb{E}[M_{n+1} \mid \mathcal{F}_n] = M_n.$

The increments $D_n = M_n - M_{n-1}$ form a martingale difference sequence (MDS) with $\mathbb{E}[D_n \mid \mathcal{F}_{n-1}] = 0$.
Azuma-Hoeffding for Bounded MDS
Statement
Let $(D_i)_{i=1}^{n}$ be a martingale difference sequence with $|D_i| \le c_i$ a.s. for all $i$. Let $M_n = \sum_{i=1}^{n} D_i$. Then for all $t > 0$:

$\mathbb{P}(M_n \ge t) \le \exp\!\left(-\frac{t^2}{2 \sum_{i=1}^{n} c_i^2}\right),$

and the same bound holds for $\mathbb{P}(M_n \le -t)$.
Source: Azuma (1967), "Weighted sums of certain dependent random variables," Tohoku Math. J. 19; Hoeffding (1963), Theorem 2.
Intuition
This is Hoeffding's inequality without the independence assumption: the martingale difference structure ($\mathbb{E}[D_i \mid \mathcal{F}_{i-1}] = 0$) replaces independence and gives the same sub-Gaussian tail. The constant $\sum_i c_i^2$ is the worst-case "effective variance bound"; Bernstein-type tightening (Freedman's inequality) replaces it with the predictable conditional variance $\sum_i \mathbb{E}[D_i^2 \mid \mathcal{F}_{i-1}]$.
This is what powers concentration for adaptive ML algorithms: in online learning, the deviation of the cumulative loss from its conditionally expected value is a martingale by construction (each new loss is centered given the algorithm's past), so Azuma-Hoeffding controls it at every horizon, and a union bound over horizons gives anytime regret bounds.
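A Monte-Carlo check of the bound (a sketch; the adaptive scale rule, horizon, and sample counts are arbitrary choices): the increments below are not independent, since their magnitude depends on the running sum, but they are conditionally centered, which is all Azuma-Hoeffding uses.

```python
import numpy as np

# Bounded martingale difference sequence with an adapted (past-dependent) scale.
rng = np.random.default_rng(2)
n, n_paths = 200, 100_000
M = np.zeros(n_paths)
for _ in range(n):
    scale = np.where(M > 0, 1.0, 0.5)                  # predictable: depends only on the past
    D = scale * rng.choice([-1.0, 1.0], size=n_paths)  # fair sign => conditional mean zero, |D| <= 1
    M += D

sum_c2 = n * 1.0 ** 2                                  # worst-case bound c_i = 1 at every step
for t in [10.0, 20.0, 30.0]:
    emp = np.mean(M >= t)
    bound = np.exp(-t ** 2 / (2 * sum_c2))
    print(f"t={t:5.1f}  P(M_n >= t) ~ {emp:.4f}   Azuma bound {bound:.4f}")
```

The empirical tail sits well below the bound here because the adaptive scale is often smaller than the worst case, which is exactly the slack that Freedman's conditional-variance refinement removes.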
Proof Sketch
The proof mirrors scalar Hoeffding. For each $i$, the conditional MGF satisfies $\mathbb{E}[e^{\lambda D_i} \mid \mathcal{F}_{i-1}] \le e^{\lambda^2 c_i^2 / 2}$ by Hoeffding's lemma. Then $\mathbb{E}[e^{\lambda M_n}] = \mathbb{E}\big[e^{\lambda M_{n-1}}\, \mathbb{E}[e^{\lambda D_n} \mid \mathcal{F}_{n-1}]\big] \le e^{\lambda^2 c_n^2 / 2}\, \mathbb{E}[e^{\lambda M_{n-1}}]$; iterate over $i$, apply Markov's inequality, and optimize over $\lambda$. The inner conditioning step is exactly where independence is replaced by the MDS property.
Why It Matters
- McDiarmid's inequality (the workhorse for generalization bounds) is proved by applying Azuma-Hoeffding to the Doob martingale of any bounded-differences function.
- Online learning regret bounds: Hedge, EXP3, UCB all rely on Azuma-Hoeffding to control cumulative deviations.
- Stochastic approximation / SGD convergence: noisy gradient terms form an MDS, and Azuma-Hoeffding gives anytime convergence rates.
- Doob martingales for functions of independent variables: when $f(X_1, \dots, X_n)$ has bounded differences in each coordinate, the Doob martingale $M_i = \mathbb{E}[f(X_1, \dots, X_n) \mid X_1, \dots, X_i]$ has increments bounded by those differences, i.e., it is a bounded MDS, giving McDiarmid's tail.
Failure Mode
Azuma-Hoeffding applied to the Doob martingale is loose by a factor of 4 in the exponent compared to McDiarmid's classical form for bounded-differences functions: conditionally on the past, the Doob increment lies in an interval of length $c_i$, which gives variance proxy $c_i^2/4$ via Hoeffding's lemma, whereas the symmetric bound $|D_i| \le c_i$ only gives $c_i^2$. For tighter rates, use Freedman's inequality, which replaces $\sum_i c_i^2$ by the predictable quadratic variation $\sum_i \mathbb{E}[D_i^2 \mid \mathcal{F}_{i-1}]$, when a conditional variance bound is available.
Reproducing Kernel Hilbert Space Connection
Every PSD kernel $K$ defines both a Gaussian process $\mathcal{GP}(0, K)$ and a reproducing kernel Hilbert space (RKHS) $\mathcal{H}_K$. These are dual representations of the same kernel:
- GP view: $K$ generates random functions $f \sim \mathcal{GP}(0, K)$. Sample paths typically lie outside $\mathcal{H}_K$ with probability one.
- RKHS view: $K$ generates a deterministic Hilbert space of functions. Each $f \in \mathcal{H}_K$ admits the reproducing property $f(x) = \langle f, K(\cdot, x) \rangle_{\mathcal{H}_K}$.
The connection: the posterior mean of GP regression with kernel $K$ and noise variance $\sigma^2$ on data $(x_i, y_i)_{i=1}^{n}$ is exactly the kernel ridge regression solution in $\mathcal{H}_K$:

$\hat{f}(x) = \sum_{i=1}^{n} \alpha_i\, K(x, x_i), \qquad \alpha = (\mathbf{K} + \sigma^2 I)^{-1} y,$

where $\mathbf{K} = [K(x_i, x_j)]_{i,j}$ and $y = (y_1, \dots, y_n)$. This is also the minimizer of $\sum_{i=1}^{n} (y_i - f(x_i))^2 + \sigma^2 \|f\|_{\mathcal{H}_K}^2$ over $f \in \mathcal{H}_K$.
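A minimal NumPy sketch of this identity (the squared-exponential kernel, the toy data, and the noise level are illustrative choices, not from the text): the GP posterior mean and the kernel ridge predictor come from the same linear system, with the ridge parameter playing the role of the observation-noise variance.

```python
import numpy as np

def rbf(A, B, ell=0.5):
    """Squared-exponential kernel matrix between 1-D input sets A and B."""
    return np.exp(-((A[:, None] - B[None, :]) ** 2) / (2 * ell ** 2))

rng = np.random.default_rng(3)
X = rng.uniform(0, 1, size=12)                             # training inputs
y = np.sin(2 * np.pi * X) + 0.1 * rng.standard_normal(12)  # noisy targets
X_test = np.linspace(0, 1, 100)
noise = 0.1 ** 2                                           # sigma^2 = ridge parameter

K = rbf(X, X)
K_star = rbf(X_test, X)

# GP regression posterior mean: k(x, X) (K + sigma^2 I)^{-1} y
gp_mean = K_star @ np.linalg.solve(K + noise * np.eye(len(X)), y)

# Kernel ridge regression: f(x) = sum_i alpha_i k(x, x_i) with alpha = (K + lambda I)^{-1} y
alpha = np.linalg.solve(K + noise * np.eye(len(X)), y)
krr_pred = K_star @ alpha

print(np.max(np.abs(gp_mean - krr_pred)))                  # 0.0: the two predictors coincide
```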
Why the GP samples are outside the RKHS. The RKHS norm penalizes high-frequency components heavily, and a GP sample has infinite RKHS norm almost surely (it is "rougher" than any RKHS function). Driscoll's zero-one law states that a GP sample lies in the RKHS with probability either $0$ or $1$. In the Karhunen-Loève expansion $f = \sum_k \sqrt{\lambda_k}\, \xi_k \phi_k$, where the $\lambda_k$ are the eigenvalues of $K$ as an integral operator and the $\xi_k$ are i.i.d. standard normals, the squared RKHS norm is $\|f\|_{\mathcal{H}_K}^2 = \sum_k \xi_k^2$, which is finite only if finitely many $\lambda_k$ are nonzero. For typical kernels infinitely many eigenvalues are nonzero, so samples lie outside $\mathcal{H}_K$ almost surely. (Lukić-Beder, 2001, gave the cleanest modern statement.)
This is the resolution to a common confusion: the GP defines a prior distribution over functions, but the posterior mean is the RKHS-norm-minimizing fit, and the prior support is not the RKHS itself.
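A finite-grid illustration of the norm blow-up (a hypothetical NumPy sketch; the kernel, lengthscale, grid sizes, and jitter are arbitrary): on an $n$-point grid, the quadratic form $f^{\top} \mathbf{K}^{-1} f$ is the discrete proxy for $\|f\|_{\mathcal{H}_K}^2$. For a GP sample it is $\chi^2_n$ in distribution and grows with the grid size, while for the genuine RKHS function $K(\cdot, x_0)$ it stays near $K(x_0, x_0) = 1$ no matter how fine the grid.

```python
import numpy as np

def rbf(A, B, ell=0.3):
    return np.exp(-((A[:, None] - B[None, :]) ** 2) / (2 * ell ** 2))

rng = np.random.default_rng(4)
for n in [50, 100, 200, 400]:
    x = np.linspace(0, 1, n)
    K = rbf(x, x) + 1e-6 * np.eye(n)                  # jitter for numerical stability
    L = np.linalg.cholesky(K)

    f_gp = L @ rng.standard_normal(n)                 # one GP sample on the grid
    gp_quad = f_gp @ np.linalg.solve(K, f_gp)         # grid proxy for ||f||_H^2, ~ chi^2_n

    f_rkhs = K[:, n // 2]                             # the RKHS function k(., x_0) on the grid
    rkhs_quad = f_rkhs @ np.linalg.solve(K, f_rkhs)   # stays ~ k(x_0, x_0) = 1 for every n

    print(f"n={n:4d}   GP sample: {gp_quad:8.1f}   RKHS function: {rkhs_quad:.3f}")
```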
Common Confusions
GP samples are not in the RKHS
Driscoll's theorem: with probability one, samples from $\mathcal{GP}(0, K)$ do not lie in the RKHS $\mathcal{H}_K$ (whenever $\mathcal{H}_K$ is infinite-dimensional, which is the typical case). Yet the posterior mean of GP regression is the kernel ridge solution in $\mathcal{H}_K$. The GP prior is a distribution on a larger function space than $\mathcal{H}_K$; the posterior concentrates within an RKHS-shaped tube around the data.
NNGP and NTK are different limits
The NNGP kernel describes the prior distribution at random initialization: the network output is $f \sim \mathcal{GP}(0, K^{\mathrm{NNGP}})$ before any training. The NTK describes the gradient flow dynamics: under infinitesimal learning rate, the trained network's predictions equal a specific NTK kernel ridge regression. Both apply at infinite width but to different objects (init distribution vs. trained predictor).
Stationary vs. non-stationary kernels
A kernel is stationary if $K(s, t) = \kappa(s - t)$ for some function $\kappa$. Brownian motion's kernel $\min(s, t)$ is not stationary; the variance $K(t, t) = t$ grows with $t$. The Brownian bridge's kernel $\min(s, t) - st$ is also not stationary. RBF and Matérn are stationary. Stationarity matters for spectral analysis (Bochner's theorem) but not for sample-path regularity.
Continuity of paths vs. continuity of moments
A process can have a continuous covariance kernel but discontinuous sample paths, and vice versa. Kolmogorov's continuity criterion is a moment-bound sufficient condition for the existence of a continuous modification. The canonical example: the process $X_t = \mathbf{1}\{t = U\}$ with $U$ uniform on $[0, 1]$ has discontinuous sample paths, yet it is a modification of the zero process, so its covariance and all its moments are trivially continuous.
Canonical Examples
Squared-exponential kernel: smooth GP
The kernel $K(s, t) = \exp(-(s - t)^2 / (2\ell^2))$ has canonical metric $d(s, t) \le |s - t| / \ell$. By Kolmogorov continuity (with all Gaussian moments controlled by $d$), samples are $\gamma$-Hölder for every $\gamma < 1$; in fact, samples are infinitely differentiable and real-analytic on any compact set, with probability one.
The corresponding RKHS is the space of functions whose Fourier transform decays at a Gaussian rate: a very small Hilbert space, all of whose members are real-analytic. GP samples are smooth but still fall just outside this RKHS (they are rougher in the RKHS-norm sense), while the GP posterior mean always lies in it.
Matérn kernel: tunable smoothness
The Matérn kernel $k_{\nu}(r) = \frac{2^{1-\nu}}{\Gamma(\nu)} \big(\sqrt{2\nu}\, r / \ell\big)^{\nu} K_{\nu}\big(\sqrt{2\nu}\, r / \ell\big)$ with $r = |s - t|$ (where $K_{\nu}$ is the modified Bessel function of the second kind) has samples that are $k$-times mean-square differentiable precisely when $\nu > k$. At $\nu = 1/2$ the kernel reduces to the exponential kernel and samples are Brownian-motion-rough; as $\nu \to \infty$, samples become smooth like the squared-exponential.
This makes Matérn the standard kernel in spatial statistics: you choose $\nu$ to match the regularity of the underlying signal.
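A small check of these scaling claims (a sketch; the lengthscales are set to 1 and the grid of gaps is chosen arbitrarily): the canonical metric is computed directly from each stationary kernel, and its local power-law exponent in $|s - t|$ is estimated numerically.

```python
import numpy as np

# Canonical metric of a stationary kernel: d(s, t)^2 = 2 (k(0) - k(|s - t|)).
def d(kernel, h):
    return np.sqrt(2.0 * (kernel(0.0) - kernel(h)))

se       = lambda r: np.exp(-r ** 2 / 2)                                             # squared-exponential
matern12 = lambda r: np.exp(-np.abs(r))                                              # Matern-1/2 (exponential / OU)
matern32 = lambda r: (1 + np.sqrt(3) * np.abs(r)) * np.exp(-np.sqrt(3) * np.abs(r))  # Matern-3/2

hs = np.array([1e-1, 1e-2, 1e-3, 1e-4])
for name, k in [("SE", se), ("Matern-1/2", matern12), ("Matern-3/2", matern32)]:
    slope = np.diff(np.log(d(k, hs))) / np.diff(np.log(hs))   # local exponent of d in |s - t|
    print(f"{name:11s}  d(s,t) ~ |s-t|^{slope[-1]:.2f}")
# SE and Matern-3/2 give exponent ~ 1 (Kolmogorov caps the Hölder guarantee at 1; their extra
# smoothness shows up as differentiability), while Matern-1/2 gives ~ 1/2: Brownian roughness.
```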
Empirical CDF as a Brownian bridge
For i.i.d. uniform samples $U_1, \dots, U_n \sim \mathrm{Unif}[0, 1]$, the empirical process $G_n(t) = \sqrt{n}\,\big(F_n(t) - t\big)$ converges weakly to the Brownian bridge with kernel $\min(s, t) - st$. This is the original Donsker theorem (1952), and it explains why Kolmogorov-Smirnov test statistics are distribution-free.
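A Monte-Carlo illustration (a sketch; the sample size, number of replications, and grid are arbitrary, and both suprema are taken over the same finite grid so the comparison is fair): the quantiles of $\sup_t |\sqrt{n}(F_n(t) - t)|$ for uniform samples are compared with the quantiles of $\sup_t |B^{\circ}_t|$ for simulated Brownian bridges.

```python
import numpy as np

rng = np.random.default_rng(5)
n, n_reps = 200, 5000
grid = np.linspace(0.0, 1.0, 101)

# Empirical-process suprema: sup_t |sqrt(n) (F_n(t) - t)| over the grid.
U = rng.uniform(size=(n_reps, n))
F_n = np.mean(U[:, :, None] <= grid[None, None, :], axis=1)
ks_emp = np.sqrt(n) * np.max(np.abs(F_n - grid), axis=1)

# Brownian bridge suprema on the same grid: B^o_t = B_t - t * B_1.
dB = rng.standard_normal((n_reps, len(grid) - 1)) * np.sqrt(np.diff(grid))
B = np.concatenate([np.zeros((n_reps, 1)), np.cumsum(dB, axis=1)], axis=1)
ks_lim = np.max(np.abs(B - grid[None, :] * B[:, -1:]), axis=1)

for q in [0.5, 0.9, 0.99]:
    print(f"q={q:.2f}   empirical process {np.quantile(ks_emp, q):.3f}"
          f"   Brownian bridge {np.quantile(ks_lim, q):.3f}")
```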
Exercises
Problem
Verify that the Brownian bridge kernel $K(s, t) = \min(s, t) - st$ on $[0, 1]$ is positive semidefinite. Compute the canonical metric $d(s, t)$ and explain why a Brownian bridge sample is continuous but not differentiable.
Problem
Compute the NNGP kernel for a one-hidden-layer ReLU network with input $x \in \mathbb{R}^d$ and i.i.d. Gaussian weights, no biases. Verify it equals the arc-cosine kernel of order 1 (Cho-Saul 2009).
Problem
A Doob martingale for $f(X_1, \dots, X_n)$ (with the $X_i$ independent) is defined by $M_i = \mathbb{E}\big[f(X_1, \dots, X_n) \mid X_1, \dots, X_i\big]$. If $f$ has bounded differences $c_1, \dots, c_n$, show that the MDS $D_i = M_i - M_{i-1}$ satisfies $|D_i| \le c_i$ a.s. Apply Azuma-Hoeffding to recover McDiarmid's inequality.
Problem
State Borell-TIS (Tsirelson-Ibragimov-Sudakov) concentration for Gaussian processes: a sub-Gaussian tail for $\sup_{t \in T} X_t$ around its expectation, with variance proxy $\sigma_T^2 = \sup_{t \in T} \operatorname{Var}(X_t)$. Compare to Azuma-Hoeffding for empirical processes; explain why both give the same $e^{-c t^2}$ rate but Borell-TIS has a sharper constant.
References
Canonical:
- Adler & Taylor (2007), Random Fields and Geometry, Chapters 1-2 (covariance kernels, sample-path regularity, Borell-TIS)
- Karatzas & Shreve (1991), Brownian Motion and Stochastic Calculus, Chapter 2 (Kolmogorov continuity, Brownian motion construction)
- Rasmussen & Williams (2006), Gaussian Processes for Machine Learning, Chapters 2 and 4 (GP regression, kernel zoo)
- Durrett (2019), Probability: Theory and Examples (5th ed.), Chapters 4-7 (martingales, Brownian motion)
- van Handel (2016), Probability in High Dimension, Chapters 5-7 (Borell-TIS, Sudakov-Fernique, Gaussian processes for empirical processes)
Current:
- Lee, Bahri, Novak, Schoenholz, Pennington, Sohl-Dickstein (2018), "Deep Neural Networks as Gaussian Processes," ICLR, arXiv:1711.00165
- Matthews, Hron, Rowland, Turner, Ghahramani (2018), "Gaussian Process Behaviour in Wide Deep Neural Networks," ICLR, arXiv:1804.11271
- Jacot, Gabriel, Hongler (2018), "Neural Tangent Kernel: Convergence and Generalization in Neural Networks," NeurIPS, arXiv:1806.07572
- Cho & Saul (2009), "Kernel Methods for Deep Learning," NeurIPS (arc-cosine kernels)
Frontier:
- Yang (2019-2021), "Tensor Programs" series (I-V), arXiv:1910.12478, 2006.14548, 2009.10685 — modern formulation of NNGP/NTK for arbitrary architectures including transformers
- Roberts, Yaida, Hanin (2022), The Principles of Deep Learning Theory — finite-width corrections to NNGP
Next Topics
- Empirical processes and chaining: bounding suprema via covering numbers
- Neural tangent kernel: training dynamics of infinite-width networks
- Martingale theory: Azuma-Hoeffding, Doob, optional stopping
- Brownian motion: the canonical Gaussian process and its path properties
Last reviewed: May 5, 2026