
Foundations

Sequences and Series of Functions

Pointwise vs uniform convergence of function sequences, the Weierstrass M-test, and why uniform convergence preserves continuity. The concept that makes learning theory work.


Why This Matters

Uniform convergence is the single most important concept connecting classical analysis to learning theory. The reason ERM (empirical risk minimization) works is that empirical risk converges uniformly to population risk over a hypothesis class. If the convergence were only pointwise, ERM would have no generalization guarantees. Understanding the distinction between pointwise and uniform convergence is a prerequisite for understanding any generalization bound.

Pointwise Convergence

Definition

Pointwise Convergence

A sequence of functions $f_n: X \to \mathbb{R}$ converges pointwise to $f: X \to \mathbb{R}$ if for every $x \in X$:

$$\lim_{n \to \infty} f_n(x) = f(x)$$

Equivalently: for every $x \in X$ and every $\epsilon > 0$, there exists $N = N(x, \epsilon)$ such that $|f_n(x) - f(x)| < \epsilon$ for all $n \geq N$.

The critical detail: $N$ may depend on $x$. Different points may require different numbers of terms to get close to the limit.

Example

Pointwise but not uniform convergence

Let $f_n(x) = x^n$ on $[0, 1]$. For each $x \in [0, 1)$, $f_n(x) \to 0$. At $x = 1$, $f_n(1) = 1$ for all $n$. So the pointwise limit is:

$$f(x) = \begin{cases} 0 & \text{if } x \in [0, 1) \\ 1 & \text{if } x = 1 \end{cases}$$

Each $f_n$ is continuous, but the pointwise limit $f$ is discontinuous. This shows that pointwise convergence does not preserve continuity.
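To see the failure of uniformity concretely, here is a small numerical sketch (the script and the witness point $x_n = 2^{-1/n}$ are illustrative choices, not from the text). At any fixed $x < 1$ the values $x^n$ collapse quickly, but for every $n$ there is a point where $f_n$ still equals $1/2$, so the sup-distance to the limit never shrinks:

```python
# Illustrative check (not from the source): f_n(x) = x**n on [0, 1).
# Pointwise: at any fixed x < 1, x**n -> 0.
# But for each n, the point x_n = 2**(-1/n) < 1 satisfies f_n(x_n) = 1/2,
# so sup_{x in [0,1)} |f_n(x) - 0| >= 1/2 for every n: not uniform.
for n in [1, 10, 100, 10_000]:
    x_fixed = 0.9
    x_n = 2.0 ** (-1.0 / n)  # witness point, moves toward 1 as n grows
    print(f"n={n:6d}  f_n(0.9)={x_fixed**n:.3e}  f_n(x_n)={x_n**n:.3f}")
```

The point $x_n$ depends on $n$, which is exactly what a pointwise statement (fixed $x$, then let $n \to \infty$) never examines.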

Uniform Convergence

Definition

Uniform Convergence

A sequence $f_n: X \to \mathbb{R}$ converges uniformly to $f: X \to \mathbb{R}$ if:

$$\sup_{x \in X} |f_n(x) - f(x)| \to 0 \text{ as } n \to \infty$$

Equivalently: for every $\epsilon > 0$, there exists $N = N(\epsilon)$ (independent of $x$) such that $|f_n(x) - f(x)| < \epsilon$ for all $n \geq N$ and all $x \in X$.

The difference from pointwise convergence: in uniform convergence, a single $N$ works for all $x$ simultaneously. This "uniformity over $x$" is what gives the concept its power.

Main Theorems

Theorem

Uniform Limit of Continuous Functions is Continuous

Statement

If $f_n: X \to \mathbb{R}$ is continuous for each $n$ and $f_n \to f$ uniformly on $X$, then $f$ is continuous on $X$.

Intuition

Uniform convergence means the entire graph of $f_n$ lies within an $\epsilon$-tube around $f$ for large $n$. Since $f_n$ is continuous (no jumps) and is close to $f$ everywhere simultaneously, $f$ cannot have jumps either.

Proof Sketch

Fix $x_0 \in X$ and $\epsilon > 0$. By uniform convergence, choose $N$ so that $|f_N(x) - f(x)| < \epsilon/3$ for all $x$. By continuity of $f_N$, choose $\delta$ so that $|f_N(x) - f_N(x_0)| < \epsilon/3$ when $|x - x_0| < \delta$. Then by the triangle inequality:

$$|f(x) - f(x_0)| \leq |f(x) - f_N(x)| + |f_N(x) - f_N(x_0)| + |f_N(x_0) - f(x_0)| < \epsilon$$

Why It Matters

This theorem explains why uniform convergence is the right notion for learning theory. When empirical risk converges uniformly to population risk, the "landscape" of risk values is preserved: if a hypothesis has low empirical risk, it must have low population risk. Pointwise convergence would not give this guarantee.

Failure Mode

The theorem fails for pointwise convergence. The example $f_n(x) = x^n$ on $[0, 1]$ gives a discontinuous limit from continuous functions. In learning theory terms: if empirical risk converges to population risk only pointwise (for each fixed hypothesis), the ERM hypothesis could still have high population risk.

The Weierstrass M-Test

Theorem

Weierstrass M-Test

Statement

Let $g_k: X \to \mathbb{R}$ satisfy $|g_k(x)| \leq M_k$ for all $x \in X$, where $\sum_{k=1}^{\infty} M_k < \infty$. Then the series $\sum_{k=1}^{\infty} g_k(x)$ converges uniformly and absolutely on $X$.

Intuition

If you can bound each term by a constant that does not depend on $x$, and these constants form a convergent series, then the function series converges uniformly. Domination by the constant series $M_k$ controls the worst case over all $x$ simultaneously.

Proof Sketch

For any $x$, $\left|\sum_{k=n}^{m} g_k(x)\right| \leq \sum_{k=n}^{m} M_k$. Since $\sum M_k$ converges, its tail $\sum_{k=n}^{\infty} M_k \to 0$. Therefore $\sup_x \left|\sum_{k=n}^{m} g_k(x)\right| \to 0$ as $n, m \to \infty$, so the partial sums are uniformly Cauchy and hence converge uniformly by completeness of $\mathbb{R}$.
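The tail bound in the proof sketch can be watched numerically. The example series below, $g_k(x) = \sin(kx)/k^2$ with $M_k = 1/k^2$, is a standard illustration chosen here (it is not from the text); the observed sup-error of a partial sum never exceeds the M-test tail:

```python
import math

# M-test sketch (example series chosen here, not from the source):
# g_k(x) = sin(k x)/k**2 satisfies |g_k(x)| <= M_k = 1/k**2 for all x,
# and sum M_k = pi**2/6 < inf. The tail sum_{k>n} M_k bounds the uniform
# error of the n-th partial sum at every x simultaneously.

def partial_sum(x, n):
    return sum(math.sin(k * x) / k**2 for k in range(1, n + 1))

n = 50
tail_bound = sum(1.0 / k**2 for k in range(n + 1, 100_000))  # approx. tail

xs = [0.1 * i for i in range(63)]  # grid over [0, 6.2]
# Use a much longer partial sum as a stand-in for the limit function.
worst = max(abs(partial_sum(x, 5000) - partial_sum(x, n)) for x in xs)
print(f"observed sup-error ~ {worst:.2e}  <=  M-test tail bound ~ {tail_bound:.2e}")
```

The same bound holds at every $x$ at once, which is precisely the "uniformity" the test delivers.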

Why It Matters

The M-test is the standard tool for proving that series of functions (e.g., Fourier series, power series, series expansions of kernels) converge uniformly. In ML, it appears when establishing that certain function approximations converge uniformly, which is needed for generalization guarantees.

Failure Mode

The M-test is sufficient but not necessary: a series may converge uniformly even when no dominating summable sequence $M_k$ exists. The M-test also requires pointwise bounds that are independent of $x$; if the bounds grow with $x$ (e.g., on an unbounded domain), the test does not apply directly.

Connection to Learning Theory

In statistical learning theory, define:

  • $R(h) = \mathbb{E}[\ell(h(x), y)]$ (population risk)
  • $\hat{R}_n(h) = \frac{1}{n} \sum_{i=1}^n \ell(h(x_i), y_i)$ (empirical risk)

The function $h \mapsto \hat{R}_n(h)$ is a random function that approximates $h \mapsto R(h)$.

Pointwise convergence: for each fixed $h$, $\hat{R}_n(h) \to R(h)$ by the law of large numbers. This is not useful for ERM because the ERM hypothesis $h_{\text{ERM}}$ depends on the data.

Uniform convergence: $\sup_{h \in \mathcal{H}} |\hat{R}_n(h) - R(h)| \to 0$. This guarantees that the ERM hypothesis has population risk close to its empirical risk, which is the foundation of generalization bounds.

The entire program of VC theory, Rademacher complexity, and covering numbers is devoted to establishing conditions under which this uniform convergence holds.
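A toy simulation makes the uniform law visible (the data model, threshold class, and grid below are illustrative choices, not from the text). With $x \sim \mathrm{Uniform}[0,1]$, $y = \mathbf{1}\{x > 0.5\}$, and hypotheses $h_t(x) = \mathbf{1}\{x > t\}$, the population risk is $|t - 0.5|$, so the supremum gap over the class can be estimated directly:

```python
import random

# Toy uniform-convergence simulation (illustrative, not from the source):
# x ~ Uniform[0,1], y = 1{x > 0.5}; hypotheses h_t(x) = 1{x > t}.
# We estimate sup_t |empirical risk - population risk| and watch it
# shrink as the sample size n grows.

random.seed(0)
thresholds = [i / 100 for i in range(101)]

def population_risk(t):
    # P(h_t(x) != y): the error set is the interval between t and 0.5.
    return abs(t - 0.5)

for n in [100, 1000, 10000]:
    xs = [random.random() for _ in range(n)]
    ys = [1 if x > 0.5 else 0 for x in xs]
    sup_gap = max(
        abs(sum((1 if x > t else 0) != y for x, y in zip(xs, ys)) / n
            - population_risk(t))
        for t in thresholds
    )
    print(f"n={n:6d}  sup_t |R_hat - R| ~ {sup_gap:.4f}")
```

For this simple class the gap shrinks roughly like $\sqrt{\log n / n}$; a richer hypothesis class would need more samples for the same gap, which is exactly what VC theory and Rademacher complexity quantify.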

Common Confusions

Watch Out

Pointwise convergence is not useless, just insufficient for ERM

The law of large numbers gives pointwise convergence of empirical risk to population risk for free. The hard work in learning theory is upgrading this to uniform convergence over the hypothesis class. The gap between pointwise and uniform is exactly the gap between "each hypothesis generalizes" and "the selected hypothesis generalizes."

Watch Out

Uniform convergence is not about rate, it is about uniformity

A sequence can converge uniformly at a slow rate or pointwise at a fast rate. The distinction is not about speed but about whether a single $N$ works for all $x$. In learning theory, the rate matters too (it determines sample complexity), but the uniformity is the conceptual breakthrough.
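Both halves of this point fit in a few lines of arithmetic (the two example sequences are chosen here for illustration, not from the text): $g_n(x) = 1/n$ converges to $0$ uniformly but only at rate $1/n$, while $f_n(x) = x^n$ converges exponentially fast at each fixed $x < 1$ yet never uniformly on $[0, 1)$:

```python
# Rate vs uniformity (example sequences chosen here, not from the source):
# g_n(x) = 1/n  -> 0 uniformly, but slowly: sup_x |g_n(x)| = 1/n exactly.
# f_n(x) = x**n -> 0 exponentially fast at each fixed x < 1, but the
# sup over [0, 1) stays bounded below by 1/2 (witness x_n = 2**(-1/n)).
for n in [10, 100, 1000]:
    sup_g = 1.0 / n                      # exact sup-error of g_n
    fast_pointwise = 0.5 ** n            # f_n at the fixed point x = 0.5
    near_one = (2.0 ** (-1.0 / n)) ** n  # f_n at x_n: equals 1/2 for all n
    print(f"n={n:5d}  sup|g_n|={sup_g:.1e}  "
          f"f_n(0.5)={fast_pointwise:.1e}  f_n(x_n)={near_one:.2f}")
```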

Summary

  • Pointwise: for each $x$, $f_n(x) \to f(x)$. The convergence speed may vary with $x$
  • Uniform: $\sup_x |f_n(x) - f(x)| \to 0$. One $N$ works for all $x$
  • Uniform convergence preserves continuity; pointwise does not
  • The Weierstrass M-test: if $|g_k(x)| \leq M_k$ and $\sum M_k < \infty$, then $\sum g_k$ converges uniformly
  • Uniform convergence of empirical risk to population risk is the foundation of learning theory

Exercises

ExerciseCore

Problem

Let $f_n(x) = \frac{nx}{1 + n^2 x^2}$ on $[0, 1]$. Show that $f_n \to 0$ pointwise. Does $f_n \to 0$ uniformly?

ExerciseAdvanced

Problem

Explain why the law of large numbers alone is insufficient to prove that ERM generalizes. What additional property of the hypothesis class is needed, and how does it relate to uniform convergence?

References

Canonical:

  • Rudin, Principles of Mathematical Analysis, Chapter 7
  • Shalev-Shwartz and Ben-David, Understanding Machine Learning, Chapter 4 (uniform convergence in learning)

Current:

  • Wainwright, High-Dimensional Statistics, Chapter 4 (uniform laws of large numbers)
  • Munkres, Topology (2000), Chapter 1 (set theory review)

Next Topics

  • Empirical risk minimization: where uniform convergence meets learning theory
  • Uniform convergence: the formal learning-theoretic framework

Last reviewed: April 2026