
Comparison

Pointwise vs. Uniform Convergence

Pointwise convergence allows the rate of convergence to differ from point to point. Uniform convergence demands a single rate that works everywhere. Learning theory needs uniform convergence because ERM's guarantee must hold simultaneously for every hypothesis, including the data-dependent one it selects.

What Each Measures

Both describe how a sequence of functions $f_n$ approaches a limit function $f$. They differ in whether the convergence rate is allowed to depend on the input point.

Pointwise convergence: for each fixed $x$, $f_n(x) \to f(x)$ as $n \to \infty$. The speed of convergence can vary across different $x$.

Uniform convergence: $f_n(x) \to f(x)$ at the same rate for all $x$ simultaneously. The convergence is controlled by $\sup_x |f_n(x) - f(x)|$.

Side-by-Side Statement

Definition

Pointwise Convergence

A sequence $f_n: \mathcal{X} \to \mathbb{R}$ converges pointwise to $f$ if:

$$\forall x \in \mathcal{X},\ \forall \epsilon > 0,\ \exists N(x, \epsilon): \quad n \geq N \Rightarrow |f_n(x) - f(x)| < \epsilon$$

The threshold $N$ is allowed to depend on $x$.

Definition

Uniform Convergence

A sequence $f_n: \mathcal{X} \to \mathbb{R}$ converges uniformly to $f$ if:

$$\forall \epsilon > 0,\ \exists N(\epsilon): \quad n \geq N \Rightarrow \sup_{x \in \mathcal{X}} |f_n(x) - f(x)| < \epsilon$$

The threshold $N$ depends only on $\epsilon$, not on $x$.

The quantifier order is the key difference. Pointwise: "for all $x$, there exists $N$" (different $N$ per $x$). Uniform: "there exists $N$ such that for all $x$" (one $N$ works everywhere).
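A quick numerical check makes the quantifier gap concrete. For $f_n(x) = x^n$ with limit $0$ on $[0, 1)$, the smallest admissible threshold is $N(x, \epsilon) = \lceil \log \epsilon / \log x \rceil$, which blows up as $x \to 1$, so no single $N(\epsilon)$ covers every point (a minimal sketch; the helper name is ours):

```python
import math

# For f_n(x) = x**n -> 0 on [0, 1), the smallest N with x**N < eps is
# N(x, eps) = ceil(log(eps) / log(x)).  (Illustrative helper, not from the text.)
def pointwise_threshold(x, eps):
    return math.ceil(math.log(eps) / math.log(x))

eps = 0.01
for x in [0.5, 0.9, 0.99, 0.999]:
    print(x, pointwise_threshold(x, eps))
# The threshold grows without bound as x -> 1: convergence holds at every
# fixed x, but no single N(eps) works for all of [0, 1) at once.
```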

Where Each Is Stronger

Pointwise convergence is easier to establish

Any uniformly convergent sequence is pointwise convergent, but not vice versa. Pointwise convergence only requires checking each $x$ individually.

Uniform convergence preserves more structure

Uniform convergence preserves continuity: if each $f_n$ is continuous and $f_n \to f$ uniformly, then $f$ is continuous. Pointwise convergence does not guarantee this. Uniform convergence also justifies swapping limits with integrals on bounded intervals, and with derivatives when the derivatives themselves converge uniformly.

The Classic Counterexample

Consider $f_n(x) = x^n$ on $[0, 1]$. For each $x \in [0, 1)$, $x^n \to 0$. At $x = 1$, $x^n = 1$ for all $n$. So pointwise:

$$f(x) = \begin{cases} 0 & \text{if } x \in [0, 1) \\ 1 & \text{if } x = 1 \end{cases}$$

Each $f_n$ is continuous, but the pointwise limit $f$ is discontinuous. The convergence is not uniform: $\sup_{x \in [0,1]} |x^n - f(x)| = \sup_{x \in [0,1)} x^n$, and this supremum is $1$ for all $n$ (take $x$ close to $1$). So the uniform distance never shrinks, even though pointwise convergence holds everywhere.
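The failure can be checked numerically: at any fixed $x < 1$ the values shrink, yet for every $n$ there is a witness point $x_n = 0.5^{1/n}$ at which $f_n$ still equals $0.5$ (a small sketch of the argument above):

```python
# Pointwise: at the fixed point x = 0.9, x**n -> 0.
for n in [1, 10, 100]:
    print(0.9 ** n)

# Not uniform: for every n, the witness x_n = 0.5**(1/n) lies in [0, 1)
# and satisfies f_n(x_n) = 0.5, so sup_x |f_n(x) - f(x)| >= 0.5 for all n.
for n in [1, 10, 100, 10000]:
    x_n = 0.5 ** (1 / n)
    print(n, round(x_n, 6), x_n ** n)  # x_n creeps toward 1; f_n(x_n) stays 0.5
```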

Why This Matters for Learning Theory

In learning theory, the connection to ERM makes this distinction critical. Consider:

$$\hat{R}_n(h) = \frac{1}{n}\sum_{i=1}^n \ell(h(x_i), y_i), \qquad R(h) = \mathbb{E}[\ell(h(x), y)]$$

By the law of large numbers, for each fixed $h$, $\hat{R}_n(h) \to R(h)$ as $n \to \infty$. This is pointwise convergence over the hypothesis class $\mathcal{H}$ (think of each $h$ as a "point").

But ERM selects $\hat{h} = \arg\min_{h \in \mathcal{H}} \hat{R}_n(h)$, which depends on $n$. To guarantee that $R(\hat{h})$ is small, we need:

$$\sup_{h \in \mathcal{H}} |R(h) - \hat{R}_n(h)| \to 0$$

This is uniform convergence over $\mathcal{H}$. Without it, ERM can select a hypothesis that happens to have low empirical risk by luck (overfitting), with population risk far from the empirical risk.
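A small simulation illustrates the overfitting mechanism. It uses a pure-noise setup in which every hypothesis has true risk exactly $0.5$, so any low empirical risk is luck; with many hypotheses and few samples, ERM finds one anyway (a toy sketch, not a real learning task):

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_hyps = 50, 1000

# Pure noise: labels are random, and each "hypothesis" predicts at random,
# so the population risk of every hypothesis is exactly 0.5.
labels = rng.integers(0, 2, size=n_samples)
preds = rng.integers(0, 2, size=(n_hyps, n_samples))

emp_risk = (preds != labels).mean(axis=1)   # R_hat_n(h) for each h
h_hat = emp_risk.argmin()                   # ERM's pick

print("ERM empirical risk:", emp_risk[h_hat])            # well below 0.5 by luck
print("true risk of ERM pick:", 0.5)                     # no hypothesis is better
print("sup_h |R - R_hat|:", np.abs(emp_risk - 0.5).max())
```

Pointwise, each individual hypothesis's empirical risk concentrates near $0.5$; it is the supremum over all $1000$ of them that stays large, and that supremum is exactly what ERM exploits.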

Where Each Fails

Pointwise convergence fails for optimization

If I know that $f_n(x) \to f(x)$ pointwise and I minimize $f_n$, the minimizer of $f_n$ need not converge to the minimizer of $f$. The minimizer can "chase" the points where convergence is slowest. This is exactly the overfitting phenomenon in ERM.

Uniform convergence can be too strong

For infinite hypothesis classes like neural networks, uniform convergence bounds (VC dimension, Rademacher complexity) can be vacuously large. Modern deep learning generalizes despite the failure of uniform convergence bounds to provide useful guarantees. This has led to research on alternatives: algorithmic stability, PAC-Bayes bounds, and compression-based arguments that do not require uniform convergence.

Key Assumptions That Differ

| | Pointwise | Uniform |
| --- | --- | --- |
| Rate dependence | Can vary with $x$ | Same for all $x$ |
| Preserves continuity | No | Yes |
| Allows limit-integral swap | Not in general | Yes (on bounded intervals) |
| Suffices for ERM | No | Yes |
| Complexity measure needed | None | VC dimension, Rademacher complexity, covering numbers |

When a Researcher Would Use Each

Example

Consistency of an estimator at a fixed parameter

To prove that $\hat{\theta}_n \to \theta^*$ in probability for a fixed true parameter $\theta^*$, pointwise convergence suffices. This is the first step in the standard consistency argument for MLE: show that the average log-likelihood converges pointwise to its expectation.

Example

Proving generalization bounds for ERM

To bound $R(\hat{h}_{\text{ERM}}) - \min_{h} R(h)$, you need $\sup_h |R(h) - \hat{R}_n(h)| \to 0$, which is uniform convergence. The rate of this convergence depends on the complexity of $\mathcal{H}$.

Example

M-estimation and argmax continuity

When proving that the maximizer of an empirical criterion converges to the maximizer of the population criterion, the standard approach uses uniform convergence of the criterion function. Pointwise convergence of the criterion does not suffice because the argmax is a discontinuous functional.

Common Confusions

Watch Out

Uniform convergence of empirical risk is about the hypothesis class, not the data

The supremum $\sup_{h \in \mathcal{H}} |R(h) - \hat{R}_n(h)|$ is over hypotheses, not data points. A finite hypothesis class always has uniform convergence (by a union bound). An infinite class may or may not, depending on its complexity.
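For a finite class, the union-bound argument is quantitative. Hoeffding's inequality plus a union bound gives $P(\sup_h |R(h) - \hat{R}_n(h)| > \epsilon) \le 2|\mathcal{H}| e^{-2n\epsilon^2}$ for losses in $[0, 1]$, and solving for $n$ shows that sample complexity grows only logarithmically in $|\mathcal{H}|$ (a standard calculation, sketched here):

```python
import math

# Samples needed so that sup_h |R(h) - R_hat_n(h)| <= eps holds with
# probability at least 1 - delta, for a finite class and losses in [0, 1]:
#   n >= log(2 |H| / delta) / (2 eps**2)
def samples_needed(class_size, eps, delta):
    return math.ceil(math.log(2 * class_size / delta) / (2 * eps ** 2))

print(samples_needed(1_000, 0.05, 0.05))      # |H| = 10^3
print(samples_needed(1_000_000, 0.05, 0.05))  # |H| = 10^6: only ~1.7x more
```

Growing the class a thousandfold barely changes the required sample size, which is why the real difficulty lies with infinite classes, where $\log |\mathcal{H}|$ must be replaced by VC dimension or Rademacher complexity.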

Watch Out

Pointwise convergence plus compactness does not give uniform convergence

A common error: "the hypothesis class is compact, so pointwise convergence implies uniform convergence." This is false in general. You need equicontinuity (Arzelà-Ascoli) or similar conditions. For function sequences, Dini's theorem gives uniform convergence on compact sets if the convergence is monotone and the limit is continuous, but these conditions do not always hold.
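The $x^n$ example shows both sides of Dini's theorem. On $[0, 1]$ the domain is compact and the convergence is monotone, but the limit is discontinuous, so Dini does not apply and convergence is not uniform; on $[0, 0.9]$ the limit is the continuous zero function, Dini applies, and $\sup_x |f_n(x)| = 0.9^n \to 0$ (a quick numeric check):

```python
# On [0, 1]: compact domain, monotone convergence, but discontinuous limit.
# A witness point keeps the sup at 0.5 or above for every n.
# On [0, 0.9]: all of Dini's hypotheses hold, and the sup 0.9**n vanishes.
for n in [10, 100, 500]:
    sup_full = (0.5 ** (1 / n)) ** n   # value 0.5 at the witness in [0, 1)
    sup_restricted = 0.9 ** n          # sup of x**n over [0, 0.9]
    print(n, sup_full, sup_restricted)
```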

What to Memorize

  1. Pointwise: $\forall x, \forall \epsilon, \exists N(x, \epsilon)$. Different $N$ per point.
  2. Uniform: $\forall \epsilon, \exists N(\epsilon), \forall x$. One $N$ for all points.
  3. ERM needs uniform convergence over the hypothesis class, not just pointwise.
  4. Classic counterexample: $f_n(x) = x^n$ on $[0, 1]$ converges pointwise but not uniformly.
  5. Learning theory implication: the complexity of $\mathcal{H}$ (VC dimension, Rademacher complexity) controls the rate of uniform convergence and therefore the sample complexity of ERM.