
ML Methods

Universal Approximation Theorem

A single hidden layer neural network can approximate any continuous function on a compact set to arbitrary accuracy. Why this is both important and misleading: it says nothing about width, weight-finding, or generalization.


Why This Matters

The universal approximation theorem (UAT) answers a foundational question: can neural networks represent the functions we need them to represent? The answer is yes, under mild conditions. A single hidden layer with enough neurons can approximate any continuous function on a compact set.

[Figure: a target f(x) on [-2, 2] approximated with 3, 10, and 50 neurons. More neurons give a better approximation; with enough width, any continuous function can be approximated.]

But the theorem is frequently misunderstood. It is an existence result, not a constructive one. It guarantees that the right weights exist, not that gradient descent can find them. It guarantees that some width suffices, not that the required width is practical. And it says nothing about generalization. A network that perfectly approximates a function on training data may fail completely on new data.
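The constructive side of this picture fits in a few lines: place sharp sigmoid steps along the interval and use the target's increments as output weights, producing a staircase whose sup-norm error shrinks as the width grows. The sketch below uses numpy; the target function, interval, and steepness `k` are illustrative choices, not part of the theorem.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def staircase_approx(g, x, n_neurons, k=200.0):
    """Approximate g on [-2, 2] with n_neurons sharp sigmoid steps.

    Step j sits at c_j with output weight g(c_j) - g(c_{j-1}), so the
    partial sums reproduce g's values as a staircase.
    """
    c = np.linspace(-2.2, 2.0, n_neurons)     # step locations
    heights = np.diff(g(c), prepend=0.0)      # output weights alpha_j
    return sigmoid(k * (x[:, None] - c)) @ heights

x = np.linspace(-2, 2, 1001)
errs = {n: np.max(np.abs(np.sin(x) - staircase_approx(np.sin, x, n)))
        for n in (5, 50)}
```

With 5 neurons the staircase is visibly coarse; with 50 the sup-norm error drops roughly in proportion to the grid spacing, exactly the "more width, better approximation" behavior the theorem promises.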

Formal Statement

Definition

Feedforward Network with One Hidden Layer

A function $f: \mathbb{R}^d \to \mathbb{R}$ of the form:

$$f(x) = \sum_{j=1}^{N} \alpha_j \, \sigma(w_j^T x + b_j)$$

where $\sigma$ is the activation function, $w_j \in \mathbb{R}^d$ are weight vectors, $b_j \in \mathbb{R}$ are biases, $\alpha_j \in \mathbb{R}$ are output weights, and $N$ is the number of hidden neurons (the width).
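In code, this definition is a single matrix expression. The sketch below (numpy; function and argument names are illustrative) evaluates $f$ for a batch of inputs.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def one_hidden_layer(x, alpha, W, b, activation=sigmoid):
    """f(x) = sum_j alpha_j * activation(w_j . x + b_j).

    x: (n, d) batch of inputs, W: (N, d) weight vectors,
    b: (N,) biases, alpha: (N,) output weights -> returns shape (n,).
    """
    return activation(x @ W.T + b) @ alpha

out = one_hidden_layer(np.zeros((1, 2)), np.ones(3),
                       np.zeros((3, 2)), np.zeros(3))
```

As a sanity check: with all-zero weights and biases, each sigmoid outputs 0.5, so the network computes $0.5 \sum_j \alpha_j$ (here 1.5).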

Theorem

Universal Approximation (Cybenko, 1989)

Statement

Let $\sigma$ be any continuous sigmoidal function (i.e., $\sigma(t) \to 1$ as $t \to +\infty$ and $\sigma(t) \to 0$ as $t \to -\infty$). For any continuous function $g: K \to \mathbb{R}$ on a compact set $K \subset \mathbb{R}^d$ and any $\epsilon > 0$, there exist $N \in \mathbb{N}$ and parameters $\{\alpha_j, w_j, b_j\}_{j=1}^N$ such that:

$$\sup_{x \in K} \left| g(x) - \sum_{j=1}^{N} \alpha_j \, \sigma(w_j^T x + b_j) \right| < \epsilon$$

Intuition

Each hidden neuron $\sigma(w_j^T x + b_j)$ implements a soft step function in the direction $w_j$. By combining many step functions at different orientations, positions, and scales, you can approximate any continuous function to arbitrary precision. This is similar to how Fourier series approximate functions using sums of sines and cosines, but with learned basis functions.
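Concretely, subtracting two soft steps gives an approximate indicator ("bump") on an interval, and sums of bumps build up arbitrary continuous functions. A two-neuron sketch (the steepness `k` is an illustrative choice):

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def bump(x, a, b, k=100.0):
    # two hidden neurons: a soft step up at a minus a soft step up at b;
    # for large k this approximates the indicator function of [a, b]
    return sigmoid(k * (x - a)) - sigmoid(k * (x - b))
```

For `k=100`, `bump(0.5, 0.0, 1.0)` is essentially 1 and `bump(5.0, 0.0, 1.0)` essentially 0, so a weighted sum of bumps can match any continuous target on a fine enough grid.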

Proof Sketch

Cybenko's proof uses the Hahn-Banach theorem. Suppose the span of $\{\sigma(w^T x + b) : w, b\}$ is not dense in $C(K)$. Then by Hahn-Banach (and the Riesz representation of functionals on $C(K)$), there exists a nonzero signed measure $\mu$ on $K$ such that $\int \sigma(w^T x + b) \, d\mu(x) = 0$ for all $w, b$. The sigmoidal property of $\sigma$ then forces $\mu = 0$, a contradiction. Therefore the span is dense.

Why It Matters

The UAT justifies using neural networks as a flexible function class. Before this result, it was unclear whether neural networks could represent the functions arising in real-world tasks. The theorem says they can, at least in principle, even with a single hidden layer.

Failure Mode

The theorem does not bound $N$ (the required width). For some functions, $N$ may need to be exponentially large in the input dimension $d$. The theorem does not address optimization (can gradient descent find the right weights?) or generalization (will the approximation work on unseen inputs?). A network with $N$ parameters fit to $n < N$ data points can perfectly interpolate training data while computing arbitrary nonsense elsewhere.
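The interpolation danger is easy to exhibit: a three-neuron ReLU "spike" vanishes at every training input yet takes an arbitrary value between them, so adding it to any fitted network leaves the training loss unchanged. A toy construction (the training grid, gap location, and height below are made up for illustration):

```python
import numpy as np

def relu(t):
    return np.maximum(t, 0.0)

def spike(x, c, eps=0.01, height=1000.0):
    """A three-neuron ReLU 'hat' supported on (c - eps, c + eps):
    zero outside that interval, equal to `height` at x = c."""
    h = height / eps
    return h * (relu(x - (c - eps)) - 2 * relu(x - c) + relu(x - (c + eps)))

# hypothetical training inputs on a grid with spacing 0.5;
# the spike is placed between two of them, so the training set never sees it
x_train = np.linspace(-2, 2, 9)
c = 0.25
```

The spike is identically zero on `x_train` but equals 1000 at `x = 0.25`: zero training error, arbitrarily bad behavior in between.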

Hornik's Generalization

Hornik (1991) extended Cybenko's result beyond sigmoidal activations.

Definition

Hornik's Condition

The universal approximation property holds for any non-constant, bounded, continuous activation function $\sigma$. Hornik further showed that measurable sigmoidal functions suffice (continuity of $\sigma$ is not required). Later work (Leshno et al., 1993) showed that universal approximation holds for any activation that is not a polynomial.

This means ReLU, tanh, sigmoid, ELU, and every activation function used in practice satisfy the UAT. The one exception: a single hidden layer with a polynomial activation of degree $k$ can only represent polynomials of degree at most $k$.
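The polynomial exception can be checked numerically: with a squared activation, the whole network collapses to a quadratic in the input, which an exact degree-2 polynomial fit recovers. A sketch with randomly drawn illustrative parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 8
W, b, alpha = rng.normal(size=N), rng.normal(size=N), rng.normal(size=N)

x = np.linspace(-3, 3, 50)
# one hidden layer with activation sigma(t) = t^2: each neuron
# (W_j x + b_j)^2 is quadratic in x, so the weighted sum is again
# a polynomial of degree at most 2 -- no matter how large N is
f = ((np.outer(x, W) + b) ** 2) @ alpha

coeffs = np.polyfit(x, f, deg=2)  # least-squares degree-2 fit
```

The degree-2 fit reproduces the network's output to machine precision, confirming that width buys nothing here: the function class is stuck at degree 2.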

What the Theorem Does NOT Say

1. How many neurons you need. The theorem guarantees existence of $N$ but gives no bound. For approximating a function with $d$-dimensional input to accuracy $\epsilon$, worst-case bounds are exponential: $N = \Omega(1/\epsilon^d)$. This is the curse of dimensionality for approximation.

2. How to find the weights. The proof is non-constructive. It does not provide an algorithm for computing $\alpha_j, w_j, b_j$. In practice, we use gradient descent, which may get stuck in bad local minima or saddle points (though this is less of a problem in practice than theory predicts).

3. Whether the network generalizes. Approximating a function on a compact set to accuracy $\epsilon$ says nothing about how the network behaves on inputs outside $K$ or on unseen inputs within $K$. Generalization requires controlling the hypothesis class complexity (via VC dimension, Rademacher complexity, etc.), which is a separate question.
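For point 1, a back-of-the-envelope count shows where the exponential comes from: resolving the unit cube at a fixed resolution per axis requires a number of grid cells exponential in $d$, and a bump-per-cell construction needs a neuron group per cell. The numbers below are illustrative:

```python
d = 10
per_axis = 10          # grid points per axis, i.e. resolution epsilon = 0.1
cells = per_axis ** d  # cells needed to tile [0, 1]^10 at that resolution
print(cells)           # ten billion cells, hence ~10^10 neuron groups
```

Already at $d = 10$ and a coarse resolution, a cell-by-cell construction needs on the order of $10^{10}$ neurons, which is the $1/\epsilon^d$ scaling in the worst-case bound.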

Width vs Depth

Theorem

Depth Separation

Statement

There exist families of functions that can be represented exactly by a ReLU network of depth $k$ and width $O(d)$, but require width $\Omega(2^{k-1})$ if restricted to depth 2 (a single hidden layer). Deeper networks can be exponentially more parameter-efficient than shallow ones.

Intuition

Depth enables composition. A function that is naturally expressed as $f_k \circ f_{k-1} \circ \cdots \circ f_1$ can be represented with $O(kd)$ parameters using $k$ layers, each implementing one $f_i$. A single hidden layer must "flatten" this composition, requiring exponentially many neurons to represent all the intermediate computations simultaneously.

Proof Sketch

Telgarsky (2016) constructed explicit triangle-wave functions that can be represented by depth-$k$ ReLU networks with $O(k)$ neurons but require $\Omega(2^{k/2})$ neurons in any depth-2 network. The argument uses the observation that composing a tent map with itself doubles the number of linear regions, so a depth-$k$ network can have $2^k$ linear regions, while each neuron in a single hidden layer adds at most one breakpoint, so a shallow network needs exponentially many neurons to match.
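Telgarsky's construction is short enough to write down: a two-ReLU tent map composed with itself $k$ times oscillates between 0 and 1 at the dyadic points, giving $2^k$ linear pieces from only $2k$ neurons. A sketch on $[0, 1]$:

```python
import numpy as np

def relu(t):
    return np.maximum(t, 0.0)

def tent(x):
    # two ReLU neurons: 2x on [0, 1/2], 2(1 - x) on [1/2, 1]
    return 2 * relu(x) - 4 * relu(x - 0.5)

def deep_tent(x, k):
    # k-fold composition: one tent per layer, 2k neurons in total
    for _ in range(k):
        x = tent(x)
    return x

# deep_tent(., 3) hits 0 and 1 alternately at the dyadic points i/8,
# so it has 2^3 = 8 linear pieces
xs = np.arange(9) / 8
vals = deep_tent(xs, 3)
```

Six neurons in depth 3 produce 8 linear pieces; matching that with one hidden layer requires a number of neurons on the order of the number of pieces, and the gap widens exponentially as $k$ grows.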

Why It Matters

This helps explain why deep networks outperform shallow wide networks in practice. Real-world functions (image classification, language modeling) are compositional in nature: edges compose into textures, textures into parts, parts into objects. Deep networks exploit this compositional structure; shallow networks cannot.

Failure Mode

Depth separation results are worst-case. For some function classes (e.g., smooth functions with bounded derivatives), shallow networks approximate almost as efficiently as deep ones. The practical advantage of depth depends on whether the target function has compositional structure.

Common Confusions

Watch Out

Universal approximation does not mean neural networks are the best approximators

Polynomials, Fourier series, wavelets, and splines are also universal approximators for continuous functions on compact sets (by the Stone-Weierstrass theorem or its variants). The UAT for neural networks is not unique. What makes neural networks special is their practical trainability via gradient descent and their efficiency for functions with compositional structure, not the approximation theorem itself.

Watch Out

ReLU networks are universal approximators even though ReLU is not bounded

Cybenko's original result required bounded (sigmoidal) activations. But ReLU networks are also universal approximators. The proof is different: ReLU networks can represent piecewise linear functions, and any continuous function on a compact set can be approximated uniformly by piecewise linear functions. The requirement is not boundedness of $\sigma$ but non-polynomiality (Leshno et al., 1993).
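The ReLU argument is constructive: any continuous $g$ can be matched at grid points by a one-hidden-layer ReLU network whose output weights encode the slope changes of the piecewise linear interpolant. A sketch (knot placement and target are illustrative):

```python
import numpy as np

def relu(t):
    return np.maximum(t, 0.0)

def pwl_relu_interpolant(g, knots):
    """One-hidden-layer ReLU net equal to the piecewise linear
    interpolant of g at the knots (on [knots[0], knots[-1]])."""
    y = g(knots)
    slopes = np.diff(y) / np.diff(knots)
    coeffs = np.diff(slopes, prepend=0.0)  # slope change at each knot
    return lambda x: y[0] + relu(x[:, None] - knots[:-1]) @ coeffs

knots = np.linspace(-2, 2, 21)
f = pwl_relu_interpolant(np.sin, knots)
x = np.linspace(-2, 2, 1001)
max_err = np.max(np.abs(f(x) - np.sin(x)))
```

The network matches $\sin$ exactly at every knot, and the uniform error between knots shrinks like the squared grid spacing, which is the piecewise linear approximation step in the ReLU universality proof.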

Watch Out

More parameters does not mean better approximation on test data

A network with $10^6$ parameters can approximate the training data perfectly but fail on test data. The UAT is about approximation capacity (bias), not generalization (variance). Controlling generalization requires regularization, early stopping, or structural constraints on the network.

Exercises

ExerciseCore

Problem

Consider a single hidden layer ReLU network $f(x) = \sum_{j=1}^N \alpha_j \max(w_j x + b_j, 0)$ for $x \in \mathbb{R}$. How many linear "pieces" can this function have? What is the maximum number of breakpoints?

ExerciseAdvanced

Problem

A depth-$k$ ReLU network with $n$ neurons per layer can have up to $O(n^k)$ linear regions. A depth-2 network with $W$ neurons has at most $O(W)$ regions. If the target function has $R$ linear regions, express the minimum width $W$ for depth-2 and the minimum total neurons for depth-$k$ in terms of $R$.

References

Canonical:

  • Cybenko, "Approximation by Superpositions of a Sigmoidal Function" (1989)
  • Hornik, "Approximation Capabilities of Multilayer Feedforward Networks" (1991)

Current:

  • Telgarsky, "Benefits of Depth in Neural Networks" (COLT 2016)

  • Leshno et al., "Multilayer Feedforward Networks with a Nonpolynomial Activation Function Can Approximate Any Function" (Neural Networks, 1993)

  • Bishop, Pattern Recognition and Machine Learning (2006), Chapters 1-14

Last reviewed: April 2026
