ML Methods
Universal Approximation Theorem
A single hidden layer neural network can approximate any continuous function on a compact set to arbitrary accuracy. Why this is both important and misleading: it says nothing about width, weight-finding, or generalization.
Why This Matters
The universal approximation theorem (UAT) answers a foundational question: can neural networks represent the functions we need them to represent? The answer is yes, under mild conditions. A single hidden layer with enough neurons can approximate any continuous function on a compact set.
But the theorem is frequently misunderstood. It is an existence result, not a constructive one. It guarantees that the right weights exist, not that gradient descent can find them. It guarantees that some width suffices, not that the required width is practical. And it says nothing about generalization. A network that perfectly approximates a function on training data may fail completely on new data.
Formal Statement
Feedforward Network with One Hidden Layer
A function of the form:

$$f(x) = \sum_{k=1}^{N} c_k \, \sigma(w_k^\top x + b_k)$$

where $\sigma$ is the activation function, $w_k \in \mathbb{R}^d$ are weight vectors, $b_k \in \mathbb{R}$ are biases, $c_k \in \mathbb{R}$ are output weights, and $N$ is the number of hidden neurons (the width).
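This definition can be sketched directly in NumPy (an illustrative helper, not library code; all names are ours):

```python
import numpy as np

def one_hidden_layer(x, W, b, c, sigma):
    """f(x) = sum_k c_k * sigma(w_k . x + b_k).

    x: input vector, shape (d,)
    W: weight matrix, shape (N, d); row k is w_k
    b: biases, shape (N,)
    c: output weights, shape (N,)
    sigma: elementwise activation function
    """
    return c @ sigma(W @ x + b)

# Example: N = 3 hidden neurons, d = 2 inputs, sigmoid activation
rng = np.random.default_rng(0)
W = rng.normal(size=(3, 2))
b = rng.normal(size=3)
c = rng.normal(size=3)
sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))

y = one_hidden_layer(np.array([0.5, -1.0]), W, b, c, sigmoid)
print(y)  # a single scalar output
```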
Universal Approximation (Cybenko, 1989)
Statement
Let $\sigma$ be any continuous sigmoidal function (i.e., $\sigma(t) \to 1$ as $t \to +\infty$ and $\sigma(t) \to 0$ as $t \to -\infty$). For any continuous function $f$ on a compact set $K \subset \mathbb{R}^d$ and any $\varepsilon > 0$, there exist $N$ and parameters $\{c_k, w_k, b_k\}_{k=1}^{N}$ such that:

$$\sup_{x \in K} \left| f(x) - \sum_{k=1}^{N} c_k \, \sigma(w_k^\top x + b_k) \right| < \varepsilon$$
Intuition
Each hidden neuron implements a soft step function in the direction $w_k$. By combining many step functions at different orientations, positions, and scales, you can approximate any continuous function to arbitrary precision. This is similar to how Fourier series approximate functions using sums of sines and cosines, but with learned basis functions.
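The step-combination intuition can be checked numerically in one dimension. The sketch below (our construction; target function and constants are illustrative) builds a staircase of steep sigmoids whose uniform error shrinks as the width grows:

```python
import numpy as np

# Staircase approximation of a continuous function by steep sigmoids:
# g(x) = f(t_0) + sum_k [f(t_k) - f(t_{k-1})] * sigma(s * (x - m_k)),
# where m_k is the midpoint of [t_{k-1}, t_k]. Each hidden neuron
# contributes one soft step; the output weight is the jump height.
f = lambda x: np.sin(2 * np.pi * x)
sigmoid = lambda t: 1.0 / (1.0 + np.exp(-np.clip(t, -60, 60)))

N = 400                            # width: number of hidden neurons
t = np.linspace(0.0, 1.0, N + 1)   # grid points on [0, 1]
jumps = np.diff(f(t))              # output weights c_k
mids = (t[:-1] + t[1:]) / 2        # step positions m_k
s = 50 * N                         # steepness: larger -> harder step

def g(x):
    x = np.atleast_1d(x)
    return f(t[0]) + sigmoid(s * (x[:, None] - mids[None, :])) @ jumps

xs = np.linspace(0.0, 1.0, 1000)
err = np.max(np.abs(g(xs) - f(xs)))
print(err)  # uniform error on the grid; shrinks as N grows
```

Doubling `N` roughly halves the error, mirroring how the theorem trades width for accuracy.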
Proof Sketch
Cybenko's proof uses the Hahn-Banach theorem. Suppose the span of $\{\sigma(w^\top x + b) : w \in \mathbb{R}^d, b \in \mathbb{R}\}$ is not dense in $C(K)$. Then by Hahn-Banach (together with the Riesz representation theorem), there exists a nonzero signed measure $\mu$ on $K$ such that $\int_K \sigma(w^\top x + b) \, d\mu(x) = 0$ for all $w, b$. Show that the sigmoidal property of $\sigma$ forces $\mu = 0$, a contradiction. Therefore the span is dense.
Why It Matters
The UAT justifies using neural networks as a flexible function class. Before this result, it was unclear whether neural networks could represent the functions arising in real-world tasks. The theorem says they can, at least in principle, even with a single hidden layer.
Failure Mode
The theorem does not bound $N$ (the required width). For some functions, $N$ may need to be exponentially large in the input dimension $d$. The theorem does not address optimization (can gradient descent find the right weights?) or generalization (will the approximation work on unseen inputs?). A network with more parameters than data points can perfectly interpolate the training data while computing arbitrary nonsense elsewhere.
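A back-of-envelope count makes the exponential width concrete (the numbers are illustrative, not a formal bound):

```python
# Covering [0,1]^d at resolution eps = 0.1 takes 10 cells per axis,
# hence 10**d cells in total. A naive "one neuron per cell"
# construction is hopeless beyond small d -- this is the curse of
# dimensionality for approximation.
cells_per_axis = 10  # = 1 / eps for eps = 0.1
for d in (1, 2, 10, 100):
    print(f"d={d}: {cells_per_axis ** d} cells")
```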
Hornik's Generalization
Hornik (1991) extended Cybenko's result beyond sigmoidal activations.
Hornik's Condition
The universal approximation property holds for any non-constant, bounded, continuous activation function $\sigma$. Hornik further showed that measurable sigmoidal functions suffice (continuity of $\sigma$ is not required). Later work (Leshno et al., 1993) showed that universal approximation holds for any activation that is not a polynomial.
This means ReLU, tanh, sigmoid, ELU, and every other activation function used in practice satisfy the UAT. The one exception: a polynomial activation of fixed degree $p$ yields a network that can only represent polynomials of degree at most $p$.
What the Theorem Does NOT Say
1. How many neurons you need. The theorem guarantees the existence of $N$ but gives no bound. For approximating a function with $d$-dimensional input to accuracy $\varepsilon$, worst-case bounds are exponential in $d$: $N = \Omega(\varepsilon^{-d})$. This is the curse of dimensionality for approximation.
2. How to find the weights. The proof is non-constructive: it provides no algorithm for computing the parameters $\{c_k, w_k, b_k\}$. In practice, we use gradient descent, which may get stuck in bad local minima or saddle points (though this is less of a problem in practice than theory predicts).
3. Whether the network generalizes. Approximating a function on a compact set $K$ to accuracy $\varepsilon$ says nothing about how the network behaves on inputs outside $K$ or on unseen inputs within $K$. Generalization requires controlling the hypothesis class complexity (via VC dimension, Rademacher complexity, etc.), which is a separate question.
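As a sketch of point 3, the snippet below uses random ReLU features with a least-squares readout (a stand-in for a trained network; all names and constants are ours). With more features than training points, the training data is interpolated exactly, yet nothing constrains the off-sample error:

```python
import numpy as np

# Interpolation is not generalization: with enough random ReLU features
# we can fit n training points exactly, but the fit off-sample is
# not controlled by that fact alone.
rng = np.random.default_rng(0)
n, N = 10, 200                       # n training points, N hidden neurons
x_train = np.sort(rng.uniform(-1, 1, n))
y_train = np.sin(3 * x_train)

w = rng.normal(size=N)               # random (frozen) hidden weights
b = rng.uniform(-1, 1, N)            # random biases

def features(x):
    # ReLU feature matrix of shape (len(x), N)
    return np.maximum(0.0, np.outer(x, w) + b)

# Min-norm least-squares readout: interpolates the n points exactly
c = np.linalg.lstsq(features(x_train), y_train, rcond=None)[0]

x_test = np.linspace(-1, 1, 200)
train_err = np.max(np.abs(features(x_train) @ c - y_train))
test_err = np.max(np.abs(features(x_test) @ c - np.sin(3 * x_test)))
print(train_err, test_err)  # train error ~ 0; test error need not be
```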
Width vs Depth
Depth Separation
Statement
There exist families of functions that can be represented exactly by a ReLU network of depth $k$ and width $O(1)$, but that require width $2^{\Omega(k)}$ if restricted to depth 2 (a single hidden layer). Deeper networks can be exponentially more parameter-efficient than shallow ones.
Intuition
Depth enables composition. A function that is naturally expressed as $g_k \circ g_{k-1} \circ \cdots \circ g_1$ can be represented with $O(k)$ parameters using $k$ layers, each implementing one $g_i$. A single hidden layer must "flatten" this composition, requiring exponentially many neurons to represent all the intermediate computations simultaneously.
Proof Sketch
Telgarsky (2016) constructed explicit triangle-wave functions that can be represented by depth-$k$ ReLU networks with $O(k)$ neurons but require $2^{\Omega(k)}$ neurons in any depth-2 network. The argument uses the observation that composing the two-ReLU triangle map with itself doubles the number of oscillations, so a depth-$k$ network can produce $2^k$ linear regions, which a shallow network needs exponentially many neurons to match (a depth-2 ReLU network with $N$ neurons has at most $N + 1$ linear pieces).
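The doubling mechanism can be observed numerically. This sketch (our rendering of the tent map, evaluated on a fine grid) counts oscillations after repeated composition:

```python
import numpy as np

# Telgarsky-style triangle (tent) map: expressible with two ReLUs,
# since 2*min(x, 1-x) = 2*relu(x) - 4*relu(x - 0.5) on [0, 1].
def tent(x):
    return 2 * np.minimum(x, 1 - x)

xs = np.linspace(0.0, 1.0, 100001)
y = xs.copy()
counts = []
for depth in range(1, 5):
    y = tent(y)  # one more layer of composition
    # count strict local maxima as a proxy for oscillations
    peaks = int(np.sum((y[1:-1] > y[:-2]) & (y[1:-1] > y[2:])))
    counts.append(peaks)
print(counts)  # oscillations double with each layer of depth
```

Each composition adds a constant number of neurons but doubles the oscillation count, which a depth-2 network can only match by doubling its width.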
Why It Matters
This explains why deep networks outperform shallow wide networks in practice. Real-world functions (image classification, language modeling) are compositional in nature: edges compose into textures, textures into parts, parts into objects. Deep networks exploit this compositional structure; shallow networks cannot.
Failure Mode
Depth separation results are worst-case. For some function classes (e.g., smooth functions with bounded derivatives), shallow networks approximate almost as efficiently as deep ones. The practical advantage of depth depends on whether the target function has compositional structure.
Common Confusions
Universal approximation does not mean neural networks are the best approximators
Polynomials, Fourier series, wavelets, and splines are also universal approximators for continuous functions on compact sets (by the Stone-Weierstrass theorem or its variants). The UAT for neural networks is not unique. What makes neural networks special is their practical trainability via gradient descent and their efficiency for functions with compositional structure, not the approximation theorem itself.
ReLU networks are universal approximators even though ReLU is not bounded
Cybenko's original result required bounded (sigmoidal) activations. But ReLU networks are also universal approximators. The proof is different: ReLU networks can represent piecewise linear functions, and any continuous function on a compact set can be approximated uniformly by piecewise linear functions. The requirement is not boundedness of $\sigma$ but non-polynomiality (Leshno et al., 1993).
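The piecewise-linear route is easy to demonstrate in one dimension: the linear interpolant of $f$ at $N + 1$ knots is *exactly* a one-hidden-layer ReLU network whose output weights are the slope changes at the knots (the target function and constants below are illustrative):

```python
import numpy as np

# Piecewise-linear interpolation of f as a one-hidden-layer ReLU net:
# g(x) = f(t_0) + c_0*relu(x - t_0) + sum_k c_k*relu(x - t_k),
# where c_0 is the first slope and c_k (k >= 1) the change in slope.
f = lambda x: np.exp(-x) * np.sin(4 * x)

N = 100
t = np.linspace(0.0, 2.0, N + 1)         # knots
slopes = np.diff(f(t)) / np.diff(t)      # slope of each segment
c = np.concatenate(([slopes[0]], np.diff(slopes)))  # output weights

def g(x):
    x = np.atleast_1d(x)
    relus = np.maximum(0.0, x[:, None] - t[None, :-1])  # relu(x - t_k)
    return f(t[0]) + relus @ c

xs = np.linspace(0.0, 2.0, 1000)
err = np.max(np.abs(g(xs) - f(xs)))
print(err)  # interpolation error, O(h^2) for smooth f
```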
More parameters does not mean better approximation on test data
A network with enough parameters can approximate the training data perfectly but still fail on test data. The UAT is about approximation capacity (bias), not generalization (variance). Controlling generalization requires regularization, early stopping, or structural constraints on the network.
Exercises
Problem
Consider a single hidden layer ReLU network $f(x) = \sum_{k=1}^{N} c_k \max(0, w_k x + b_k)$ with scalar input $x \in \mathbb{R}$. How many linear "pieces" can this function have? What is the maximum number of breakpoints?
Problem
A depth-$L$ ReLU network with $N$ neurons per layer can have up to $O(N^L)$ linear regions. A depth-2 network with $N$ neurons has at most $N + 1$ regions. If the target function has $R$ linear regions, express the minimum width for depth-2 and the minimum total neurons for depth-$L$ in terms of $R$.
References
Canonical:
- Cybenko, "Approximation by Superpositions of a Sigmoidal Function" (1989)
- Hornik, "Approximation Capabilities of Multilayer Feedforward Networks" (1991)
Current:
- Telgarsky, "Benefits of Depth in Neural Networks" (COLT 2016)
- Leshno et al., "Multilayer Feedforward Networks with a Nonpolynomial Activation Function Can Approximate Any Function" (Neural Networks, 1993)
- Bishop, Pattern Recognition and Machine Learning (2006), Chapters 1-14
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Feedforward Networks and Backpropagation (Layer 2)
- Differentiation in Rn (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
- Matrix Calculus (Layer 1)
- The Jacobian Matrix (Layer 0A)
- The Hessian Matrix (Layer 0A)
- Activation Functions (Layer 1)
- Convex Optimization Basics (Layer 1)
- Matrix Operations and Properties (Layer 0A)