
ML Methods

Universal Approximation Theorem

A single hidden layer neural network can approximate any continuous function on a compact set to arbitrary accuracy. Why this is both important and misleading: it says nothing about width, weight-finding, or generalization.


Why This Matters

The universal approximation theorem (UAT) answers a foundational question: can neural networks represent the functions we need them to represent? The answer is yes, under mild conditions. A single hidden layer with enough neurons can approximate any continuous function on a compact set.

[Figure: a target f(x) on [-2, 2] approximated with 3, 10, and 50 neurons. More neurons give a better approximation; with enough width, any continuous function can be approximated.]

But the theorem is frequently misunderstood. It is an existence result, not a constructive one. It guarantees that the right weights exist, not that gradient descent can find them. It guarantees that some width suffices, not that the required width is practical. And it says nothing about generalization. A network that perfectly approximates a function on training data may fail completely on new data.
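The constructive side of this picture fits in a few lines: place sharp sigmoid steps along the interval and use the target's increments as output weights, producing a staircase whose sup-norm error shrinks as the width grows. The sketch below uses numpy; the target function, interval, and steepness `k` are illustrative choices, not part of the theorem.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def staircase_approx(g, x, n_neurons, k=200.0):
    """Approximate g on [-2, 2] with n_neurons sharp sigmoid steps.

    Step j sits at c_j with output weight g(c_j) - g(c_{j-1}), so the
    partial sums reproduce g's values as a staircase.
    """
    c = np.linspace(-2.2, 2.0, n_neurons)     # step locations
    heights = np.diff(g(c), prepend=0.0)      # output weights alpha_j
    return sigmoid(k * (x[:, None] - c)) @ heights

x = np.linspace(-2, 2, 1001)
errs = {n: np.max(np.abs(np.sin(x) - staircase_approx(np.sin, x, n)))
        for n in (5, 50)}
```

With 5 neurons the staircase is visibly coarse; with 50 the sup-norm error drops roughly in proportion to the grid spacing, exactly the "more width, better approximation" behavior the theorem promises.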

Formal Statement

Definition

Feedforward Network with One Hidden Layer

A function $f: \mathbb{R}^d \to \mathbb{R}$ of the form:

$$f(x) = \sum_{j=1}^{N} \alpha_j \, \sigma(w_j^T x + b_j)$$

where $\sigma$ is the activation function, $w_j \in \mathbb{R}^d$ are weight vectors, $b_j \in \mathbb{R}$ are biases, $\alpha_j \in \mathbb{R}$ are output weights, and $N$ is the number of hidden neurons (the width).
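In code, this definition is a single matrix expression. The sketch below (numpy; function and argument names are illustrative) evaluates $f$ for a batch of inputs.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def one_hidden_layer(x, alpha, W, b, activation=sigmoid):
    """f(x) = sum_j alpha_j * activation(w_j . x + b_j).

    x: (n, d) batch of inputs, W: (N, d) weight vectors,
    b: (N,) biases, alpha: (N,) output weights -> returns shape (n,).
    """
    return activation(x @ W.T + b) @ alpha

out = one_hidden_layer(np.zeros((1, 2)), np.ones(3),
                       np.zeros((3, 2)), np.zeros(3))
```

As a sanity check: with all-zero weights and biases, each sigmoid outputs 0.5, so the network computes $0.5 \sum_j \alpha_j$ (here 1.5).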

Theorem

Universal Approximation (Cybenko, 1989)

Statement

Let $\sigma$ be any continuous sigmoidal function (i.e., $\sigma(t) \to 1$ as $t \to +\infty$ and $\sigma(t) \to 0$ as $t \to -\infty$). For any continuous function $g: K \to \mathbb{R}$ on a compact set $K \subset \mathbb{R}^d$ and any $\epsilon > 0$, there exist $N \in \mathbb{N}$ and parameters $\{\alpha_j, w_j, b_j\}_{j=1}^N$ such that:

$$\sup_{x \in K} \left| g(x) - \sum_{j=1}^{N} \alpha_j \, \sigma(w_j^T x + b_j) \right| < \epsilon$$

Intuition

Each hidden neuron $\sigma(w_j^T x + b_j)$ implements a soft step function in the direction $w_j$. By combining many step functions at different orientations, positions, and scales, you can approximate any continuous function to arbitrary precision. This is similar to how Fourier series approximate functions using sums of sines and cosines, but with learned basis functions.
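Concretely, subtracting two soft steps gives an approximate indicator ("bump") on an interval, and sums of bumps build up arbitrary continuous functions. A two-neuron sketch (the steepness `k` is an illustrative choice):

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def bump(x, a, b, k=100.0):
    # two hidden neurons: a soft step up at a minus a soft step up at b;
    # for large k this approximates the indicator function of [a, b]
    return sigmoid(k * (x - a)) - sigmoid(k * (x - b))
```

For `k=100`, `bump(0.5, 0.0, 1.0)` is essentially 1 and `bump(5.0, 0.0, 1.0)` essentially 0, so a weighted sum of bumps can match any continuous target on a fine enough grid.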

Proof Sketch

Cybenko's proof uses the Hahn-Banach theorem. Suppose the span of $\{\sigma(w^T x + b) : w, b\}$ is not dense in $C(K)$. Then by Hahn-Banach (and the Riesz representation of functionals on $C(K)$), there exists a nonzero signed measure $\mu$ on $K$ such that $\int \sigma(w^T x + b) \, d\mu(x) = 0$ for all $w, b$. The sigmoidal property of $\sigma$ then forces $\mu = 0$, a contradiction. Therefore the span is dense.

Why It Matters

The UAT justifies using neural networks as a flexible function class. Before this result, it was unclear whether neural networks could represent the functions arising in real-world tasks. The theorem says they can, at least in principle, even with a single hidden layer.

Failure Mode

The theorem does not bound $N$ (the required width). For some functions, $N$ may need to be exponentially large in the input dimension $d$. The theorem does not address optimization (can gradient descent find the right weights?) or generalization (will the approximation work on unseen inputs?). A network with $N$ parameters fit to $n < N$ data points can perfectly interpolate training data while computing arbitrary nonsense elsewhere.
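The interpolation danger is easy to exhibit: a three-neuron ReLU "spike" vanishes at every training input yet takes an arbitrary value between them, so adding it to any fitted network leaves the training loss unchanged. A toy construction (the training grid, gap location, and height below are made up for illustration):

```python
import numpy as np

def relu(t):
    return np.maximum(t, 0.0)

def spike(x, c, eps=0.01, height=1000.0):
    """A three-neuron ReLU 'hat' supported on (c - eps, c + eps):
    zero outside that interval, equal to `height` at x = c."""
    h = height / eps
    return h * (relu(x - (c - eps)) - 2 * relu(x - c) + relu(x - (c + eps)))

# hypothetical training inputs on a grid with spacing 0.5;
# the spike is placed between two of them, so the training set never sees it
x_train = np.linspace(-2, 2, 9)
c = 0.25
```

The spike is identically zero on `x_train` but equals 1000 at `x = 0.25`: zero training error, arbitrarily bad behavior in between.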

Hornik's Generalization

Hornik (1991) extended Cybenko's result beyond sigmoidal activations.

Definition

Hornik's Condition

The universal approximation property holds for any non-constant, bounded, continuous activation function $\sigma$. Hornik further showed that measurable sigmoidal functions suffice (continuity of $\sigma$ is not required). Later work (Leshno et al., 1993) showed that universal approximation holds for any activation that is not a polynomial.

This means ReLU, tanh, sigmoid, ELU, and every activation function used in practice satisfy the UAT. The one exception: a single hidden layer with a polynomial activation of degree $k$ can only represent polynomials of degree at most $k$.
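The polynomial exception can be checked numerically: with a squared activation, the whole network collapses to a quadratic in the input, which an exact degree-2 polynomial fit recovers. A sketch with randomly drawn illustrative parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 8
W, b, alpha = rng.normal(size=N), rng.normal(size=N), rng.normal(size=N)

x = np.linspace(-3, 3, 50)
# one hidden layer with activation sigma(t) = t^2: each neuron
# (W_j x + b_j)^2 is quadratic in x, so the weighted sum is again
# a polynomial of degree at most 2 -- no matter how large N is
f = ((np.outer(x, W) + b) ** 2) @ alpha

coeffs = np.polyfit(x, f, deg=2)  # least-squares degree-2 fit
```

The degree-2 fit reproduces the network's output to machine precision, confirming that width buys nothing here: the function class is stuck at degree 2.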

What the Theorem Does NOT Say

1. How many neurons you need. The theorem guarantees existence of $N$ but gives no bound. For approximating a function with $d$-dimensional input to accuracy $\epsilon$, worst-case bounds are exponential: $N = \Omega(1/\epsilon^d)$. This is the curse of dimensionality for approximation.

2. How to find the weights. The proof is non-constructive. It does not provide an algorithm for computing $\alpha_j, w_j, b_j$. In practice, we use gradient descent, which may get stuck in bad local minima or saddle points (though this is less of a problem in practice than theory predicts).

3. Whether the network generalizes. Approximating a function on a compact set to accuracy $\epsilon$ says nothing about how the network behaves on inputs outside $K$ or on unseen inputs within $K$. Generalization requires controlling the hypothesis class complexity (via VC dimension, Rademacher complexity, etc.), which is a separate question.
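For point 1, a back-of-the-envelope count shows where the exponential comes from: resolving the unit cube at a fixed resolution per axis requires a number of grid cells exponential in $d$, and a bump-per-cell construction needs a neuron group per cell. The numbers below are illustrative:

```python
d = 10
per_axis = 10          # grid points per axis, i.e. resolution epsilon = 0.1
cells = per_axis ** d  # cells needed to tile [0, 1]^10 at that resolution
print(cells)           # ten billion cells, hence ~10^10 neuron groups
```

Already at $d = 10$ and a coarse resolution, a cell-by-cell construction needs on the order of $10^{10}$ neurons, which is the $1/\epsilon^d$ scaling in the worst-case bound.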

Width vs Depth

Theorem

Depth Separation

Statement

There exist families of functions that can be represented exactly by a ReLU network of depth $k$ and width $O(d)$, but require width $\Omega(2^{k-1})$ if restricted to depth 2 (a single hidden layer). Deeper networks can be exponentially more parameter-efficient than shallow ones.

Intuition

Depth enables composition. A function that is naturally expressed as $f_k \circ f_{k-1} \circ \cdots \circ f_1$ can be represented with $O(kd)$ parameters using $k$ layers, each implementing one $f_i$. A single hidden layer must "flatten" this composition, requiring exponentially many neurons to represent all the intermediate computations simultaneously.

Proof Sketch

Telgarsky (2016) constructed explicit triangle-wave functions that can be represented by depth-$k$ ReLU networks with $O(k)$ neurons but require $\Omega(2^{k/2})$ neurons in any depth-2 network. The argument uses the observation that composing a tent map with itself doubles the number of linear regions, so a depth-$k$ network can have $2^k$ linear regions, while each neuron in a single hidden layer adds at most one breakpoint, so a shallow network needs exponentially many neurons to match.
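Telgarsky's construction is short enough to write down: a two-ReLU tent map composed with itself $k$ times oscillates between 0 and 1 at the dyadic points, giving $2^k$ linear pieces from only $2k$ neurons. A sketch on $[0, 1]$:

```python
import numpy as np

def relu(t):
    return np.maximum(t, 0.0)

def tent(x):
    # two ReLU neurons: 2x on [0, 1/2], 2(1 - x) on [1/2, 1]
    return 2 * relu(x) - 4 * relu(x - 0.5)

def deep_tent(x, k):
    # k-fold composition: one tent per layer, 2k neurons in total
    for _ in range(k):
        x = tent(x)
    return x

# deep_tent(., 3) hits 0 and 1 alternately at the dyadic points i/8,
# so it has 2^3 = 8 linear pieces
xs = np.arange(9) / 8
vals = deep_tent(xs, 3)
```

Six neurons in depth 3 produce 8 linear pieces; matching that with one hidden layer requires a number of neurons on the order of the number of pieces, and the gap widens exponentially as $k$ grows.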

Why It Matters

This helps explain why deep networks outperform shallow wide networks in practice. Real-world functions (image classification, language modeling) are compositional in nature: edges compose into textures, textures into parts, parts into objects. Deep networks exploit this compositional structure; shallow networks cannot.

Failure Mode

Depth separation results are worst-case. For some function classes (e.g., smooth functions with bounded derivatives), shallow networks approximate almost as efficiently as deep ones. The practical advantage of depth depends on whether the target function has compositional structure.

Common Confusions

Watch Out

Universal approximation does not mean neural networks are the best approximators

Polynomials, Fourier series, wavelets, and splines are also universal approximators for continuous functions on compact sets (by the Stone-Weierstrass theorem or its variants). The UAT for neural networks is not unique. What makes neural networks special is their practical trainability via gradient descent and their efficiency for functions with compositional structure, not the approximation theorem itself.

Watch Out

ReLU networks are universal approximators even though ReLU is not bounded

Cybenko's original result required bounded (sigmoidal) activations. But ReLU networks are also universal approximators. The proof is different: ReLU networks can represent piecewise linear functions, and any continuous function on a compact set can be approximated uniformly by piecewise linear functions. The requirement is not boundedness of $\sigma$ but non-polynomiality (Leshno et al., 1993).
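The ReLU argument is constructive: any continuous $g$ can be matched at grid points by a one-hidden-layer ReLU network whose output weights encode the slope changes of the piecewise linear interpolant. A sketch (knot placement and target are illustrative):

```python
import numpy as np

def relu(t):
    return np.maximum(t, 0.0)

def pwl_relu_interpolant(g, knots):
    """One-hidden-layer ReLU net equal to the piecewise linear
    interpolant of g at the knots (on [knots[0], knots[-1]])."""
    y = g(knots)
    slopes = np.diff(y) / np.diff(knots)
    coeffs = np.diff(slopes, prepend=0.0)  # slope change at each knot
    return lambda x: y[0] + relu(x[:, None] - knots[:-1]) @ coeffs

knots = np.linspace(-2, 2, 21)
f = pwl_relu_interpolant(np.sin, knots)
x = np.linspace(-2, 2, 1001)
max_err = np.max(np.abs(f(x) - np.sin(x)))
```

The network matches $\sin$ exactly at every knot, and the uniform error between knots shrinks like the squared grid spacing, which is the piecewise linear approximation step in the ReLU universality proof.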

Watch Out

More parameters does not mean better approximation on test data

A network with $10^6$ parameters can approximate the training data perfectly but fail on test data. The UAT is about approximation capacity (bias), not generalization (variance). Controlling generalization requires regularization, early stopping, or structural constraints on the network.

Exercises

ExerciseCore

Problem

Consider a single hidden layer ReLU network $f(x) = \sum_{j=1}^N \alpha_j \max(w_j x + b_j, 0)$ for $x \in \mathbb{R}$. How many linear "pieces" can this function have? What is the maximum number of breakpoints?

ExerciseAdvanced

Problem

A depth-$k$ ReLU network with $n$ neurons per layer can have up to $O(n^k)$ linear regions. A depth-2 network with $W$ neurons has at most $O(W)$ regions. If the target function has $R$ linear regions, express the minimum width $W$ for depth-2 and the minimum total neurons for depth-$k$ in terms of $R$.

References

Canonical:

  • Cybenko, "Approximation by Superpositions of a Sigmoidal Function" (1989)
  • Hornik, "Approximation Capabilities of Multilayer Feedforward Networks" (1991)

Current:

  • Telgarsky, "Benefits of Depth in Neural Networks" (COLT 2016)

  • Leshno et al., "Multilayer Feedforward Networks with a Nonpolynomial Activation Function Can Approximate Any Function" (Neural Networks, 1993)

  • Bishop, Pattern Recognition and Machine Learning (2006), Chapters 1-14

Last reviewed: April 2026
