
ML Methods

Feedforward Networks and Backpropagation

Feedforward neural networks as compositions of affine transforms and nonlinearities, the universal approximation theorem, and backpropagation as reverse-mode automatic differentiation on the computational graph.


Why This Matters

Figure: a three-layer network. The forward pass computes activations left to right ($h_1 = \sigma(W_1 x)$, $h_2 = \sigma(W_2 h_1)$, $\hat{y} = W_3 h_2$, loss $L$); the backward pass propagates gradients right to left via the chain rule, e.g. $\partial L/\partial W_1 = (\partial L/\partial \hat{y} \cdot \partial \hat{y}/\partial h_2 \cdot \partial h_2/\partial h_1) \cdot \sigma'(W_1 x) \cdot x$.

Feedforward networks are the foundation of modern deep learning. The history of deep learning traces back through multiple waves of enthusiasm and disillusionment, but every architecture -- convolutional networks, transformers, residual networks -- is built from the same primitive: layers of affine transformations followed by nonlinear activation functions. Understanding this primitive deeply -- what it can represent, how to train it, and what goes wrong -- is essential before studying any specialized architecture.

Backpropagation is the algorithm that makes training possible. It is not a learning algorithm itself -- it is an efficient method for computing gradients. Without backpropagation, training a network with $n$ parameters would require $O(n)$ forward passes to compute the gradient (one per parameter). Backpropagation computes the full gradient in a single backward pass, making deep learning computationally feasible.

Mental Model

A feedforward network is a pipeline: input flows through a series of layers, each applying a linear transformation (matrix multiply + bias) followed by a nonlinearity (ReLU, sigmoid, tanh). The output of each layer is the input to the next. Training means adjusting the matrices and biases so that the final output matches the desired labels.

Backpropagation computes how much each weight contributed to the error by propagating the error signal backward through the network, using the chain rule at each layer. It is reverse-mode automatic differentiation applied to the computational graph of the network.

Feedforward Architecture

Definition

Feedforward Neural Network

An $L$-layer feedforward neural network computes:

$$f(x; \theta) = W_L\, \sigma(W_{L-1}\, \sigma(\cdots \sigma(W_1 x + b_1) \cdots) + b_{L-1}) + b_L$$

where $W_\ell \in \mathbb{R}^{d_\ell \times d_{\ell-1}}$ are weight matrices, $b_\ell \in \mathbb{R}^{d_\ell}$ are bias vectors, and $\sigma(\cdot)$ is a nonlinear activation function applied elementwise. The parameters are $\theta = \{W_1, b_1, \ldots, W_L, b_L\}$.

Layer by layer, the computation is:

$$a_\ell = W_\ell z_{\ell-1} + b_\ell \qquad \text{(pre-activation)}$$

$$z_\ell = \sigma(a_\ell) \qquad \text{(post-activation)}$$

where $z_0 = x$ is the input and $z_L = f(x; \theta)$ is the output (the final layer may omit the nonlinearity for regression, or use softmax for classification).
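The layer-by-layer computation can be sketched in a few lines of numpy (a minimal illustration; the layer sizes, ReLU choice, and linear output layer are arbitrary assumptions):

```python
import numpy as np

def relu(a):
    return np.maximum(0.0, a)

def forward(x, weights, biases):
    """z_0 = x; a_l = W_l z_{l-1} + b_l; z_l = relu(a_l).
    The final layer is affine only, as for regression."""
    z = x
    for W, b in zip(weights[:-1], biases[:-1]):
        z = relu(W @ z + b)
    return weights[-1] @ z + biases[-1]

# A tiny 2 -> 4 -> 3 -> 1 network with random parameters
rng = np.random.default_rng(0)
dims = [2, 4, 3, 1]
weights = [rng.standard_normal((dout, din)) for din, dout in zip(dims[:-1], dims[1:])]
biases = [np.zeros(dout) for dout in dims[1:]]

y = forward(np.array([1.0, -2.0]), weights, biases)
print(y.shape)  # (1,)
```

Each weight matrix has shape $d_\ell \times d_{\ell-1}$, so the matrix-vector product maps the previous layer's activations to the next layer's pre-activations.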

Definition

Width and Depth

The width of layer $\ell$ is $d_\ell$ (the number of neurons). The depth of the network is $L$ (the number of layers). A "deep" network has many layers; a "wide" network has many neurons per layer. Depth enables hierarchical feature composition; width enables representing more features at each level.

Universal Approximation

Theorem

Universal Approximation Theorem

Statement

For any continuous function $g: K \to \mathbb{R}$ on a compact set $K \subset \mathbb{R}^d$, any $\epsilon > 0$, and any sigmoidal (more generally, any continuous non-polynomial) activation $\sigma$, there exists a single-hidden-layer network $f(x) = \sum_{j=1}^N \alpha_j \sigma(w_j^\top x + b_j)$ such that:

$$\sup_{x \in K} |f(x) - g(x)| < \epsilon$$

That is, feedforward networks with one hidden layer are dense in the space of continuous functions on compact sets, in the supremum norm.

Intuition

A single hidden layer with enough neurons can approximate any continuous function to arbitrary accuracy. Each neuron $\sigma(w_j^\top x + b_j)$ carves out a half-space in input space. With enough half-spaces, you can build any shape. Think of it as approximation by a very flexible linear combination of basis functions.
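The "flexible basis functions" view can be demonstrated numerically: freeze random hidden weights and fit only the output coefficients $\alpha_j$ by least squares (a sketch, not Cybenko's construction; the target $\sin(2\pi x)$, the width, and the weight scales are arbitrary assumptions):

```python
import numpy as np

# Approximate g(x) = sin(2*pi*x) on [0, 1] with f(x) = sum_j alpha_j * relu(w_j x + b_j).
# Hidden weights are random and frozen; only alpha is fit by least squares.
rng = np.random.default_rng(1)
N = 200                                 # hidden width
w = rng.normal(scale=10.0, size=N)      # random slopes
b = rng.uniform(-10.0, 10.0, size=N)    # random offsets

x = np.linspace(0.0, 1.0, 500)
g = np.sin(2 * np.pi * x)

Phi = np.maximum(0.0, np.outer(x, w) + b)    # (500, N) feature matrix
alpha, *_ = np.linalg.lstsq(Phi, g, rcond=None)

err = np.max(np.abs(Phi @ alpha - g))        # sup-norm error on the grid
print(err)  # small at this width
```

Each ReLU feature contributes a kink, so the fit is piecewise linear with hundreds of knots -- more than enough to track one period of a sine.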

Proof Sketch

The original proof by Cybenko (1989) for sigmoid activations uses the Hahn-Banach theorem and the Riesz representation theorem. The idea: suppose the set of functions representable by the network is not dense. Then by Hahn-Banach, there exists a nonzero bounded linear functional that vanishes on all such functions. This functional corresponds to a signed measure $\mu$. Show that $\int \sigma(w^\top x + b) \, d\mu(x) = 0$ for all $w, b$ implies $\mu = 0$, giving a contradiction.

Why It Matters

Universal approximation tells you that neural networks are expressive enough -- the hypothesis class is rich enough to approximate any target. But it says absolutely nothing about:

  • How to find the weights: the optimization landscape is non-convex
  • How many neurons are needed: the width $N$ may need to be exponentially large in $d$
  • Whether the learned function generalizes: expressiveness is about approximation error, not estimation error

This is an existence theorem, not a constructive one. It guarantees the capacity is there but gives no guidance on learning.

Failure Mode

The theorem requires arbitrary width. For fixed-width networks, depth matters: there exist functions that require exponential width with bounded depth but only polynomial width with sufficient depth. This is the theoretical motivation for deep (many-layer) networks over wide shallow ones.

Backpropagation

Backpropagation computes the gradient $\nabla_\theta \mathcal{L}$ of the loss with respect to all parameters. It is the chain rule applied systematically from output to input.

Forward pass: compute and store all pre-activations $a_\ell$ and post-activations $z_\ell$ for $\ell = 1, \ldots, L$.

Backward pass: starting from the loss gradient $\delta_L = \partial \mathcal{L} / \partial a_L$, propagate backward:

$$\delta_\ell = (W_{\ell+1}^\top \delta_{\ell+1}) \odot \sigma'(a_\ell)$$

where $\odot$ denotes elementwise multiplication and $\sigma'(a_\ell)$ is the derivative of the activation at the pre-activation values.

The parameter gradients at layer $\ell$ are:

$$\frac{\partial \mathcal{L}}{\partial W_\ell} = \delta_\ell z_{\ell-1}^\top, \qquad \frac{\partial \mathcal{L}}{\partial b_\ell} = \delta_\ell$$
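These two formulas are enough to implement backpropagation and check it against finite differences (a minimal sketch assuming ReLU hidden layers, a linear output layer, and squared-error loss):

```python
import numpy as np

def relu(a):
    return np.maximum(0.0, a)

def drelu(a):
    return (a > 0).astype(float)

def forward(x, Ws, bs):
    """Return pre-activations a_l and post-activations z_l (z_0 = x)."""
    zs, pre = [x], []
    for l, (W, b) in enumerate(zip(Ws, bs)):
        a = W @ zs[-1] + b
        pre.append(a)
        zs.append(a if l == len(Ws) - 1 else relu(a))  # linear output layer
    return pre, zs

def backward(x, y, Ws, bs):
    """Gradients of L = 0.5 * ||z_L - y||^2 via the backward recurrence."""
    pre, zs = forward(x, Ws, bs)
    delta = zs[-1] - y                       # dL/da_L for a linear output
    gWs, gbs = [None] * len(Ws), [None] * len(Ws)
    for l in reversed(range(len(Ws))):
        gWs[l] = np.outer(delta, zs[l])      # dL/dW_l = delta_l z_{l-1}^T
        gbs[l] = delta                       # dL/db_l = delta_l
        if l > 0:
            delta = (Ws[l].T @ delta) * drelu(pre[l - 1])
    return gWs, gbs

# Gradient check on a single weight via central finite differences
rng = np.random.default_rng(0)
dims = [3, 4, 2]
Ws = [rng.standard_normal((dout, din)) for din, dout in zip(dims[:-1], dims[1:])]
bs = [rng.standard_normal(dout) for dout in dims[1:]]
x, y = rng.standard_normal(3), rng.standard_normal(2)

def loss():
    return 0.5 * np.sum((forward(x, Ws, bs)[1][-1] - y) ** 2)

gWs, _ = backward(x, y, Ws, bs)
eps = 1e-6
Ws[0][1, 2] += eps; lp = loss()
Ws[0][1, 2] -= 2 * eps; lm = loss()
Ws[0][1, 2] += eps                           # restore
fd = (lp - lm) / (2 * eps)
print(abs(fd - gWs[0][1, 2]))  # tiny: analytic and numeric gradients agree
```

A gradient check like this is the standard way to validate a hand-written backward pass before trusting it for training.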

Proposition

Backpropagation Complexity

Statement

For a feedforward network with $n$ total parameters, backpropagation computes the full gradient $\nabla_\theta \mathcal{L}$ in $O(n)$ time -- the same order as a single forward pass. Computing each partial derivative individually by finite differences would require $O(n)$ forward passes, giving $O(n^2)$ total time.

Intuition

The key insight is that backpropagation reuses intermediate computations. The error signal $\delta_\ell$ at layer $\ell$ is computed from $\delta_{\ell+1}$ by a single matrix-vector multiply. Each layer adds $O(d_\ell d_{\ell-1})$ work -- the same as the forward pass through that layer. Since the forward pass is $O(n)$ (summing over all layer sizes), the backward pass is also $O(n)$.

This is not a coincidence. Backpropagation is reverse-mode automatic differentiation: it computes all $n$ partial derivatives simultaneously by traversing the computational graph once in reverse.

Proof Sketch

The forward pass through layer $\ell$ computes $a_\ell = W_\ell z_{\ell-1} + b_\ell$ (cost $O(d_\ell d_{\ell-1})$) and $z_\ell = \sigma(a_\ell)$ (cost $O(d_\ell)$). Total forward cost: $\sum_\ell O(d_\ell d_{\ell-1}) = O(n)$.

The backward pass at layer $\ell$ computes $\delta_\ell = (W_{\ell+1}^\top \delta_{\ell+1}) \odot \sigma'(a_\ell)$ (cost $O(d_{\ell+1} d_\ell)$) and $\partial \mathcal{L}/\partial W_\ell = \delta_\ell z_{\ell-1}^\top$ (cost $O(d_\ell d_{\ell-1})$). Total backward cost: $O(n)$.

Why It Matters

This linear-time complexity is what makes deep learning possible. A modern language model has billions of parameters. Computing the gradient in $O(n)$ rather than $O(n^2)$ time is the difference between training in weeks and never training at all.

Failure Mode

Backpropagation requires storing all intermediate activations $z_\ell$ for the backward pass. This is the memory bottleneck of training: memory scales linearly with depth and batch size. Gradient checkpointing trades computation for memory by recomputing activations during the backward pass instead of storing them.

Vanishing and Exploding Gradients

The backward recurrence $\delta_\ell = (W_{\ell+1}^\top \delta_{\ell+1}) \odot \sigma'(a_\ell)$ multiplies the error signal by $W_{\ell+1}^\top$ and $\sigma'(a_\ell)$ at each layer. Over $L$ layers, roughly:

$$\|\delta_1\| \approx \|\delta_L\| \prod_{\ell=1}^{L-1} \|W_{\ell+1}^\top\| \cdot |\sigma'(a_\ell)|$$

  • If $\|W_\ell^\top\| \cdot |\sigma'| < 1$ at most layers, the product shrinks exponentially: vanishing gradients. Early layers learn extremely slowly.
  • If $\|W_\ell^\top\| \cdot |\sigma'| > 1$ at most layers, the product grows exponentially: exploding gradients. Training becomes unstable.

Activation Functions

Definition

ReLU

The Rectified Linear Unit is $\sigma(a) = \max(0, a)$. Its derivative is $\sigma'(a) = 1$ for $a > 0$ and $\sigma'(a) = 0$ for $a < 0$. ReLU does not saturate for positive inputs, which mitigates vanishing gradients. However, neurons with $a < 0$ have zero gradient ("dying ReLU").

Sigmoid: $\sigma(a) = 1/(1 + e^{-a})$, with $\sigma' = \sigma(1 - \sigma) \leq 0.25$. Since the derivative never exceeds $0.25$, the activation factor alone shrinks the gradient by at least $4\times$ per layer, causing severe vanishing gradients in deep networks.

Tanh: $\sigma(a) = \tanh(a)$, with $\sigma' = 1 - \tanh^2(a) \leq 1$. Better than sigmoid (centered at zero, larger maximum derivative) but still saturates for large $|a|$.

ReLU is the default activation for hidden layers in modern networks because it avoids the saturation problem entirely for positive inputs.
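The derivative bounds quoted above are easy to confirm numerically (the grid here is an arbitrary choice):

```python
import numpy as np

a = np.linspace(-10.0, 10.0, 100001)    # symmetric grid containing a = 0
s = 1.0 / (1.0 + np.exp(-a))            # sigmoid

ds_max = np.max(s * (1 - s))            # sigmoid' = s(1-s), peaks at a = 0
dt_max = np.max(1.0 - np.tanh(a) ** 2)  # tanh' = 1 - tanh^2, peaks at a = 0
print(ds_max, dt_max)  # 0.25 1.0
```

The factor-of-4 gap between these maxima is one reason tanh trains noticeably better than sigmoid in moderately deep networks.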

Weight Initialization

Proper initialization ensures that the variance of activations and gradients remains roughly constant across layers at the start of training.

Definition

Xavier (Glorot) Initialization

For a layer with $d_{\text{in}}$ inputs and $d_{\text{out}}$ outputs, Xavier initialization sets:

$$W_{ij} \sim \mathcal{N}\!\left(0, \frac{2}{d_{\text{in}} + d_{\text{out}}}\right)$$

This is derived by requiring $\text{Var}(z_\ell) = \text{Var}(z_{\ell-1})$ for linear activations ($\sigma(a) = a$). It works well with tanh and sigmoid activations.

Definition

He Initialization

For layers using ReLU, He initialization sets:

$$W_{ij} \sim \mathcal{N}\!\left(0, \frac{2}{d_{\text{in}}}\right)$$

The factor of 2 accounts for the fact that ReLU zeros out half of the activations (those with $a < 0$), which halves the variance. Without this correction, activations shrink toward zero in deeper layers.
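The effect is easy to see empirically: push random inputs through a deep stack of ReLU layers and compare He scaling with the uncorrected linear-activation scale $\sqrt{1/d_{\text{in}}}$ (a sketch; the width, depth, and batch size are arbitrary assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
d, depth, batch = 512, 30, 256

def final_std(scale):
    """Std of the activations after `depth` ReLU layers with N(0, scale^2) weights."""
    z = rng.standard_normal((d, batch))
    for _ in range(depth):
        W = rng.standard_normal((d, d)) * scale
        z = np.maximum(0.0, W @ z)
    return z.std()

he = final_std(np.sqrt(2.0 / d))   # He: variance preserved, std stays O(1)
xa = final_std(np.sqrt(1.0 / d))   # no ReLU correction: shrinks ~ 2^(-depth/2)
print(he, xa)
```

Without the factor of 2, each ReLU layer halves the variance, so after 30 layers the signal is smaller by roughly $2^{15}$.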

Canonical Examples

Example

Two-layer network for XOR

The XOR function $y = x_1 \oplus x_2$ is not linearly separable. A two-layer network with 2 hidden neurons can compute it exactly:

  • Hidden layer: $h_1 = \text{ReLU}(x_1 + x_2)$, $h_2 = \text{ReLU}(x_1 + x_2 - 1)$
  • Output: $y = h_1 - 2h_2$

The hidden layer creates a new representation where XOR becomes linearly separable. This is the fundamental power of neural networks: learning useful intermediate representations.
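Such a construction can be verified exhaustively; the sketch below uses one standard choice of weights ($h_1 = \text{ReLU}(x_1 + x_2)$, $h_2 = \text{ReLU}(x_1 + x_2 - 1)$, $y = h_1 - 2h_2$), one of many that work:

```python
import numpy as np

def relu(a):
    return np.maximum(0.0, a)

# h = ReLU(W1 @ x + b1), y = w2 . h
W1 = np.array([[1.0, 1.0],
               [1.0, 1.0]])
b1 = np.array([0.0, -1.0])
w2 = np.array([1.0, -2.0])

truth = {}
for x1 in (0, 1):
    for x2 in (0, 1):
        h = relu(W1 @ np.array([x1, x2], dtype=float) + b1)
        truth[(x1, x2)] = int(round(w2 @ h))

print(truth)  # {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}
```

In the hidden coordinates $(h_1, h_2)$ the four inputs become linearly separable, which is exactly what the output layer exploits.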

Example

Vanishing gradient with sigmoid

Consider a 10-layer network with sigmoid activations. At each layer, the gradient is multiplied by $\sigma'(a) \leq 0.25$. Passing backward through the 9 hidden layers, the gradient at the first layer is at most $(0.25)^9 \approx 3.8 \times 10^{-6}$ times the gradient at the output. With ReLU and proper initialization, the gradient passes through unchanged (for active neurons), preserving the learning signal across all layers.
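A quick simulation makes the contrast concrete: run the backward recurrence $\delta_\ell = (W_{\ell+1}^\top \delta_{\ell+1}) \odot \sigma'(a_\ell)$ through 10 random layers with sigmoid versus ReLU activations (a sketch; the width and weight scale are arbitrary assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
d, depth = 100, 10

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def grad_norm_at_input(act, dact):
    """Forward through `depth` random layers, then apply the backward
    recurrence delta_l = (W_{l+1}^T delta_{l+1}) * sigma'(a_l)."""
    z = rng.standard_normal(d)
    layers = []
    for _ in range(depth):
        W = rng.standard_normal((d, d)) / np.sqrt(d)
        a = W @ z
        layers.append((W, a))
        z = act(a)
    delta = rng.standard_normal(d)     # stand-in for dL/da at the output
    for W, a in reversed(layers):
        delta = (W.T @ delta) * dact(a)
    return np.linalg.norm(delta)

n_sig = grad_norm_at_input(sigmoid, lambda a: sigmoid(a) * (1 - sigmoid(a)))
n_relu = grad_norm_at_input(lambda a: np.maximum(0.0, a),
                            lambda a: (a > 0).astype(float))
print(n_sig, n_relu)  # the sigmoid gradient norm is orders of magnitude smaller
```

The ReLU gradient also shrinks somewhat (inactive neurons zero out entries), but nothing like the exponential decay the sigmoid's $\sigma' \leq 0.25$ factor imposes.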

Common Confusions

Watch Out

Universal approximation does not mean easy to train

The universal approximation theorem is an existence result: a wide enough network can approximate any function. But finding the right weights via gradient descent on a non-convex loss landscape is a completely separate problem. The loss may have many local minima, saddle points, and flat regions. In practice, overparameterized networks are easier to optimize (the loss landscape becomes more benign), but this is an empirical observation, not a consequence of the approximation theorem.

Watch Out

Backpropagation is not a learning algorithm

Backpropagation computes gradients. That is all it does. The learning algorithm is gradient descent (or Adam, or SGD with momentum, etc.), which uses the gradients that backpropagation provides. Saying "we train with backpropagation" is technically imprecise -- you train with gradient descent, and backpropagation is the subroutine that makes gradient computation efficient.

Watch Out

Deep networks are not just wide networks stacked

Adding depth is qualitatively different from adding width. Depth enables compositional representations: early layers learn simple features, later layers compose them into complex features. Width at a single layer enables representing more features at one level of abstraction. There are functions (like parity) that require exponential width with bounded depth but only polynomial size with sufficient depth.

Summary

  • A feedforward network is a composition of affine transforms + nonlinearities
  • Universal approximation: one hidden layer suffices for approximation, but says nothing about learning or generalization
  • Backpropagation = reverse-mode autodiff = chain rule applied backward through the computational graph
  • Backprop computes the full gradient in $O(n)$ time, same as a forward pass
  • Vanishing gradients: sigmoid/tanh derivatives shrink the gradient exponentially with depth
  • ReLU avoids saturation for positive inputs; it is the default activation
  • Xavier initialization preserves variance for linear/tanh; He initialization corrects for ReLU zeroing half the activations

Exercises

ExerciseCore

Problem

Consider a 3-layer network with input $x \in \mathbb{R}^2$, hidden layer sizes $d_1 = 4$, $d_2 = 3$, and scalar output. How many parameters (weights and biases) does this network have?

ExerciseAdvanced

Problem

Derive the backpropagation recurrence. Starting from the loss $\mathcal{L} = L(z_L, y)$ and the layer equations $a_\ell = W_\ell z_{\ell-1} + b_\ell$, $z_\ell = \sigma(a_\ell)$, show that:

$$\frac{\partial \mathcal{L}}{\partial a_\ell} = \left(W_{\ell+1}^\top \frac{\partial \mathcal{L}}{\partial a_{\ell+1}}\right) \odot \sigma'(a_\ell)$$

ExerciseResearch

Problem

The universal approximation theorem guarantees a single hidden layer suffices for approximation. Why, then, do we use deep networks? Give a concrete example of a function family where depth gives an exponential advantage over width.

References

Canonical:

  • Cybenko, "Approximation by Superpositions of a Sigmoidal Function" (1989) -- universal approximation for sigmoid
  • Hornik, Stinchcombe, White, "Multilayer Feedforward Networks are Universal Approximators" (1989) -- general version
  • Rumelhart, Hinton, Williams, "Learning Representations by Back-Propagating Errors" (1986) -- backpropagation
  • Glorot & Bengio, "Understanding the Difficulty of Training Deep Feedforward Neural Networks" (2010) -- Xavier initialization

Current:

  • He, Zhang, Ren, Sun, "Delving Deep into Rectifiers" (2015) -- He initialization
  • Goodfellow, Bengio, Courville, Deep Learning (2016), Chapters 6-8


Last reviewed: April 2026
