
ML Methods

Feedforward Networks and Backpropagation

Feedforward neural networks as compositions of affine transforms and nonlinearities, the universal approximation theorem, and backpropagation as reverse-mode automatic differentiation on the computational graph.


Why This Matters

Figure: a three-layer network. The forward pass computes activations left to right ($h_1 = \sigma(W_1 x)$, $h_2 = \sigma(W_2 h_1)$, $\hat{y} = W_3 h_2$, loss $L$); the backward pass propagates gradients right to left via the chain rule, e.g. $\partial L/\partial W_1 = (\partial L/\partial \hat{y} \cdot \partial \hat{y}/\partial h_2 \cdot \partial h_2/\partial h_1) \cdot \sigma'(W_1 x) \cdot x$.

Feedforward networks are the foundation of modern deep learning. The history of deep learning traces back through multiple waves of enthusiasm and disillusionment, but every architecture -- convolutional networks, transformers, residual networks -- is built from the same primitive: layers of affine transformations followed by nonlinear activation functions. Understanding this primitive deeply -- what it can represent, how to train it, and what goes wrong -- is essential before studying any specialized architecture.

Backpropagation is the algorithm that makes training possible. It is not a learning algorithm itself -- it is an efficient method for computing gradients. Without backpropagation, training a network with $n$ parameters would require $O(n)$ forward passes to compute the gradient (one per parameter). Backpropagation computes the full gradient in a single backward pass, making deep learning computationally feasible.

Mental Model

A feedforward network is a pipeline: input flows through a series of layers, each applying a linear transformation (matrix multiply + bias) followed by a nonlinearity (ReLU, sigmoid, tanh). The output of each layer is the input to the next. Training means adjusting the matrices and biases so that the final output matches the desired labels.

Backpropagation computes how much each weight contributed to the error by propagating the error signal backward through the network, using the chain rule at each layer. It is reverse-mode automatic differentiation applied to the computational graph of the network.

Feedforward Architecture

Definition

Feedforward Neural Network

An $L$-layer feedforward neural network computes:

$$f(x; \theta) = W_L\, \sigma(W_{L-1}\, \sigma(\cdots \sigma(W_1 x + b_1) \cdots) + b_{L-1}) + b_L$$

where $W_\ell \in \mathbb{R}^{d_\ell \times d_{\ell-1}}$ are weight matrices, $b_\ell \in \mathbb{R}^{d_\ell}$ are bias vectors, and $\sigma(\cdot)$ is a nonlinear activation function applied elementwise. The parameters are $\theta = \{W_1, b_1, \ldots, W_L, b_L\}$.

Layer by layer, the computation is:

$$a_\ell = W_\ell z_{\ell-1} + b_\ell \qquad \text{(pre-activation)}$$

$$z_\ell = \sigma(a_\ell) \qquad \text{(post-activation)}$$

where $z_0 = x$ is the input and $z_L = f(x; \theta)$ is the output (the final layer may omit the nonlinearity for regression, or use softmax for classification).
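The layer-by-layer computation can be sketched in a few lines of numpy (a minimal illustration; the layer sizes, ReLU choice, and linear output layer are arbitrary assumptions):

```python
import numpy as np

def relu(a):
    return np.maximum(0.0, a)

def forward(x, weights, biases):
    """z_0 = x; a_l = W_l z_{l-1} + b_l; z_l = relu(a_l).
    The final layer is affine only, as for regression."""
    z = x
    for W, b in zip(weights[:-1], biases[:-1]):
        z = relu(W @ z + b)
    return weights[-1] @ z + biases[-1]

# A tiny 2 -> 4 -> 3 -> 1 network with random parameters
rng = np.random.default_rng(0)
dims = [2, 4, 3, 1]
weights = [rng.standard_normal((dout, din)) for din, dout in zip(dims[:-1], dims[1:])]
biases = [np.zeros(dout) for dout in dims[1:]]

y = forward(np.array([1.0, -2.0]), weights, biases)
print(y.shape)  # (1,)
```

Each weight matrix has shape $d_\ell \times d_{\ell-1}$, so the matrix-vector product maps the previous layer's activations to the next layer's pre-activations.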

Definition

Width and Depth

The width of layer $\ell$ is $d_\ell$ (the number of neurons). The depth of the network is $L$ (the number of layers). A "deep" network has many layers; a "wide" network has many neurons per layer. Depth enables hierarchical feature composition; width enables representing more features at each level.

Universal Approximation

Theorem

Universal Approximation Theorem

Statement

For any continuous function $g: K \to \mathbb{R}$ on a compact set $K \subset \mathbb{R}^d$, any $\epsilon > 0$, and any sigmoidal (more generally, any continuous non-polynomial) activation $\sigma$, there exists a single-hidden-layer network $f(x) = \sum_{j=1}^N \alpha_j \sigma(w_j^\top x + b_j)$ such that:

$$\sup_{x \in K} |f(x) - g(x)| < \epsilon$$

That is, feedforward networks with one hidden layer are dense in the space of continuous functions on compact sets, in the supremum norm.

Intuition

A single hidden layer with enough neurons can approximate any continuous function to arbitrary accuracy. Each neuron $\sigma(w_j^\top x + b_j)$ carves out a half-space in input space. With enough half-spaces, you can build any shape. Think of it as approximation by a very flexible linear combination of basis functions.
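The "flexible basis functions" view can be demonstrated numerically: freeze random hidden weights and fit only the output coefficients $\alpha_j$ by least squares (a sketch, not Cybenko's construction; the target $\sin(2\pi x)$, the width, and the weight scales are arbitrary assumptions):

```python
import numpy as np

# Approximate g(x) = sin(2*pi*x) on [0, 1] with f(x) = sum_j alpha_j * relu(w_j x + b_j).
# Hidden weights are random and frozen; only alpha is fit by least squares.
rng = np.random.default_rng(1)
N = 200                                 # hidden width
w = rng.normal(scale=10.0, size=N)      # random slopes
b = rng.uniform(-10.0, 10.0, size=N)    # random offsets

x = np.linspace(0.0, 1.0, 500)
g = np.sin(2 * np.pi * x)

Phi = np.maximum(0.0, np.outer(x, w) + b)    # (500, N) feature matrix
alpha, *_ = np.linalg.lstsq(Phi, g, rcond=None)

err = np.max(np.abs(Phi @ alpha - g))        # sup-norm error on the grid
print(err)  # small at this width
```

Each ReLU feature contributes a kink, so the fit is piecewise linear with hundreds of knots -- more than enough to track one period of a sine.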

Proof Sketch

The original proof by Cybenko (1989) for sigmoid activations uses the Hahn-Banach theorem and the Riesz representation theorem. The idea: suppose the set of functions representable by the network is not dense. Then by Hahn-Banach, there exists a nonzero bounded linear functional that vanishes on all such functions. This functional corresponds to a signed measure $\mu$. Show that $\int \sigma(w^\top x + b) \, d\mu(x) = 0$ for all $w, b$ implies $\mu = 0$, giving a contradiction.

Why It Matters

Universal approximation tells you that neural networks are expressive enough -- the hypothesis class is rich enough to approximate any target. But it says absolutely nothing about:

  • How to find the weights: the optimization landscape is non-convex
  • How many neurons are needed: the width $N$ may need to be exponentially large in $d$
  • Whether the learned function generalizes: expressiveness is about approximation error, not estimation error

This is an existence theorem, not a constructive one. It guarantees the capacity is there but gives no guidance on learning.

Failure Mode

The theorem requires arbitrary width. For fixed-width networks, depth matters: there exist functions that require exponential width with bounded depth but only polynomial width with sufficient depth. This is the theoretical motivation for deep (many-layer) networks over wide shallow ones.

Backpropagation

Backpropagation computes the gradient $\nabla_\theta \mathcal{L}$ of the loss with respect to all parameters. It is the chain rule applied systematically from output to input.

Forward pass: compute and store all pre-activations $a_\ell$ and post-activations $z_\ell$ for $\ell = 1, \ldots, L$.

Backward pass: starting from the loss gradient $\delta_L = \partial \mathcal{L} / \partial a_L$, propagate backward:

$$\delta_\ell = (W_{\ell+1}^\top \delta_{\ell+1}) \odot \sigma'(a_\ell)$$

where $\odot$ denotes elementwise multiplication and $\sigma'(a_\ell)$ is the derivative of the activation at the pre-activation values.

The parameter gradients at layer $\ell$ are:

$$\frac{\partial \mathcal{L}}{\partial W_\ell} = \delta_\ell z_{\ell-1}^\top, \qquad \frac{\partial \mathcal{L}}{\partial b_\ell} = \delta_\ell$$
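These two formulas are enough to implement backpropagation and check it against finite differences (a minimal sketch assuming ReLU hidden layers, a linear output layer, and squared-error loss):

```python
import numpy as np

def relu(a):
    return np.maximum(0.0, a)

def drelu(a):
    return (a > 0).astype(float)

def forward(x, Ws, bs):
    """Return pre-activations a_l and post-activations z_l (z_0 = x)."""
    zs, pre = [x], []
    for l, (W, b) in enumerate(zip(Ws, bs)):
        a = W @ zs[-1] + b
        pre.append(a)
        zs.append(a if l == len(Ws) - 1 else relu(a))  # linear output layer
    return pre, zs

def backward(x, y, Ws, bs):
    """Gradients of L = 0.5 * ||z_L - y||^2 via the backward recurrence."""
    pre, zs = forward(x, Ws, bs)
    delta = zs[-1] - y                       # dL/da_L for a linear output
    gWs, gbs = [None] * len(Ws), [None] * len(Ws)
    for l in reversed(range(len(Ws))):
        gWs[l] = np.outer(delta, zs[l])      # dL/dW_l = delta_l z_{l-1}^T
        gbs[l] = delta                       # dL/db_l = delta_l
        if l > 0:
            delta = (Ws[l].T @ delta) * drelu(pre[l - 1])
    return gWs, gbs

# Gradient check on a single weight via central finite differences
rng = np.random.default_rng(0)
dims = [3, 4, 2]
Ws = [rng.standard_normal((dout, din)) for din, dout in zip(dims[:-1], dims[1:])]
bs = [rng.standard_normal(dout) for dout in dims[1:]]
x, y = rng.standard_normal(3), rng.standard_normal(2)

def loss():
    return 0.5 * np.sum((forward(x, Ws, bs)[1][-1] - y) ** 2)

gWs, _ = backward(x, y, Ws, bs)
eps = 1e-6
Ws[0][1, 2] += eps; lp = loss()
Ws[0][1, 2] -= 2 * eps; lm = loss()
Ws[0][1, 2] += eps                           # restore
fd = (lp - lm) / (2 * eps)
print(abs(fd - gWs[0][1, 2]))  # tiny: analytic and numeric gradients agree
```

A gradient check like this is the standard way to validate a hand-written backward pass before trusting it for training.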

Proposition

Backpropagation Complexity

Statement

For a feedforward network with $n$ total parameters, backpropagation computes the full gradient $\nabla_\theta \mathcal{L}$ in $O(n)$ time -- the same order as a single forward pass. Computing each partial derivative individually by finite differences would require $O(n)$ forward passes, giving $O(n^2)$ total time.

Intuition

The key insight is that backpropagation reuses intermediate computations. The error signal $\delta_\ell$ at layer $\ell$ is computed from $\delta_{\ell+1}$ by a single matrix-vector multiply. Each layer adds $O(d_\ell d_{\ell-1})$ work -- the same as the forward pass through that layer. Since the forward pass is $O(n)$ (summing over all layer sizes), the backward pass is also $O(n)$.

This is not a coincidence. Backpropagation is reverse-mode automatic differentiation: it computes all $n$ partial derivatives simultaneously by traversing the computational graph once in reverse.

Proof Sketch

The forward pass through layer $\ell$ computes $a_\ell = W_\ell z_{\ell-1} + b_\ell$ (cost $O(d_\ell d_{\ell-1})$) and $z_\ell = \sigma(a_\ell)$ (cost $O(d_\ell)$). Total forward cost: $\sum_\ell O(d_\ell d_{\ell-1}) = O(n)$.

The backward pass at layer $\ell$ computes $\delta_\ell = (W_{\ell+1}^\top \delta_{\ell+1}) \odot \sigma'(a_\ell)$ (cost $O(d_{\ell+1} d_\ell)$) and $\partial \mathcal{L}/\partial W_\ell = \delta_\ell z_{\ell-1}^\top$ (cost $O(d_\ell d_{\ell-1})$). Total backward cost: $O(n)$.

Why It Matters

This linear-time complexity is what makes deep learning possible. A modern language model has billions of parameters. Computing the gradient in $O(n)$ rather than $O(n^2)$ time is the difference between training in weeks and never training at all.

Failure Mode

Backpropagation requires storing all intermediate activations $z_\ell$ for the backward pass. This is the memory bottleneck of training: memory scales linearly with depth and batch size. Gradient checkpointing trades computation for memory by recomputing activations during the backward pass instead of storing them.

Vanishing and Exploding Gradients

The backward recurrence $\delta_\ell = (W_{\ell+1}^\top \delta_{\ell+1}) \odot \sigma'(a_\ell)$ multiplies the error signal by $W_{\ell+1}^\top$ and $\sigma'(a_\ell)$ at each layer. Over $L$ layers, roughly:

$$\|\delta_1\| \approx \|\delta_L\| \prod_{\ell=1}^{L-1} \|W_{\ell+1}^\top\| \cdot |\sigma'(a_\ell)|$$

  • If $\|W_\ell^\top\| \cdot |\sigma'| < 1$ at most layers, the product shrinks exponentially: vanishing gradients. Early layers learn extremely slowly.
  • If $\|W_\ell^\top\| \cdot |\sigma'| > 1$ at most layers, the product grows exponentially: exploding gradients. Training becomes unstable.

Activation Functions

Definition

ReLU

The Rectified Linear Unit is $\sigma(a) = \max(0, a)$. Its derivative is $\sigma'(a) = 1$ for $a > 0$ and $\sigma'(a) = 0$ for $a < 0$. ReLU does not saturate for positive inputs, which mitigates vanishing gradients. However, neurons with $a < 0$ have zero gradient ("dying ReLU").

Sigmoid: $\sigma(a) = 1/(1 + e^{-a})$, with $\sigma' = \sigma(1 - \sigma) \leq 0.25$. Since the derivative never exceeds $0.25$, the activation factor alone shrinks the gradient by at least $4\times$ per layer, causing severe vanishing gradients in deep networks.

Tanh: $\sigma(a) = \tanh(a)$, with $\sigma' = 1 - \tanh^2(a) \leq 1$. Better than sigmoid (centered at zero, larger maximum derivative) but still saturates for large $|a|$.

ReLU is the default activation for hidden layers in modern networks because it avoids the saturation problem entirely for positive inputs.
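The derivative bounds quoted above are easy to confirm numerically (the grid here is an arbitrary choice):

```python
import numpy as np

a = np.linspace(-10.0, 10.0, 100001)    # symmetric grid containing a = 0
s = 1.0 / (1.0 + np.exp(-a))            # sigmoid

ds_max = np.max(s * (1 - s))            # sigmoid' = s(1-s), peaks at a = 0
dt_max = np.max(1.0 - np.tanh(a) ** 2)  # tanh' = 1 - tanh^2, peaks at a = 0
print(ds_max, dt_max)  # 0.25 1.0
```

The factor-of-4 gap between these maxima is one reason tanh trains noticeably better than sigmoid in moderately deep networks.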

Weight Initialization

Proper initialization ensures that the variance of activations and gradients remains roughly constant across layers at the start of training.

Definition

Xavier (Glorot) Initialization

For a layer with $d_{\text{in}}$ inputs and $d_{\text{out}}$ outputs, Xavier initialization sets:

$$W_{ij} \sim \mathcal{N}\!\left(0, \frac{2}{d_{\text{in}} + d_{\text{out}}}\right)$$

This is derived by requiring $\text{Var}(z_\ell) = \text{Var}(z_{\ell-1})$ for linear activations ($\sigma(a) = a$). It works well with tanh and sigmoid activations.

Definition

He Initialization

For layers using ReLU, He initialization sets:

$$W_{ij} \sim \mathcal{N}\!\left(0, \frac{2}{d_{\text{in}}}\right)$$

The factor of 2 accounts for the fact that ReLU zeros out half of the activations (those with $a < 0$), which halves the variance. Without this correction, activations shrink toward zero in deeper layers.
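The effect is easy to see empirically: push random inputs through a deep stack of ReLU layers and compare He scaling with the uncorrected linear-activation scale $\sqrt{1/d_{\text{in}}}$ (a sketch; the width, depth, and batch size are arbitrary assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
d, depth, batch = 512, 30, 256

def final_std(scale):
    """Std of the activations after `depth` ReLU layers with N(0, scale^2) weights."""
    z = rng.standard_normal((d, batch))
    for _ in range(depth):
        W = rng.standard_normal((d, d)) * scale
        z = np.maximum(0.0, W @ z)
    return z.std()

he = final_std(np.sqrt(2.0 / d))   # He: variance preserved, std stays O(1)
xa = final_std(np.sqrt(1.0 / d))   # no ReLU correction: shrinks ~ 2^(-depth/2)
print(he, xa)
```

Without the factor of 2, each ReLU layer halves the variance, so after 30 layers the signal is smaller by roughly $2^{15}$.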

Canonical Examples

Example

Two-layer network for XOR

The XOR function $y = x_1 \oplus x_2$ is not linearly separable. A two-layer network with 2 hidden neurons can compute it exactly:

  • Hidden layer: $h_1 = \text{ReLU}(x_1 + x_2)$, $h_2 = \text{ReLU}(x_1 + x_2 - 1)$
  • Output: $y = h_1 - 2h_2$

The hidden layer creates a new representation where XOR becomes linearly separable. This is the fundamental power of neural networks: learning useful intermediate representations.
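Such a construction can be verified exhaustively; the sketch below uses one standard choice of weights ($h_1 = \text{ReLU}(x_1 + x_2)$, $h_2 = \text{ReLU}(x_1 + x_2 - 1)$, $y = h_1 - 2h_2$), one of many that work:

```python
import numpy as np

def relu(a):
    return np.maximum(0.0, a)

# h = ReLU(W1 @ x + b1), y = w2 . h
W1 = np.array([[1.0, 1.0],
               [1.0, 1.0]])
b1 = np.array([0.0, -1.0])
w2 = np.array([1.0, -2.0])

truth = {}
for x1 in (0, 1):
    for x2 in (0, 1):
        h = relu(W1 @ np.array([x1, x2], dtype=float) + b1)
        truth[(x1, x2)] = int(round(w2 @ h))

print(truth)  # {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}
```

In the hidden coordinates $(h_1, h_2)$ the four inputs become linearly separable, which is exactly what the output layer exploits.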

Example

Vanishing gradient with sigmoid

Consider a 10-layer network with sigmoid activations. At each layer, the gradient is multiplied by $\sigma'(a) \leq 0.25$. Passing backward through the 9 hidden layers, the gradient at the first layer is at most $(0.25)^9 \approx 3.8 \times 10^{-6}$ times the gradient at the output. With ReLU and proper initialization, the gradient passes through unchanged (for active neurons), preserving the learning signal across all layers.
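A quick simulation makes the contrast concrete: run the backward recurrence $\delta_\ell = (W_{\ell+1}^\top \delta_{\ell+1}) \odot \sigma'(a_\ell)$ through 10 random layers with sigmoid versus ReLU activations (a sketch; the width and weight scale are arbitrary assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
d, depth = 100, 10

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def grad_norm_at_input(act, dact):
    """Forward through `depth` random layers, then apply the backward
    recurrence delta_l = (W_{l+1}^T delta_{l+1}) * sigma'(a_l)."""
    z = rng.standard_normal(d)
    layers = []
    for _ in range(depth):
        W = rng.standard_normal((d, d)) / np.sqrt(d)
        a = W @ z
        layers.append((W, a))
        z = act(a)
    delta = rng.standard_normal(d)     # stand-in for dL/da at the output
    for W, a in reversed(layers):
        delta = (W.T @ delta) * dact(a)
    return np.linalg.norm(delta)

n_sig = grad_norm_at_input(sigmoid, lambda a: sigmoid(a) * (1 - sigmoid(a)))
n_relu = grad_norm_at_input(lambda a: np.maximum(0.0, a),
                            lambda a: (a > 0).astype(float))
print(n_sig, n_relu)  # the sigmoid gradient norm is orders of magnitude smaller
```

The ReLU gradient also shrinks somewhat (inactive neurons zero out entries), but nothing like the exponential decay the sigmoid's $\sigma' \leq 0.25$ factor imposes.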

Common Confusions

Watch Out

Universal approximation does not mean easy to train

The universal approximation theorem is an existence result: a wide enough network can approximate any function. But finding the right weights via gradient descent on a non-convex loss landscape is a completely separate problem. The loss may have many local minima, saddle points, and flat regions. In practice, overparameterized networks are easier to optimize (the loss landscape becomes more benign), but this is an empirical observation, not a consequence of the approximation theorem.

Watch Out

Backpropagation is not a learning algorithm

Backpropagation computes gradients. That is all it does. The learning algorithm is gradient descent (or Adam, or SGD with momentum, etc.), which uses the gradients that backpropagation provides. Saying "we train with backpropagation" is technically imprecise -- you train with gradient descent, and backpropagation is the subroutine that makes gradient computation efficient.

Watch Out

Deep networks are not just wide networks stacked

Adding depth is qualitatively different from adding width. Depth enables compositional representations: early layers learn simple features, later layers compose them into complex features. Width at a single layer enables representing more features at one level of abstraction. There are functions (like parity) that require exponential width with bounded depth but only polynomial size with sufficient depth.

Summary

  • A feedforward network is a composition of affine transforms + nonlinearities
  • Universal approximation: one hidden layer suffices for approximation, but says nothing about learning or generalization
  • Backpropagation = reverse-mode autodiff = chain rule applied backward through the computational graph
  • Backprop computes the full gradient in $O(n)$ time, same as a forward pass
  • Vanishing gradients: sigmoid/tanh derivatives shrink the gradient exponentially with depth
  • ReLU avoids saturation for positive inputs; it is the default activation
  • Xavier initialization preserves variance for linear/tanh; He initialization corrects for ReLU zeroing half the activations

Exercises

ExerciseCore

Problem

Consider a 3-layer network with input $x \in \mathbb{R}^2$, hidden layer sizes $d_1 = 4$, $d_2 = 3$, and scalar output. How many parameters (weights and biases) does this network have?

ExerciseAdvanced

Problem

Derive the backpropagation recurrence. Starting from the loss $\mathcal{L} = L(z_L, y)$ and the layer equations $a_\ell = W_\ell z_{\ell-1} + b_\ell$, $z_\ell = \sigma(a_\ell)$, show that:

$$\frac{\partial \mathcal{L}}{\partial a_\ell} = \left(W_{\ell+1}^\top \frac{\partial \mathcal{L}}{\partial a_{\ell+1}}\right) \odot \sigma'(a_\ell)$$

ExerciseResearch

Problem

The universal approximation theorem guarantees a single hidden layer suffices for approximation. Why, then, do we use deep networks? Give a concrete example of a function family where depth gives an exponential advantage over width.

References

Canonical:

  • Cybenko, "Approximation by Superpositions of a Sigmoidal Function" (1989) -- universal approximation for sigmoid
  • Hornik, Stinchcombe, White, "Multilayer Feedforward Networks are Universal Approximators" (1989) -- general version
  • Rumelhart, Hinton, Williams, "Learning Representations by Back-Propagating Errors" (1986) -- backpropagation
  • Glorot & Bengio, "Understanding the Difficulty of Training Deep Feedforward Neural Networks" (2010) -- Xavier initialization

Current:

  • He, Zhang, Ren, Sun, "Delving Deep into Rectifiers" (2015) -- He initialization
  • Goodfellow, Bengio, Courville, Deep Learning (2016), Chapters 6-8


Last reviewed: April 2026
