ML Methods
Feedforward Networks and Backpropagation
Feedforward neural networks as compositions of affine transforms and nonlinearities, the universal approximation theorem, and backpropagation as reverse-mode automatic differentiation on the computational graph.
Why This Matters
Feedforward networks are the foundation of modern deep learning. The history of deep learning traces back through multiple waves of enthusiasm and disillusionment, but every architecture -- convolutional networks, transformers, residual networks -- is built from the same primitive: layers of affine transformations followed by nonlinear activation functions. Understanding this primitive deeply -- what it can represent, how to train it, and what goes wrong -- is essential before studying any specialized architecture.
Backpropagation is the algorithm that makes training possible. It is not a learning algorithm itself -- it is an efficient method for computing gradients. Without backpropagation, training a network with $n$ parameters would require $O(n)$ forward passes to compute the gradient by finite differences (one per parameter). Backpropagation computes the full gradient in a single backward pass, making deep learning computationally feasible.
Mental Model
A feedforward network is a pipeline: input flows through a series of layers, each applying a linear transformation (matrix multiply + bias) followed by a nonlinearity (ReLU, sigmoid, tanh). The output of each layer is the input to the next. Training means adjusting the matrices and biases so that the final output matches the desired labels.
Backpropagation computes how much each weight contributed to the error by propagating the error signal backward through the network, using the chain rule at each layer. It is reverse-mode automatic differentiation applied to the computational graph of the network.
Feedforward Architecture
Feedforward Neural Network
An $L$-layer feedforward neural network computes:

$$f(x) = W_L\,\sigma\big(W_{L-1}\cdots\sigma(W_1 x + b_1)\cdots + b_{L-1}\big) + b_L$$

where $W_\ell \in \mathbb{R}^{n_\ell \times n_{\ell-1}}$ are weight matrices, $b_\ell \in \mathbb{R}^{n_\ell}$ are bias vectors, and $\sigma$ is a nonlinear activation function applied elementwise. The parameters are $\theta = \{(W_\ell, b_\ell)\}_{\ell=1}^{L}$.

Layer by layer, the computation is:

$$z^{(\ell)} = W_\ell\, a^{(\ell-1)} + b_\ell, \qquad a^{(\ell)} = \sigma(z^{(\ell)}), \qquad \ell = 1, \dots, L$$

where $a^{(0)} = x$ is the input and $a^{(L)}$ is the output (the final layer may omit the nonlinearity for regression, or use softmax for classification).
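The layer-by-layer recurrence can be sketched in a few lines of NumPy. This is a minimal illustration, not a training-ready implementation; the layer sizes, the ReLU activation, and the linear output layer are arbitrary choices for the sketch:

```python
import numpy as np

def forward(x, weights, biases):
    """Compute a^(0) = x, then z^(l) = W_l a^(l-1) + b_l, a^(l) = relu(z^(l)).
    The final layer is left linear (e.g. for regression)."""
    a = x
    for l, (W, b) in enumerate(zip(weights, biases)):
        z = W @ a + b
        a = np.maximum(0, z) if l < len(weights) - 1 else z  # no nonlinearity at output
    return a

rng = np.random.default_rng(0)
sizes = [4, 8, 8, 1]  # n_0 (input dim) through n_L (output dim)
weights = [rng.standard_normal((m, n)) * 0.5 for n, m in zip(sizes, sizes[1:])]
biases = [np.zeros(m) for m in sizes[1:]]
y = forward(rng.standard_normal(4), weights, biases)
print(y.shape)  # (1,)
```

Note that each $W_\ell$ has shape $n_\ell \times n_{\ell-1}$, so the composition is well-defined: the output of one layer matches the input dimension of the next.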
Width and Depth
The width of layer $\ell$ is $n_\ell$ (the number of neurons). The depth of the network is $L$ (the number of layers). A "deep" network has many layers; a "wide" network has many neurons per layer. Depth enables hierarchical feature composition; width enables representing more features at each level.
Universal Approximation
Universal Approximation Theorem
Statement
For any continuous function $f$ on a compact set $K \subset \mathbb{R}^d$ and any $\varepsilon > 0$, there exists a single-hidden-layer network $\hat{f}(x) = \sum_{i=1}^{N} c_i\,\sigma(w_i^\top x + \beta_i)$ such that:

$$\sup_{x \in K} \big|f(x) - \hat{f}(x)\big| < \varepsilon$$
That is, feedforward networks with one hidden layer are dense in the space of continuous functions on compact sets, in the supremum norm.
Intuition
A single hidden layer with enough neurons can approximate any continuous function to arbitrary accuracy. Each neuron carves out a half-space in input space. With enough half-spaces, you can build any shape. Think of it as approximation by a very flexible linear combination of basis functions.
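One way to see this numerically: fix random hidden-layer weights (random half-spaces) and fit only the output-layer coefficients by least squares. As the number of neurons grows, the fit to a smooth target tightens. A sketch under assumed choices (1-D target $\sin x$, ReLU features, standard-normal weights):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-np.pi, np.pi, 200)
target = np.sin(x)

def fit_error(n_hidden):
    # Each hidden neuron carves a half-space: relu(w*x + b)
    w = rng.standard_normal(n_hidden)
    b = rng.uniform(-np.pi, np.pi, n_hidden)
    H = np.maximum(0, np.outer(x, w) + b)       # (200, n_hidden) hidden features
    H = np.column_stack([H, np.ones_like(x)])   # output bias column
    c, *_ = np.linalg.lstsq(H, target, rcond=None)  # fit output weights only
    return np.max(np.abs(H @ c - target))       # sup-norm error on the grid

errs = {n: fit_error(n) for n in [2, 10, 100]}
print(errs)  # error drops as width grows
```

This is weaker than the theorem (the theorem optimizes all weights), but it shows the mechanism: a rich enough dictionary of half-space features spans the target.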
Proof Sketch
The original proof by Cybenko (1989) for sigmoid activations uses the Hahn-Banach theorem and the Riesz representation theorem. The idea: suppose the set of functions representable by the network is not dense. Then by Hahn-Banach, there exists a nonzero bounded linear functional that vanishes on all such functions. By Riesz representation, this functional corresponds to a signed measure $\mu$. Show that $\int_K \sigma(w^\top x + \beta)\, d\mu(x) = 0$ for all $w, \beta$ implies $\mu = 0$, giving a contradiction.
Why It Matters
Universal approximation tells you that neural networks are expressive enough -- the hypothesis class is rich enough to approximate any target. But it says absolutely nothing about:
- How to find the weights: the optimization landscape is non-convex
- How many neurons are needed: the width may need to be exponentially large in the input dimension $d$
- Whether the learned function generalizes: expressiveness is about approximation error, not estimation error
This is an existence theorem, not a constructive one. It guarantees the capacity is there but gives no guidance on learning.
Failure Mode
The theorem requires arbitrary width. For fixed-width networks, depth matters: there exist functions that require exponential width with bounded depth but only polynomial width with sufficient depth. This is the theoretical motivation for deep (many-layer) networks over wide shallow ones.
Backpropagation
Backpropagation computes the gradient of the loss with respect to all parameters. It is the chain rule applied systematically from output to input.
Forward pass: compute and store all pre-activations $z^{(\ell)}$ and post-activations $a^{(\ell)}$ for $\ell = 1, \dots, L$.

Backward pass: starting from the loss gradient $\delta^{(L)} = \nabla_{a^{(L)}} \mathcal{L} \odot \sigma'(z^{(L)})$, propagate backward:

$$\delta^{(\ell)} = \big(W_{\ell+1}^\top \delta^{(\ell+1)}\big) \odot \sigma'(z^{(\ell)})$$

where $\odot$ denotes elementwise multiplication and $\sigma'(z^{(\ell)})$ is the derivative of the activation at the pre-activation values.

The parameter gradients at layer $\ell$ are:

$$\nabla_{W_\ell} \mathcal{L} = \delta^{(\ell)}\big(a^{(\ell-1)}\big)^\top, \qquad \nabla_{b_\ell} \mathcal{L} = \delta^{(\ell)}$$
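The forward/backward passes can be implemented directly from these equations and validated against a finite-difference approximation. A minimal sketch assuming ReLU hidden layers, a linear output layer, and squared-error loss:

```python
import numpy as np

def relu(z):  return np.maximum(0, z)
def drelu(z): return (z > 0).astype(float)

def forward(x, Ws, bs):
    """Store pre-activations z^(l) and post-activations a^(l) for the backward pass."""
    a, zs, acts = x, [], [x]
    for l, (W, b) in enumerate(zip(Ws, bs)):
        z = W @ a + b
        a = relu(z) if l < len(Ws) - 1 else z  # linear output layer
        zs.append(z); acts.append(a)
    return zs, acts

def backprop(x, y, Ws, bs):
    """Gradients of L = 0.5*||a^(L) - y||^2 via the delta recurrence."""
    zs, acts = forward(x, Ws, bs)
    delta = acts[-1] - y                       # dL/dz at the linear output layer
    gWs, gbs = [None] * len(Ws), [None] * len(Ws)
    for l in reversed(range(len(Ws))):
        gWs[l] = np.outer(delta, acts[l])      # dL/dW_l = delta (a^(l-1))^T
        gbs[l] = delta                         # dL/db_l = delta
        if l > 0:
            delta = (Ws[l].T @ delta) * drelu(zs[l - 1])  # backward recurrence
    return gWs, gbs

# Finite-difference check on one weight entry
rng = np.random.default_rng(1)
sizes = [3, 5, 2]
Ws = [rng.standard_normal((m, n)) for n, m in zip(sizes, sizes[1:])]
bs = [rng.standard_normal(m) for m in sizes[1:]]
x, y = rng.standard_normal(3), rng.standard_normal(2)

gWs, _ = backprop(x, y, Ws, bs)
def loss():
    return 0.5 * np.sum((forward(x, Ws, bs)[1][-1] - y) ** 2)
eps = 1e-6
Ws[0][0, 0] += eps; lp = loss()
Ws[0][0, 0] -= 2 * eps; lm = loss()
Ws[0][0, 0] += eps
fd = (lp - lm) / (2 * eps)
print(abs(fd - gWs[0][0, 0]))  # backprop gradient matches the finite difference
```

A gradient check like this is the standard sanity test for any hand-written backward pass: every analytic partial derivative should agree with its central-difference estimate to several decimal places.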
Backpropagation Complexity
Statement
For a feedforward network with $n$ total parameters, backpropagation computes the full gradient in $O(n)$ time -- the same order as a single forward pass. Computing each partial derivative individually by finite differences would require $O(n)$ forward passes, giving $O(n^2)$ total time.
Intuition
The key insight is that backpropagation reuses intermediate computations. The error signal $\delta^{(\ell)}$ at layer $\ell$ is computed from $\delta^{(\ell+1)}$ by a single matrix-vector multiply. Each layer adds $O(n_\ell\, n_{\ell+1})$ work -- the same as the forward pass through that layer. Since the forward pass is $O\big(\sum_\ell n_\ell\, n_{\ell-1}\big)$ (summing over all layer sizes), the backward pass is also $O(n)$.
This is not a coincidence. Backpropagation is reverse-mode automatic differentiation: it computes all partial derivatives simultaneously by traversing the computational graph once in reverse.
Proof Sketch
The forward pass through layer $\ell$ computes $z^{(\ell)} = W_\ell a^{(\ell-1)} + b_\ell$ (cost $O(n_\ell\, n_{\ell-1})$) and $a^{(\ell)} = \sigma(z^{(\ell)})$ (cost $O(n_\ell)$). Total forward cost: $O\big(\sum_\ell n_\ell\, n_{\ell-1}\big) = O(n)$.

The backward pass at layer $\ell$ computes $\delta^{(\ell)} = (W_{\ell+1}^\top \delta^{(\ell+1)}) \odot \sigma'(z^{(\ell)})$ (cost $O(n_\ell\, n_{\ell+1})$) and $\nabla_{W_\ell} \mathcal{L} = \delta^{(\ell)} (a^{(\ell-1)})^\top$ (cost $O(n_\ell\, n_{\ell-1})$). Total backward cost: $O(n)$.
Why It Matters
This linear-time complexity is what makes deep learning possible. A modern language model has billions of parameters. Computing the gradient in $O(n)$ rather than $O(n^2)$ is the difference between training in weeks and never training at all.
Failure Mode
Backpropagation requires storing all intermediate activations for the backward pass. This is the memory bottleneck of training: memory scales linearly with depth and batch size. Gradient checkpointing trades computation for memory by recomputing activations during the backward pass instead of storing them.
Vanishing and Exploding Gradients
The backward recurrence multiplies by $W_{\ell+1}^\top \operatorname{diag}\big(\sigma'(z^{(\ell)})\big)$ at each layer. Over $L$ layers:

- If $\big\|W_{\ell+1}^\top \operatorname{diag}(\sigma'(z^{(\ell)}))\big\| < 1$ at most layers, the product shrinks exponentially: vanishing gradients. Early layers learn extremely slowly.
- If $\big\|W_{\ell+1}^\top \operatorname{diag}(\sigma'(z^{(\ell)}))\big\| > 1$ at most layers, the product grows exponentially: exploding gradients. Training becomes unstable.
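The exponential effect is easy to demonstrate by iterating the backward multiplication $\delta \leftarrow W^\top \delta$ with weights of different scales. A toy illustration (linear layers only, so $\sigma' = 1$; the 0.5 and 2.0 scale factors, width 64, and depth 50 are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n, L = 64, 50

def backward_norm(scale):
    """Norm of delta after L backward steps delta <- W^T delta."""
    delta = np.ones(n) / np.sqrt(n)  # unit-norm error signal at the output
    for _ in range(L):
        # Gaussian W scaled so a backward step multiplies norms by ~`scale`
        W = scale * rng.standard_normal((n, n)) / np.sqrt(n)
        delta = W.T @ delta
    return np.linalg.norm(delta)

v_small = backward_norm(0.5)  # per-layer factor < 1: vanishes
v_large = backward_norm(2.0)  # per-layer factor > 1: explodes
print(v_small, v_large)
```

The two printed norms differ by dozens of orders of magnitude, which is why weight scale (initialization) and activation choice matter so much in deep networks.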
Activation Functions
ReLU
The Rectified Linear Unit is $\operatorname{ReLU}(z) = \max(0, z)$. Its derivative is $1$ for $z > 0$ and $0$ for $z < 0$. ReLU does not saturate for positive inputs, which mitigates vanishing gradients. However, neurons with $z < 0$ have zero gradient ("dying ReLU").
Sigmoid: $\sigma(z) = \frac{1}{1 + e^{-z}}$, with $\sigma'(z) = \sigma(z)\,(1 - \sigma(z))$. The maximum derivative is $0.25$, so gradients shrink by at least a factor of 4 per layer. This causes severe vanishing gradients in deep networks.
Tanh: $\tanh(z) = \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}}$, with $\tanh'(z) = 1 - \tanh^2(z)$. Better than sigmoid (centered at zero, larger maximum derivative) but still saturates for large $|z|$.
ReLU is the default activation for hidden layers in modern networks because it avoids the saturation problem entirely for positive inputs.
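The derivative bounds above are easy to verify numerically. A quick check of the three activation derivatives on a grid:

```python
import numpy as np

def sigmoid(z):  return 1 / (1 + np.exp(-z))
def dsigmoid(z): return sigmoid(z) * (1 - sigmoid(z))
def dtanh(z):    return 1 - np.tanh(z) ** 2
def drelu(z):    return (z > 0).astype(float)

z = np.linspace(-10, 10, 10001)  # grid includes z = 0
max_dsig = dsigmoid(z).max()
max_dtanh = dtanh(z).max()
max_drelu = drelu(z).max()
print(max_dsig)   # 0.25, attained at z = 0
print(max_dtanh)  # 1.0, attained at z = 0
print(max_drelu)  # 1.0, for every z > 0
```

Note the contrast: sigmoid's derivative peaks at 0.25 and decays to zero in both tails, while ReLU's derivative is exactly 1 on the entire positive half-line.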
Weight Initialization
Proper initialization ensures that the variance of activations and gradients remains roughly constant across layers at the start of training.
Xavier (Glorot) Initialization
For a layer with $n_{\text{in}}$ inputs and $n_{\text{out}}$ outputs, Xavier initialization sets:

$$W_{ij} \sim \mathcal{N}\!\left(0,\ \frac{2}{n_{\text{in}} + n_{\text{out}}}\right)$$

This is derived by requiring $\operatorname{Var}(a^{(\ell)}) \approx \operatorname{Var}(a^{(\ell-1)})$ for linear activations ($\sigma(z) = z$). It works well with tanh and sigmoid activations.
He Initialization
For layers using ReLU, He initialization sets:

$$W_{ij} \sim \mathcal{N}\!\left(0,\ \frac{2}{n_{\text{in}}}\right)$$

The factor of 2 accounts for the fact that ReLU zeros out half of the activations (those with $z < 0$), which halves the variance. Without this correction, activations shrink toward zero in deeper layers.
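The effect of the factor of 2 can be observed directly by pushing an input through a deep ReLU stack under both variance choices (for square layers, Xavier's $2/(n_{\text{in}} + n_{\text{out}})$ reduces to $1/n$). A sketch with assumed width 256 and depth 20:

```python
import numpy as np

rng = np.random.default_rng(0)
n, L = 256, 20

def final_std(var_scale):
    """Activation std after L ReLU layers with W_ij ~ N(0, var_scale / n)."""
    a = rng.standard_normal(n)  # unit-variance input
    for _ in range(L):
        W = rng.standard_normal((n, n)) * np.sqrt(var_scale / n)
        a = np.maximum(0, W @ a)  # each ReLU halves the variance
    return a.std()

std_xavier = final_std(1.0)  # 1/n: variance halves per layer, ~2^-20 overall
std_he = final_std(2.0)      # 2/n: the factor of 2 cancels the ReLU halving
print(std_xavier, std_he)
```

With variance $1/n$ the activations collapse toward zero after 20 layers; with $2/n$ their scale stays roughly constant, which is exactly the property He initialization is designed to preserve.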
Canonical Examples
Two-layer network for XOR
The XOR function is not linearly separable. A two-layer network with 2 hidden neurons can learn it exactly:

- Hidden layer: $h_1 = \operatorname{ReLU}(x_1 + x_2)$, $h_2 = \operatorname{ReLU}(x_1 + x_2 - 1)$
- Output: $y = h_1 - 2h_2$
The hidden layer creates a new representation where XOR becomes linearly separable. This is the fundamental power of neural networks: learning useful intermediate representations.
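One explicit construction (the one used in Goodfellow et al.'s textbook) sets $h_1 = \operatorname{ReLU}(x_1 + x_2)$, $h_2 = \operatorname{ReLU}(x_1 + x_2 - 1)$, and $y = h_1 - 2h_2$. Checking it on all four inputs:

```python
import numpy as np

# Hidden layer: h = ReLU(W1 x + b1); output: y = w2 . h
W1 = np.array([[1.0, 1.0],
               [1.0, 1.0]])
b1 = np.array([0.0, -1.0])
w2 = np.array([1.0, -2.0])

outputs = {}
for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    h = np.maximum(0, W1 @ np.array(x, dtype=float) + b1)
    outputs[x] = float(w2 @ h)
print(outputs)  # {(0,0): 0.0, (0,1): 1.0, (1,0): 1.0, (1,1): 0.0}
```

In the hidden representation $(h_1, h_2)$, the four inputs map to $(0,0)$, $(1,0)$, $(1,0)$, $(2,1)$ -- now a single linear output separates them, which no linear map of the raw inputs could do.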
Vanishing gradient with sigmoid
Consider a 10-layer network with sigmoid activations. At each layer, the gradient is multiplied by $\sigma'(z) \le 0.25$. After 10 layers, the gradient at the first layer is at most $0.25^{10} \approx 10^{-6}$ times the gradient at the output. With ReLU and proper initialization, the gradient passes through unchanged (for active neurons), preserving the learning signal across all layers.
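The attenuation can be traced through a bare chain of sigmoids (no weights, i.e. all weights fixed at 1, a deliberately simplified setting): the gradient is the product of the per-layer derivatives, by the chain rule.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# y = sigmoid(sigmoid(...sigmoid(x)...)), 10 layers deep, starting at x = 0.
# dy/dx is the product of sigma'(z) = sigma(z)(1 - sigma(z)) <= 0.25 per layer.
x, grad = 0.0, 1.0
for _ in range(10):
    y = sigmoid(x)
    grad *= y * (1 - y)  # accumulate one chain-rule factor
    x = y
print(grad)  # below the 0.25**10 ~ 9.5e-7 bound
```

Running this gives a gradient on the order of $10^{-7}$ to $10^{-6}$: the learning signal reaching the first layer is about a millionth of the signal at the output.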
Common Confusions
Universal approximation does not mean easy to train
The universal approximation theorem is an existence result: a wide enough network can approximate any function. But finding the right weights via gradient descent on a non-convex loss landscape is a completely separate problem. The loss may have many local minima, saddle points, and flat regions. In practice, overparameterized networks are easier to optimize (the loss landscape becomes more benign), but this is an empirical observation, not a consequence of the approximation theorem.
Backpropagation is not a learning algorithm
Backpropagation computes gradients. That is all it does. The learning algorithm is gradient descent (or Adam, or SGD with momentum, etc.), which uses the gradients that backpropagation provides. Saying "we train with backpropagation" is technically imprecise -- you train with gradient descent, and backpropagation is the subroutine that makes gradient computation efficient.
Deep networks are not just wide networks stacked
Adding depth is qualitatively different from adding width. Depth enables compositional representations: early layers learn simple features, later layers compose them into complex features. Width at a single layer enables representing more features at one level of abstraction. There are functions (like parity) that require exponential width with bounded depth but only polynomial size with sufficient depth.
Summary
- A feedforward network is a composition of affine transforms + nonlinearities
- Universal approximation: one hidden layer suffices for approximation, but says nothing about learning or generalization
- Backpropagation = reverse-mode autodiff = chain rule applied backward through the computational graph
- Backprop computes the full gradient in $O(n)$ time, same as a forward pass
- Vanishing gradients: sigmoid/tanh derivatives shrink the gradient exponentially with depth
- ReLU avoids saturation for positive inputs; it is the default activation
- Xavier initialization preserves variance for linear/tanh; He initialization corrects for ReLU zeroing half the activations
Exercises
Problem
Consider a 3-layer network with input , hidden layer sizes , and scalar output. How many parameters (weights and biases) does this network have?
Problem
Derive the backpropagation recurrence. Starting from the loss $\mathcal{L}$ and the layer equations $z^{(\ell)} = W_\ell a^{(\ell-1)} + b_\ell$, $a^{(\ell)} = \sigma(z^{(\ell)})$, show that:

$$\delta^{(\ell)} = \big(W_{\ell+1}^\top \delta^{(\ell+1)}\big) \odot \sigma'(z^{(\ell)})$$
Problem
The universal approximation theorem guarantees a single hidden layer suffices for approximation. Why, then, do we use deep networks? Give a concrete example of a function family where depth gives an exponential advantage over width.
References
Canonical:
- Cybenko, "Approximation by Superpositions of a Sigmoidal Function" (1989) -- universal approximation for sigmoid
- Hornik, Stinchcombe, White, "Multilayer Feedforward Networks are Universal Approximators" (1989) -- general version
- Rumelhart, Hinton, Williams, "Learning Representations by Back-Propagating Errors" (1986) -- backpropagation
- Glorot & Bengio, "Understanding the Difficulty of Training Deep Feedforward Neural Networks" (2010) -- Xavier initialization
Current:
- He, Zhang, Ren, Sun, "Delving Deep into Rectifiers" (2015) -- He initialization
- Goodfellow, Bengio, Courville, Deep Learning (2016), Chapters 6-8
Next Topics
Building on feedforward fundamentals:
- Convolutional neural networks: exploiting spatial structure with weight sharing
- Recurrent neural networks: extending feedforward networks to sequences
- Batch normalization: stabilizing training by normalizing intermediate activations
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Differentiation in R^n (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
- Matrix Calculus (Layer 1)
- The Jacobian Matrix (Layer 0A)
- The Hessian Matrix (Layer 0A)
- Activation Functions (Layer 1)
- Convex Optimization Basics (Layer 1)
- Matrix Operations and Properties (Layer 0A)
Builds on This
- Activation Checkpointing (Layer 3)
- Adversarial Machine Learning (Layer 4)
- Autoencoders (Layer 2)
- Batch Normalization (Layer 2)
- Bayesian Neural Networks (Layer 3)
- Contrastive Learning (Layer 3)
- Convolutional Neural Networks (Layer 3)
- Dropout (Layer 2)
- Fine-Tuning and Adaptation (Layer 3)
- Generative Adversarial Networks (Layer 3)
- Gradient Flow and Vanishing Gradients (Layer 2)
- Knowledge Distillation (Layer 3)
- Meta-Learning (Layer 3)
- Mixture Density Networks (Layer 3)
- Model Compression and Pruning (Layer 3)
- Neural Architecture Search (Layer 4)
- Occupancy Networks and Neural Fields (Layer 4)
- Optimal Brain Surgery and Pruning Theory (Layer 3)
- Recurrent Neural Networks (Layer 3)
- Skip Connections and ResNets (Layer 2)
- Transfer Learning (Layer 3)
- Transformer Architecture (Layer 4)
- Universal Approximation Theorem (Layer 2)
- Weight Initialization (Layer 2)