ML Methods
Skip Connections and ResNets
Residual connections let gradients flow through identity paths, enabling training of very deep networks. ResNets learn residual functions F(x) = H(x) - x, which is easier than learning H(x) directly.
Why This Matters
Before ResNets (He et al. 2015), training networks deeper than about 20 layers was unreliable. Adding more layers to a plain network actually increased training error, not just test error. This was not overfitting; it was an optimization failure.
The fix was simple: add the input of a block directly to its output. This single architectural change enabled training of networks with 100, 1000, and even 1200+ layers. ResNet won the 2015 ImageNet competition by a wide margin and became the default architecture for deep learning.
The Residual Block
Residual Block
A residual block computes:

y = F(x) + x

where F(x) is a sequence of layers (typically conv-BN-ReLU-conv-BN) and x is the input. The + x term is the skip connection (or shortcut connection). The network learns the residual F(x) = H(x) - x rather than the desired mapping H(x) directly.
When dimensions change (e.g., spatial downsampling or channel expansion), a linear projection is applied to the shortcut: y = F(x) + W_s x.
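As a concrete sketch, the block above can be written in a few lines of NumPy, substituting a toy linear-ReLU-linear stack for the conv-BN-ReLU-conv-BN sequence (all shapes and names here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, W1, W2, W_s=None):
    """y = F(x) + x, with an optional linear projection W_s on the shortcut.

    F is a toy linear-ReLU-linear stack standing in for conv-BN-ReLU-conv-BN.
    """
    F = relu(x @ W1) @ W2
    shortcut = x if W_s is None else x @ W_s
    return F + shortcut

x = rng.normal(size=(4, 8))

# Matching dimensions: identity shortcut.
W1 = 0.1 * rng.normal(size=(8, 8))
W2 = 0.1 * rng.normal(size=(8, 8))
y = residual_block(x, W1, W2)

# Dimension change (e.g. channel expansion 8 -> 16): projected shortcut.
W1p = 0.1 * rng.normal(size=(8, 16))
W2p = 0.1 * rng.normal(size=(16, 16))
W_s = 0.1 * rng.normal(size=(8, 16))
y2 = residual_block(x, W1p, W2p, W_s)

print(y.shape, y2.shape)   # (4, 8) (4, 16)
```

Note that with all weights at zero the block reduces exactly to the identity, which is the "good default" discussed below.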
Why Residual Learning Works
The core insight: if the identity mapping is close to optimal for some layer, then pushing F(x) toward zero is easier than learning an identity mapping H(x) = x from scratch with a stack of nonlinear layers. Residual learning biases the network toward identity-like functions, which provides a good default for deep layers.
Gradient Flow Through Residual Connections
Statement
Consider L residual blocks in sequence. Let x_0 be the input and x_{l+1} = x_l + F_l(x_l) for l = 0, ..., L-1. The gradient of the loss ℒ with respect to x_0 satisfies:

∂ℒ/∂x_0 = ∂ℒ/∂x_L · ∏_{l=0}^{L-1} (I + ∂F_l/∂x_l)

Expanding the product, the identity term I is always present: a direct path that passes through no weight layers.
Intuition
In a plain network, gradients must pass through every weight matrix, and the product of many matrices can vanish or explode. In a ResNet, each factor of the product is I + ∂F_l/∂x_l, so the expansion always contains the term I (the identity), which corresponds to the gradient flowing directly through all skip connections without attenuation.
Proof Sketch
By the chain rule, ∂x_{l+1}/∂x_l = I + ∂F_l/∂x_l. Compose these Jacobians from layer L down to layer 0. Expanding the product of L such terms, the identity I always survives as one term in the sum, giving an unattenuated gradient path.
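The surviving identity term can be checked numerically. The sketch below (illustrative dimensions and scales; small random Jacobians standing in for near-saturated layers) multiplies 50 Jacobians with and without the added identity:

```python
import numpy as np

rng = np.random.default_rng(1)
d, L = 16, 50

# Small random Jacobians, as would arise from 50 near-saturated layers.
Js = [0.01 * rng.normal(size=(d, d)) for _ in range(L)]

plain = np.eye(d)
resid = np.eye(d)
for J in Js:
    plain = J @ plain                # plain net: product of bare Jacobians
    resid = (np.eye(d) + J) @ resid  # ResNet: each factor is I + J

plain_norm = np.linalg.norm(plain)
resid_norm = np.linalg.norm(resid)
print(plain_norm)  # astronomically small: the gradient has vanished
print(resid_norm)  # order one: the identity path survives
```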
Why It Matters
This explains why ResNets train successfully at extreme depths. The gradient does not need to survive multiplication by all L Jacobian matrices: there is always an identity shortcut. This is the precise mechanism by which skip connections solve the vanishing gradient problem.
Failure Mode
Skip connections do not solve all optimization problems. If F has very large Jacobians, gradients can still explode. Batch normalization and careful initialization remain necessary. Skip connections also do not address the approximation question of whether depth actually helps for a given task.
Connection to Continuous Dynamics
ResNet as Euler Discretization
Statement
The residual update x_{l+1} = x_l + F(x_l) is the forward Euler discretization of the ODE:

dx/dt = F(x(t))

with step size Δt = 1. In the limit of infinitely many layers with infinitesimal step size, a ResNet becomes a neural ODE.
Intuition
Each residual block is a small perturbation of the identity. Stacking many such blocks traces out a continuous trajectory through feature space. This perspective unifies ResNets with dynamical systems theory and led to Neural ODEs (Chen et al. 2018), which parameterize the dynamics directly.
Proof Sketch
Replace the discrete index l with continuous time t, and the update x_{l+1} = x_l + F(x_l) with the derivative dx/dt = F(x). The forward Euler method discretizes this ODE as x(t + Δt) = x(t) + Δt · F(x(t)). Setting Δt = 1 recovers the residual block.
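A minimal numerical check of this correspondence, using the toy dynamics dx/dt = -0.5x (chosen because the exact solution x(T) = exp(-0.5T)·x(0) is known): a residual stack with step h = T/num_blocks approaches the ODE solution as blocks are added.

```python
import numpy as np

def F(x):
    # Toy dynamics dx/dt = -0.5 x, exact solution x(T) = exp(-0.5 T) x(0).
    return -0.5 * x

def resnet_euler(x0, T, num_blocks):
    """A stack of residual blocks x <- x + h * F(x) is forward Euler with h = T / num_blocks."""
    h = T / num_blocks
    x = x0
    for _ in range(num_blocks):
        x = x + h * F(x)   # one residual block
    return x

x0, T = 1.0, 2.0
exact = np.exp(-0.5 * T) * x0

for n in (4, 16, 64):
    print(n, abs(resnet_euler(x0, T, n) - exact))  # error shrinks as blocks are added
```

The first-order character mentioned under Failure Mode is visible here: halving the step size roughly halves the error.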
Why It Matters
This connection brings ODE solvers, adjoint methods for memory-efficient backprop, and stability theory into the neural network toolkit. It also suggests that ResNets with many layers are implicitly performing numerical integration.
Failure Mode
The Euler discretization is first-order and can be unstable for stiff dynamics. In practice, ResNets do not use adaptive step sizes or higher-order integration methods. The ODE perspective is most useful as a conceptual framework, not a literal description of what finite-depth ResNets compute.
DenseNet: Dense Connections
DenseNet (Huang et al. 2017) extends the skip connection idea: instead of adding only the immediate input, each layer receives the concatenation of all preceding feature maps:

x_l = H_l([x_0, x_1, ..., x_{l-1}])

where [x_0, x_1, ..., x_{l-1}] denotes concatenation along the channel axis. This gives each layer direct access to all earlier features, encouraging feature reuse and reducing the total number of parameters needed.
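A shape-only NumPy sketch of this connectivity pattern (random weights; a growth rate of 4 new channels per layer, kept small here for readability):

```python
import numpy as np

rng = np.random.default_rng(2)
GROWTH_RATE = 4  # channels each layer adds; small here for readability

def dense_layer(features):
    """H_l: consume the concatenation of ALL earlier feature maps and emit
    GROWTH_RATE new channels (random weights; the shapes are the point)."""
    concat = np.concatenate(features, axis=-1)        # [x_0, ..., x_{l-1}]
    W = rng.normal(size=(concat.shape[-1], GROWTH_RATE))
    return np.maximum(concat @ W, 0.0)                # linear + ReLU

x0 = rng.normal(size=(2, 8))   # batch of 2, 8 input channels
features = [x0]
for _ in range(3):
    features.append(dense_layer(features))

print([f.shape[-1] for f in features])  # [8, 4, 4, 4]: each layer adds 4 channels
```

Each successive layer sees a wider input (8, then 12, then 16 channels) but emits only a small fixed number of new channels, which is where the parameter savings come from.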
Common Confusions
Skip connections do not mean the network ignores depth
The skip connection provides an identity path, but the network still learns through the nonlinear branch. The final output is x + F(x), not just x. Deep ResNets outperform shallow ones, showing that the F(x) terms contribute. The skip connection makes optimization feasible, not trivial.
Vanishing gradients vs degradation problem
Vanishing gradients cause slow training. The degradation problem is different: deeper plain networks achieve higher training error than shallower ones, even though the deeper network could, in principle, copy the shallower one and set extra layers to identity. Skip connections address both, but the degradation problem was the primary motivation in the original ResNet paper.
Canonical Examples
ResNet-50 block structure
ResNet-50 uses "bottleneck" blocks: 1x1 conv (reduce channels), 3x3 conv (spatial filtering), 1x1 conv (restore channels), plus the skip connection. For a block with input dimension 256 and bottleneck dimension 64: the 1x1 conv maps 256 to 64, the 3x3 conv operates on 64 channels, and the final 1x1 maps 64 back to 256. The skip adds the original 256-dim input to the output.
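The arithmetic behind the bottleneck design can be checked directly. Ignoring biases and BN parameters, the 1x1-3x3-1x1 stack is roughly 17x cheaper than two plain 3x3 convs at the full width of 256 channels:

```python
# Weight counts (biases and BN omitted) for the bottleneck block above,
# versus two plain 3x3 convs at the full width of 256 channels.
c_in, c_mid = 256, 64

bottleneck = (
    1 * 1 * c_in * c_mid      # 1x1 reduce: 256 -> 64
    + 3 * 3 * c_mid * c_mid   # 3x3 spatial filtering at width 64
    + 1 * 1 * c_mid * c_in    # 1x1 restore: 64 -> 256
)
plain = 2 * (3 * 3 * c_in * c_in)

print(bottleneck)  # 69632
print(plain)       # 1179648 (~17x more)
```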
Exercises
Problem
Consider a 3-layer plain network where each layer multiplies gradients by 0.5 (due to saturation). What is the gradient at the input? Now add skip connections to make it a 3-block ResNet. What changes qualitatively?
Problem
Why does the ODE perspective suggest that ResNets with step size 1 might be suboptimal? What architectural modification does this suggest?
References
Canonical:
- He, Zhang, Ren, Sun, "Deep Residual Learning for Image Recognition" (CVPR 2016), Sections 1-3
- Huang, Liu, van der Maaten, Weinberger, "Densely Connected Convolutional Networks" (2017)
- Veit, Wilber, Belongie, "Residual Networks Behave Like Ensembles of Relatively Shallow Networks" (NeurIPS 2016)
Current:
- Chen, Rubanova, Bettencourt, Duvenaud, "Neural Ordinary Differential Equations" (NeurIPS 2018)
- He, Zhang, Ren, Sun, "Identity Mappings in Deep Residual Networks" (ECCV 2016). Pre-activation variant.
- Srivastava, Greff, Schmidhuber, "Highway Networks" (2015). Gated precursor to residual connections.
Next Topics
- Batch normalization: the companion technique that makes deep ResNets trainable
- Convolutional neural networks: the architecture family where ResNets had the biggest impact
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Feedforward Networks and Backpropagation (Layer 2)
- Differentiation in Rn (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
- Matrix Calculus (Layer 1)
- The Jacobian Matrix (Layer 0A)
- The Hessian Matrix (Layer 0A)
- Activation Functions (Layer 1)
- Convex Optimization Basics (Layer 1)
- Matrix Operations and Properties (Layer 0A)