
ML Methods

Skip Connections and ResNets

Residual connections let gradients flow through identity paths, enabling training of very deep networks. ResNets learn residual functions F(x) = H(x) - x, which is easier than learning H(x) directly.


Why This Matters

Before ResNets (He et al. 2015), training networks deeper than about 20 layers was unreliable. Adding more layers to a plain network actually increased training error, not just test error. This was not overfitting; it was an optimization failure.

The fix was simple: add the input of a block directly to its output. This single architectural change enabled training of networks with 100, 1000, and even 1200+ layers. ResNet won the 2015 ImageNet competition by a wide margin and became the default architecture for deep learning.

Figure: a residual block. The branch $F(x)$ (Conv + BN + ReLU, then Conv + BN) runs alongside an identity skip path; the two are summed as $F(x) + x$, then ReLU is applied. Gradient flows through the skip path even when the $F(x)$ gradients vanish, which is why ResNets train at 100+ layers.

The Residual Block

Definition

Residual Block

A residual block computes:

$$y = F(x) + x$$

where $F$ is a sequence of layers (typically conv-BN-ReLU-conv-BN) and $x$ is the input. The $+\,x$ term is the skip connection (or shortcut connection). The network learns the residual $F(x) = H(x) - x$ rather than the desired mapping $H(x)$ directly.

When dimensions change (e.g., spatial downsampling or channel expansion), a linear projection $W_s$ is applied to the shortcut: $y = F(x) + W_s x$.
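A minimal sketch of the forward pass in numpy. Dense matrices stand in for the conv-BN layers of a real block, and all weights here are illustrative, not from any trained model:

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def residual_block(x, W1, W2, Ws=None):
    """Compute y = relu(F(x) + shortcut), with a two-layer branch F.

    Ws is the optional projection applied to the shortcut when the
    input and output dimensions differ; otherwise the shortcut is x.
    """
    F = W2 @ relu(W1 @ x)           # residual branch F(x)
    shortcut = x if Ws is None else Ws @ x
    return relu(F + shortcut)       # add skip, then nonlinearity

rng = np.random.default_rng(0)
x = rng.standard_normal(8)
W1 = rng.standard_normal((8, 8)) * 0.1
W2 = rng.standard_normal((8, 8)) * 0.1

y_full = residual_block(x, W1, W2)

# With the branch weights zeroed, F(x) = 0 and the block reduces to
# relu(x): identity up to the final activation.
y = residual_block(x, np.zeros((8, 8)), np.zeros((8, 8)))
```

Note that with post-activation blocks (as above), zeroing $F$ gives $\mathrm{relu}(x)$ rather than $x$ exactly; the pre-activation variant (He et al. 2016) makes the skip path a true identity.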

Why Residual Learning Works

The core insight: if the identity mapping $H(x) = x$ is close to optimal for some layer, then pushing $F(x) \to 0$ is easier than learning $H(x) = x$ from scratch with a stack of nonlinear layers. Residual learning biases the network toward identity-like functions, which provides a good default for deep layers.

Proposition

Gradient Flow Through Residual Connections

Statement

Consider $L$ residual blocks in sequence. Let $x_0$ be the input and $x_l = x_{l-1} + F_l(x_{l-1})$ for $l = 1, \ldots, L$. The gradient of the loss $\mathcal{L}$ with respect to $x_l$ satisfies:

$$\frac{\partial \mathcal{L}}{\partial x_l} = \frac{\partial \mathcal{L}}{\partial x_L} \prod_{k=l+1}^{L} \left(I + \frac{\partial F_k}{\partial x_{k-1}}\right)$$

Expanding the product, one term is the bare identity, so there is always a direct path carrying $\frac{\partial \mathcal{L}}{\partial x_L}$ that passes through no weight layers.

Intuition

In a plain network, gradients must pass through every weight matrix, and the product of many matrices can vanish or explode. In a ResNet, the product $(I + J_1)(I + J_2) \cdots (I + J_k)$ always contains the term $I$ (the identity), which corresponds to the gradient flowing directly through all skip connections without attenuation.
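A toy numerical check of this contrast in numpy. The small random matrices below are assumed stand-ins for per-layer Jacobians with norm well below 1 (as in saturated layers); the sizes and scale are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
d, L = 4, 30

# One small Jacobian J_k per block's branch F_k.
Js = [0.02 * rng.standard_normal((d, d)) for _ in range(L)]

# Plain network: the end-to-end Jacobian is the product of the J_k,
# so its norm shrinks geometrically with depth.
plain = np.linalg.multi_dot(Js)

# ResNet: each factor is (I + J_k), so expanding the product always
# leaves a bare identity term that no J_k attenuates.
resnet = np.linalg.multi_dot([np.eye(d) + J for J in Js])

print(np.linalg.norm(plain, 2))   # vanishingly small
print(np.linalg.norm(resnet, 2))  # order 1: identity path survives
```

The same identity term is what the proof sketch below the proposition isolates when expanding the product of $(I + J_k)$ factors.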

Proof Sketch

By the chain rule, $\frac{\partial x_{l+1}}{\partial x_l} = I + \frac{\partial F_{l+1}}{\partial x_l}$. Compose these Jacobians from layer $l$ to $L$. Expanding the product of $(I + J_k)$ terms, the pure identity term $I$ always survives as one summand, giving an unattenuated gradient path.

Why It Matters

This explains why ResNets train successfully at extreme depths. The gradient does not need to survive multiplication by $L$ Jacobian matrices. There is always an identity shortcut. This is the precise mechanism by which skip connections solve the vanishing gradient problem.

Failure Mode

Skip connections do not solve all optimization problems. If $F_l$ has very large Jacobians, gradients can still explode. Batch normalization and careful initialization remain necessary. Skip connections also do not address the approximation question of whether depth actually helps for a given task.

Connection to Continuous Dynamics

Proposition

ResNet as Euler Discretization

Statement

The residual update $x_{l+1} = x_l + F_\theta(x_l, l)$ is the forward Euler discretization of the ODE:

$$\frac{dx}{dt} = F_\theta(x(t), t)$$

with step size $\Delta t = 1$. In the limit of infinitely many layers with infinitesimal step size, a ResNet becomes a neural ODE.

Intuition

Each residual block is a small perturbation of the identity. Stacking many such blocks traces out a continuous trajectory through feature space. This perspective unifies ResNets with dynamical systems theory and led to Neural ODEs (Chen et al. 2018), which parameterize the dynamics directly.

Proof Sketch

Replace the discrete index $l$ with continuous time $t$, and the update $x_{l+1} - x_l = F(x_l)$ with $dx/dt = F(x(t), t)$. The forward Euler method discretizes this ODE as $x(t + \Delta t) \approx x(t) + \Delta t \cdot F(x(t), t)$. Setting $\Delta t = 1$ recovers the residual block.
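A small numerical illustration of this correspondence, using assumed linear dynamics $F(x) = -0.1x$ (chosen so the ODE has the closed-form solution $x(t) = x_0 e^{-0.1t}$):

```python
import numpy as np

def euler_resnet(x0, n_layers, dt, decay=0.1):
    """Stack of residual updates x <- x + dt * F(x) with F(x) = -decay * x.

    dt = 1 matches a standard residual block; shrinking dt while keeping
    n_layers * dt fixed approaches the ODE dx/dt = -decay * x.
    """
    x = x0
    for _ in range(n_layers):
        x = x + dt * (-decay * x)
    return x

x0, T = 1.0, 10.0
exact = x0 * np.exp(-0.1 * T)        # closed-form ODE solution at t = T

coarse = euler_resnet(x0, 10, 1.0)   # 10 "residual blocks", step size 1
fine = euler_resnet(x0, 1000, 0.01)  # 1000 blocks, step size 0.01

print(abs(coarse - exact), abs(fine - exact))  # fine is much closer
```

The first-order error of the coarse discretization is also the point of the Failure Mode note below: finite-depth ResNets are crude integrators.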

Why It Matters

This connection brings ODE solvers, adjoint methods for memory-efficient backprop, and stability theory into the neural network toolkit. It also suggests that ResNets with many layers are implicitly performing numerical integration.

Failure Mode

The Euler discretization is first-order and can be unstable for stiff dynamics. In practice, ResNets do not use adaptive step sizes or higher-order integration methods. The ODE perspective is most useful as a conceptual framework, not a literal description of what finite-depth ResNets compute.

DenseNet: Dense Connections

DenseNet (Huang et al. 2017) extends the skip connection idea: instead of adding only the immediate input, each layer receives the concatenation of all preceding feature maps:

$$x_l = H_l([x_0, x_1, \ldots, x_{l-1}])$$

where $[\cdot]$ denotes concatenation along the channel axis. This gives each layer direct access to all earlier features, encouraging feature reuse and reducing the total number of parameters needed.
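A sketch of the dense connectivity pattern in numpy. Random matrices stand in for the learned $H_l$ (in DenseNet these are BN-ReLU-conv layers), and the channel counts below are illustrative:

```python
import numpy as np

def dense_block(x0, n_layers, growth_rate, rng):
    """Each layer sees the channel-wise concatenation of all earlier
    feature maps and emits `growth_rate` new channels."""
    features = [x0]
    for _ in range(n_layers):
        inp = np.concatenate(features, axis=0)     # [x0, x1, ..., x_{l-1}]
        H = rng.standard_normal((growth_rate, inp.shape[0])) * 0.01
        features.append(np.maximum(H @ inp, 0.0))  # stand-in for H_l
    return np.concatenate(features, axis=0)

rng = np.random.default_rng(0)
out = dense_block(np.ones(16), n_layers=4, growth_rate=12, rng=rng)
print(out.shape)  # channels: 16 + 4 * 12 = 64
```

The linear channel growth (input channels plus `n_layers * growth_rate`) is why DenseNets use small growth rates and transition layers between blocks.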

Common Confusions

Watch Out

Skip connections do not mean the network ignores depth

The skip connection provides an identity path, but the network still learns $F(x)$ through the nonlinear branch. The final output is $F(x) + x$, not just $x$. Deep ResNets outperform shallow ones, showing that the $F(x)$ terms contribute. The skip connection makes optimization feasible, not trivial.

Watch Out

Vanishing gradients vs degradation problem

Vanishing gradients cause slow training. The degradation problem is different: deeper plain networks achieve higher training error than shallower ones, even though the deeper network could, in principle, copy the shallower one and set extra layers to identity. Skip connections address both, but the degradation problem was the primary motivation in the original ResNet paper.

Canonical Examples

Example

ResNet-50 block structure

ResNet-50 uses "bottleneck" blocks: 1x1 conv (reduce channels), 3x3 conv (spatial filtering), 1x1 conv (restore channels), plus the skip connection. For a block with input dimension 256 and bottleneck dimension 64: the 1x1 conv maps 256 to 64, the 3x3 conv operates on 64 channels, and the final 1x1 maps 64 back to 256. The skip adds the original 256-dim input to the output.
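The parameter counts for this bottleneck design can be checked directly (weights only, bias and BN parameters omitted):

```python
# Bottleneck block: input width 256, bottleneck width 64.
conv1 = 1 * 1 * 256 * 64    # 1x1 reduce: 256 -> 64 channels
conv2 = 3 * 3 * 64 * 64     # 3x3 spatial filtering at 64 channels
conv3 = 1 * 1 * 64 * 256    # 1x1 restore: 64 -> 256 channels
bottleneck = conv1 + conv2 + conv3

# For comparison: a single 3x3 conv at full 256-channel width.
plain = 3 * 3 * 256 * 256

print(bottleneck, plain)    # 69632 vs 589824
```

The bottleneck does its expensive 3x3 filtering at reduced width, which is what makes 50+ layer networks affordable.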

Exercises

ExerciseCore

Problem

Consider a 3-layer plain network where each layer multiplies gradients by 0.5 (due to saturation). What is the gradient at the input? Now add skip connections to make it a 3-block ResNet. What changes qualitatively?

ExerciseAdvanced

Problem

Why does the ODE perspective suggest that ResNets with step size 1 might be suboptimal? What architectural modification does this suggest?

References

Canonical:

  • He, Zhang, Ren, Sun, "Deep Residual Learning for Image Recognition" (CVPR 2016), Sections 1-3
  • Huang, Liu, van der Maaten, Weinberger, "Densely Connected Convolutional Networks" (2017)
  • Veit, Wilber, Belongie, "Residual Networks Behave Like Ensembles of Relatively Shallow Networks" (NeurIPS 2016)

Current:

  • Chen, Rubanova, Bettencourt, Duvenaud, "Neural Ordinary Differential Equations" (NeurIPS 2018)
  • He, Zhang, Ren, Sun, "Identity Mappings in Deep Residual Networks" (ECCV 2016). Pre-activation variant.
  • Srivastava, Greff, Schmidhuber, "Highway Networks" (2015). Gated precursor to residual connections.

Last reviewed: April 2026