
ML Methods

Skip Connections and ResNets

Residual connections let gradients flow through identity paths, enabling training of very deep networks. ResNets learn residual functions F(x) = H(x) - x, which is easier than learning H(x) directly.


Why This Matters

Before ResNets (He et al. 2015), training networks deeper than about 20 layers was unreliable. Adding more layers to a plain network actually increased training error, not just test error. This was not overfitting; it was an optimization failure.

The fix was simple: add the input of a block directly to its output. This single architectural change enabled training of networks with 100, 1000, and even 1200+ layers. ResNet won the 2015 ImageNet competition by a wide margin and became the default architecture for deep learning.

Figure: a residual block. The branch $F(x)$ (Conv + BN + ReLU, then Conv + BN) runs alongside an identity skip path; the two are summed as $F(x) + x$, then ReLU is applied. Gradient flows through the skip path even when the $F(x)$ gradients vanish, which is why ResNets train at 100+ layers.

The Residual Block

Definition

Residual Block

A residual block computes:

$$y = F(x) + x$$

where $F$ is a sequence of layers (typically conv-BN-ReLU-conv-BN) and $x$ is the input. The $+\,x$ term is the skip connection (or shortcut connection). The network learns the residual $F(x) = H(x) - x$ rather than the desired mapping $H(x)$ directly.

When dimensions change (e.g., spatial downsampling or channel expansion), a linear projection $W_s$ is applied to the shortcut: $y = F(x) + W_s x$.
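A minimal sketch of the forward pass in numpy. Dense matrices stand in for the conv-BN layers of a real block, and all weights here are illustrative, not from any trained model:

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def residual_block(x, W1, W2, Ws=None):
    """Compute y = relu(F(x) + shortcut), with a two-layer branch F.

    Ws is the optional projection applied to the shortcut when the
    input and output dimensions differ; otherwise the shortcut is x.
    """
    F = W2 @ relu(W1 @ x)           # residual branch F(x)
    shortcut = x if Ws is None else Ws @ x
    return relu(F + shortcut)       # add skip, then nonlinearity

rng = np.random.default_rng(0)
x = rng.standard_normal(8)
W1 = rng.standard_normal((8, 8)) * 0.1
W2 = rng.standard_normal((8, 8)) * 0.1

y_full = residual_block(x, W1, W2)

# With the branch weights zeroed, F(x) = 0 and the block reduces to
# relu(x): identity up to the final activation.
y = residual_block(x, np.zeros((8, 8)), np.zeros((8, 8)))
```

Note that with post-activation blocks (as above), zeroing $F$ gives $\mathrm{relu}(x)$ rather than $x$ exactly; the pre-activation variant (He et al. 2016) makes the skip path a true identity.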

Why Residual Learning Works

The core insight: if the identity mapping $H(x) = x$ is close to optimal for some layer, then pushing $F(x) \to 0$ is easier than learning $H(x) = x$ from scratch with a stack of nonlinear layers. Residual learning biases the network toward identity-like functions, which provides a good default for deep layers.

Proposition

Gradient Flow Through Residual Connections

Statement

Consider $L$ residual blocks in sequence. Let $x_0$ be the input and $x_l = x_{l-1} + F_l(x_{l-1})$ for $l = 1, \ldots, L$. The gradient of the loss $\mathcal{L}$ with respect to $x_l$ satisfies:

$$\frac{\partial \mathcal{L}}{\partial x_l} = \frac{\partial \mathcal{L}}{\partial x_L} \prod_{k=l+1}^{L} \left(I + \frac{\partial F_k}{\partial x_{k-1}}\right)$$

Expanding the product, one term is the bare identity, so there is always a direct path carrying $\frac{\partial \mathcal{L}}{\partial x_L}$ that passes through no weight layers.

Intuition

In a plain network, gradients must pass through every weight matrix, and the product of many matrices can vanish or explode. In a ResNet, the product $(I + J_1)(I + J_2) \cdots (I + J_k)$ always contains the term $I$ (the identity), which corresponds to the gradient flowing directly through all skip connections without attenuation.
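A toy numerical check of this contrast in numpy. The small random matrices below are assumed stand-ins for per-layer Jacobians with norm well below 1 (as in saturated layers); the sizes and scale are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
d, L = 4, 30

# One small Jacobian J_k per block's branch F_k.
Js = [0.02 * rng.standard_normal((d, d)) for _ in range(L)]

# Plain network: the end-to-end Jacobian is the product of the J_k,
# so its norm shrinks geometrically with depth.
plain = np.linalg.multi_dot(Js)

# ResNet: each factor is (I + J_k), so expanding the product always
# leaves a bare identity term that no J_k attenuates.
resnet = np.linalg.multi_dot([np.eye(d) + J for J in Js])

print(np.linalg.norm(plain, 2))   # vanishingly small
print(np.linalg.norm(resnet, 2))  # order 1: identity path survives
```

The same identity term is what the proof sketch below the proposition isolates when expanding the product of $(I + J_k)$ factors.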

Proof Sketch

By the chain rule, $\frac{\partial x_{l+1}}{\partial x_l} = I + \frac{\partial F_{l+1}}{\partial x_l}$. Compose these Jacobians from layer $l$ to $L$. Expanding the product of $(I + J_k)$ terms, the pure identity term $I$ always survives as one summand, giving an unattenuated gradient path.

Why It Matters

This explains why ResNets train successfully at extreme depths. The gradient does not need to survive multiplication by $L$ Jacobian matrices. There is always an identity shortcut. This is the precise mechanism by which skip connections solve the vanishing gradient problem.

Failure Mode

Skip connections do not solve all optimization problems. If $F_l$ has very large Jacobians, gradients can still explode. Batch normalization and careful initialization remain necessary. Skip connections also do not address the approximation question of whether depth actually helps for a given task.

Connection to Continuous Dynamics

Proposition

ResNet as Euler Discretization

Statement

The residual update $x_{l+1} = x_l + F_\theta(x_l, l)$ is the forward Euler discretization of the ODE:

$$\frac{dx}{dt} = F_\theta(x(t), t)$$

with step size $\Delta t = 1$. In the limit of infinitely many layers with infinitesimal step size, a ResNet becomes a neural ODE.

Intuition

Each residual block is a small perturbation of the identity. Stacking many such blocks traces out a continuous trajectory through feature space. This perspective unifies ResNets with dynamical systems theory and led to Neural ODEs (Chen et al. 2018), which parameterize the dynamics directly.

Proof Sketch

Replace the discrete index $l$ with continuous time $t$, and the update $x_{l+1} - x_l = F(x_l)$ with $dx/dt = F(x(t), t)$. The forward Euler method discretizes this ODE as $x(t + \Delta t) \approx x(t) + \Delta t \cdot F(x(t), t)$. Setting $\Delta t = 1$ recovers the residual block.
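A small numerical illustration of this correspondence, using assumed linear dynamics $F(x) = -0.1x$ (chosen so the ODE has the closed-form solution $x(t) = x_0 e^{-0.1t}$):

```python
import numpy as np

def euler_resnet(x0, n_layers, dt, decay=0.1):
    """Stack of residual updates x <- x + dt * F(x) with F(x) = -decay * x.

    dt = 1 matches a standard residual block; shrinking dt while keeping
    n_layers * dt fixed approaches the ODE dx/dt = -decay * x.
    """
    x = x0
    for _ in range(n_layers):
        x = x + dt * (-decay * x)
    return x

x0, T = 1.0, 10.0
exact = x0 * np.exp(-0.1 * T)        # closed-form ODE solution at t = T

coarse = euler_resnet(x0, 10, 1.0)   # 10 "residual blocks", step size 1
fine = euler_resnet(x0, 1000, 0.01)  # 1000 blocks, step size 0.01

print(abs(coarse - exact), abs(fine - exact))  # fine is much closer
```

The first-order error of the coarse discretization is also the point of the Failure Mode note below: finite-depth ResNets are crude integrators.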

Why It Matters

This connection brings ODE solvers, adjoint methods for memory-efficient backprop, and stability theory into the neural network toolkit. It also suggests that ResNets with many layers are implicitly performing numerical integration.

Failure Mode

The Euler discretization is first-order and can be unstable for stiff dynamics. In practice, ResNets do not use adaptive step sizes or higher-order integration methods. The ODE perspective is most useful as a conceptual framework, not a literal description of what finite-depth ResNets compute.

DenseNet: Dense Connections

DenseNet (Huang et al. 2017) extends the skip connection idea: instead of adding only the immediate input, each layer receives the concatenation of all preceding feature maps:

$$x_l = H_l([x_0, x_1, \ldots, x_{l-1}])$$

where $[\cdot]$ denotes concatenation along the channel axis. This gives each layer direct access to all earlier features, encouraging feature reuse and reducing the total number of parameters needed.
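A sketch of the dense connectivity pattern in numpy. Random matrices stand in for the learned $H_l$ (in DenseNet these are BN-ReLU-conv layers), and the channel counts below are illustrative:

```python
import numpy as np

def dense_block(x0, n_layers, growth_rate, rng):
    """Each layer sees the channel-wise concatenation of all earlier
    feature maps and emits `growth_rate` new channels."""
    features = [x0]
    for _ in range(n_layers):
        inp = np.concatenate(features, axis=0)     # [x0, x1, ..., x_{l-1}]
        H = rng.standard_normal((growth_rate, inp.shape[0])) * 0.01
        features.append(np.maximum(H @ inp, 0.0))  # stand-in for H_l
    return np.concatenate(features, axis=0)

rng = np.random.default_rng(0)
out = dense_block(np.ones(16), n_layers=4, growth_rate=12, rng=rng)
print(out.shape)  # channels: 16 + 4 * 12 = 64
```

The linear channel growth (input channels plus `n_layers * growth_rate`) is why DenseNets use small growth rates and transition layers between blocks.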

Common Confusions

Watch Out

Skip connections do not mean the network ignores depth

The skip connection provides an identity path, but the network still learns $F(x)$ through the nonlinear branch. The final output is $F(x) + x$, not just $x$. Deep ResNets outperform shallow ones, showing that the $F(x)$ terms contribute. The skip connection makes optimization feasible, not trivial.

Watch Out

Vanishing gradients vs degradation problem

Vanishing gradients cause slow training. The degradation problem is different: deeper plain networks achieve higher training error than shallower ones, even though the deeper network could, in principle, copy the shallower one and set extra layers to identity. Skip connections address both, but the degradation problem was the primary motivation in the original ResNet paper.

Canonical Examples

Example

ResNet-50 block structure

ResNet-50 uses "bottleneck" blocks: 1x1 conv (reduce channels), 3x3 conv (spatial filtering), 1x1 conv (restore channels), plus the skip connection. For a block with input dimension 256 and bottleneck dimension 64: the 1x1 conv maps 256 to 64, the 3x3 conv operates on 64 channels, and the final 1x1 maps 64 back to 256. The skip adds the original 256-dim input to the output.
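The parameter counts for this bottleneck design can be checked directly (weights only, bias and BN parameters omitted):

```python
# Bottleneck block: input width 256, bottleneck width 64.
conv1 = 1 * 1 * 256 * 64    # 1x1 reduce: 256 -> 64 channels
conv2 = 3 * 3 * 64 * 64     # 3x3 spatial filtering at 64 channels
conv3 = 1 * 1 * 64 * 256    # 1x1 restore: 64 -> 256 channels
bottleneck = conv1 + conv2 + conv3

# For comparison: a single 3x3 conv at full 256-channel width.
plain = 3 * 3 * 256 * 256

print(bottleneck, plain)    # 69632 vs 589824
```

The bottleneck does its expensive 3x3 filtering at reduced width, which is what makes 50+ layer networks affordable.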

Exercises

ExerciseCore

Problem

Consider a 3-layer plain network where each layer multiplies gradients by 0.5 (due to saturation). What is the gradient at the input? Now add skip connections to make it a 3-block ResNet. What changes qualitatively?

ExerciseAdvanced

Problem

Why does the ODE perspective suggest that ResNets with step size 1 might be suboptimal? What architectural modification does this suggest?

References

Canonical:

  • He, Zhang, Ren, Sun, "Deep Residual Learning for Image Recognition" (CVPR 2016), Sections 1-3
  • Huang, Liu, van der Maaten, Weinberger, "Densely Connected Convolutional Networks" (2017)
  • Veit, Wilber, Belongie, "Residual Networks Behave Like Ensembles of Relatively Shallow Networks" (NeurIPS 2016)

Current:

  • Chen, Rubanova, Bettencourt, Duvenaud, "Neural Ordinary Differential Equations" (NeurIPS 2018)
  • He, Zhang, Ren, Sun, "Identity Mappings in Deep Residual Networks" (ECCV 2016). Pre-activation variant.
  • Srivastava, Greff, Schmidhuber, "Highway Networks" (2015). Gated precursor to residual connections.

Last reviewed: April 2026