Paper breakdown

Deep Residual Learning for Image Recognition

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun · 2015 · CVPR 2016

Introduces residual connections — adding the input of a block to its output — to enable training of networks more than an order of magnitude deeper than was previously stable. Wins ImageNet 2015 with a 152-layer model.

Overview

He, Zhang, Ren, and Sun (2015) reported a phenomenon they called the degradation problem. As deep convolutional networks were stacked beyond about 20 layers, training error went up, not down. This was not overfitting — the deeper network had higher training error too — and it was not vanishing gradients in the usual sense, because batch normalization was already in use. The networks simply could not be optimised.

The paper's solution is one line of arithmetic. Replace each block's mapping $H(x)$ with $F(x) + x$, where $F$ is what the block parameterises and $+x$ is an identity shortcut. The block now learns the residual $F(x) = H(x) - x$ rather than the full transformation. If the optimal $H$ is close to the identity (which it often is in early training), then optimising $F$ to be close to zero is easier than optimising the un-shortcutted version to be close to the identity.
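
A quick way to see the "close to the identity" argument numerically (a toy sketch assuming PyTorch; the class below is ours, not the paper's): if the learned part $F$ is initialised to zero, every block is exactly the identity, so even a very deep stack starts training as a harmless no-op rather than a scrambled transformation.

```python
import torch
import torch.nn as nn

# Toy residual block: F is a single linear layer initialised to zero,
# so y = F(x) + x reduces to y = x at the start of training.
class ToyResidual(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.linear = nn.Linear(dim, dim)
        nn.init.zeros_(self.linear.weight)   # F(x) = 0 at initialisation
        nn.init.zeros_(self.linear.bias)

    def forward(self, x):
        return x + self.linear(x)            # y = F(x) + x

blocks = nn.Sequential(*[ToyResidual(16) for _ in range(100)])
x = torch.randn(4, 16)
print(torch.equal(blocks(x), x))             # True: 100 blocks deep, still the identity
```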

ResNet-152 won ILSVRC 2015 with 3.57% top-5 error. The architecture, with minor changes, is still the default convolutional backbone in 2026, and the residual connection itself appears in every transformer.

Mathematical Contributions

The residual block

A plain block computes $y = H(x)$ for some learned $H$. The residual block computes:

$$y = F(x; \{W_i\}) + x$$

where $F$ is typically two or three convolutional layers with batch normalization and ReLU. The $+x$ shortcut is an identity map when input and output dimensions match; when they differ, the shortcut is a $1 \times 1$ projection $W_s x$. The full network composes residual blocks: $x_{l+1} = F_l(x_l) + x_l$. Unrolling gives:

$$x_L = x_0 + \sum_{l=0}^{L-1} F_l(x_l)$$

So the deep representation is a sum of contributions from every block, not a long chain of multiplied transformations. This is the key structural property.
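
A minimal sketch of the basic two-convolution block in PyTorch (names are ours, not the paper's). When the stride or channel count changes, the shortcut becomes the $1 \times 1$ projection $W_s x$ described above; otherwise it is a plain identity.

```python
import torch
import torch.nn as nn

class BasicBlock(nn.Module):
    def __init__(self, in_ch: int, out_ch: int, stride: int = 1):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_ch)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_ch)
        self.relu = nn.ReLU()
        if stride != 1 or in_ch != out_ch:
            # Dimensions differ: 1x1 projection shortcut W_s x.
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 1, stride=stride, bias=False),
                nn.BatchNorm2d(out_ch),
            )
        else:
            # Dimensions match: identity shortcut.
            self.shortcut = nn.Identity()

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))   # first conv-BN-ReLU of F
        out = self.bn2(self.conv2(out))             # second conv-BN of F
        return self.relu(out + self.shortcut(x))    # y = F(x) + x, then ReLU

# Example: doubling channels while halving resolution forces the projection shortcut.
y = BasicBlock(64, 128, stride=2)(torch.randn(1, 64, 32, 32))   # shape (1, 128, 16, 16)
```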

Gradient signal across depth

By the chain rule, the gradient of the loss $\mathcal{L}$ with respect to an early input $x_0$ is:

$$\frac{\partial \mathcal{L}}{\partial x_0} = \frac{\partial \mathcal{L}}{\partial x_L}\left(I + \frac{\partial}{\partial x_0}\sum_{l=0}^{L-1} F_l(x_l)\right)$$

The identity term $I$ is the load-bearing piece. Even if the $F_l$ Jacobians vanish or explode, the $I$ ensures the gradient signal at $x_0$ has a non-zero baseline equal to $\partial \mathcal{L} / \partial x_L$. This is why residual networks train at depths where plain networks diverge. See gradient flow and vanishing gradients for the full treatment.
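
A hypothetical numerical check of this claim (our sketch, PyTorch assumed, not an experiment from the paper): backpropagate through 100 stacked tanh layers, with and without identity shortcuts, and compare how much gradient reaches the input.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
depth, dim = 100, 64
layers = [nn.Sequential(nn.Linear(dim, dim), nn.Tanh()) for _ in range(depth)]

def input_grad_norm(residual: bool) -> float:
    x = torch.randn(1, dim, requires_grad=True)
    h = x
    for layer in layers:
        if residual:
            h = h + layer(h)   # Jacobian of each step is I + dF/dh
        else:
            h = layer(h)       # Jacobian is dF/dh alone
    h.sum().backward()
    return x.grad.norm().item()

# The plain stack multiplies 100 small Jacobians, so almost nothing reaches x;
# the residual stack keeps a sizeable gradient thanks to the identity term.
print("plain   :", input_grad_norm(residual=False))
print("residual:", input_grad_norm(residual=True))
```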

The degradation experiment

The paper trains a 20-layer and a 56-layer plain CNN on CIFAR-10. The 56-layer network has higher training and test error than the 20-layer one. With residual connections (ResNet-20 and ResNet-56), the deeper network has lower error. Because the gap appears in training error, overfitting cannot be the cause; the experiment isolates the degradation to an optimisation difficulty that the identity shortcut fixes.

Bottleneck blocks

For ResNet-50/101/152, each block is a three-layer bottleneck: a $1 \times 1$ conv reducing channels, a $3 \times 3$ conv at reduced width, and a $1 \times 1$ conv expanding channels. This keeps the parameter count manageable at depth and is the form that propagated into later architectures.
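
A sketch of the bottleneck block (PyTorch assumed; naming is ours). Channels are squeezed by the first $1 \times 1$ conv, processed at reduced width by the $3 \times 3$ conv, then expanded by a second $1 \times 1$ conv to four times the bottleneck width, with the usual shortcut added before the final ReLU.

```python
import torch
import torch.nn as nn

class Bottleneck(nn.Module):
    expansion = 4   # output channels = 4 x bottleneck width

    def __init__(self, in_ch: int, width: int, stride: int = 1):
        super().__init__()
        out_ch = width * self.expansion
        self.reduce = nn.Sequential(   # 1x1: shrink channels
            nn.Conv2d(in_ch, width, 1, bias=False), nn.BatchNorm2d(width), nn.ReLU())
        self.conv3x3 = nn.Sequential(  # 3x3 at reduced width
            nn.Conv2d(width, width, 3, stride=stride, padding=1, bias=False),
            nn.BatchNorm2d(width), nn.ReLU())
        self.expand = nn.Sequential(   # 1x1: restore channels
            nn.Conv2d(width, out_ch, 1, bias=False), nn.BatchNorm2d(out_ch))
        if stride == 1 and in_ch == out_ch:
            self.shortcut = nn.Identity()
        else:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 1, stride=stride, bias=False),
                nn.BatchNorm2d(out_ch))
        self.relu = nn.ReLU()

    def forward(self, x):
        out = self.expand(self.conv3x3(self.reduce(x)))   # F(x): 1x1 -> 3x3 -> 1x1
        return self.relu(out + self.shortcut(x))          # F(x) + shortcut, then ReLU

# e.g. a ResNet-50 stage stacks blocks like Bottleneck(256, 64): width 64, output 256 channels.
```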

Why It Matters Now

The residual connection is the single most-copied idea in modern deep-learning architecture. Every transformer block, in encoders and decoders alike, wraps its attention and MLP sublayers in residuals: $x \to x + \text{Attention}(x)$ then $x \to x + \text{MLP}(x)$. The mechanistic-interpretability literature calls the running sum the residual stream and treats it as the network's working memory. None of that exists without this paper.
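
For comparison, a minimal sketch of the pre-norm transformer block pattern common in modern models (our code, PyTorch assumed, not taken from either paper): every sublayer writes an update into the running sum $x$, the residual stream.

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    def __init__(self, d_model: int = 256, n_heads: int = 4):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model))

    def forward(self, x):
        h = self.ln1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]   # x <- x + Attention(x)
        x = x + self.mlp(self.ln2(x))                        # x <- x + MLP(x)
        return x

stream = TransformerBlock()(torch.randn(2, 10, 256))   # (batch, sequence, d_model)
```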

The argument also matured the field's view of depth. Before ResNet, "make it deeper" sometimes made results worse; after ResNet, depth became a free parameter constrained mostly by compute. That shift in expectations is part of what made the very deep transformer stacks behind GPT-3 and its successors feasible.

The paper is also a good methodology lesson: the key contribution is one architectural change, defended with a single decisive experiment (degradation on plain vs residual nets). It is a short paper that does not overclaim.

References

Canonical:

  • He, K., Zhang, X., Ren, S., & Sun, J. (2016). "Deep Residual Learning for Image Recognition." CVPR. arXiv:1512.03385.
  • He, K., Zhang, X., Ren, S., & Sun, J. (2016). "Identity Mappings in Deep Residual Networks." ECCV. arXiv:1603.05027. The pre-activation refinement.

Direct precursors:

  • Srivastava, R. K., Greff, K., & Schmidhuber, J. (2015). "Highway Networks." arXiv:1505.00387. Gated shortcuts; ResNet replaces the gate with identity.
  • Ioffe, S., & Szegedy, C. (2015). "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift." ICML. arXiv:1502.03167.

Theoretical analyses of residual networks:

  • Hardt, M., & Ma, T. (2017). "Identity Matters in Deep Learning." ICLR. arXiv:1611.04231.
  • Veit, A., Wilber, M., & Belongie, S. (2016). "Residual Networks Behave Like Ensembles of Relatively Shallow Networks." NeurIPS. arXiv:1605.06431.

Architectural descendants:

  • Huang, G., Liu, Z., van der Maaten, L., & Weinberger, K. Q. (2017). "Densely Connected Convolutional Networks." CVPR. arXiv:1608.06993.
  • Vaswani, A. et al. (2017). "Attention Is All You Need." NeurIPS. arXiv:1706.03762. Residual + LayerNorm wraps every transformer sublayer.

Standard textbook:

  • Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press. Sections 8.7 and 9.10.

Last reviewed: May 5, 2026