
ML Methods

Convolutional Neural Networks

How weight sharing and local connectivity exploit spatial structure: convolution as cross-correlation, translation equivariance, pooling for approximate invariance, and the conv-pool-fc architecture.


Why This Matters

Figure: feature hierarchy from simple to complex. As receptive fields grow with depth, layer 1 detects edges and gradients, layers 2-3 textures and patterns, layers 4-6 parts and shapes, and layer 7+ objects and concepts, while channel depth increases (3 → 64 → 256 → 512).

CNNs were the first deep learning architecture to achieve superhuman performance on a major benchmark (ImageNet, 2015). Their design encodes a powerful inductive bias: spatial structure matters, and the same pattern can appear anywhere in the image. Understanding why CNNs work (weight sharing, equivariance, local connectivity) gives you the template for designing architectures that exploit other symmetries.

Mental Model

A fully connected layer treats each input pixel as an independent feature with its own weight. A convolutional layer instead slides a small filter across the input, applying the same weights everywhere. This has two consequences: far fewer parameters (weight sharing), and the output shifts when the input shifts (translation equivariance). Pooling then coarsens the representation, providing approximate translation invariance.

The Convolution Operation

Definition

Discrete Convolution (Cross-Correlation)

In deep learning, "convolution" is technically cross-correlation. For a 2D input $f$ and kernel $k$ of size $m \times m$:

$$(f \star k)[i, j] = \sum_{a=0}^{m-1} \sum_{b=0}^{m-1} f[i+a, j+b] \cdot k[a, b]$$

True convolution flips the kernel (using $k[m-1-a, m-1-b]$), but since we learn the kernel, flipping is irrelevant: the optimizer simply learns the flipped version. The community uses "convolution" to mean cross-correlation. For large kernels, convolutions can be computed efficiently in the frequency domain using the fast Fourier transform.
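To make the flip distinction concrete, here is a minimal numpy sketch (naive loops, valid mode; `cross_correlate2d` is a name chosen for illustration). True convolution is just cross-correlation with the kernel flipped along both axes:

```python
import numpy as np

def cross_correlate2d(f, k):
    """Valid-mode 2D cross-correlation: slide k over f without flipping."""
    m, n = k.shape
    H, W = f.shape
    out = np.zeros((H - m + 1, W - n + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(f[i:i + m, j:j + n] * k)
    return out

f = np.arange(16.0).reshape(4, 4)
k = np.array([[1.0, 0.0], [0.0, -1.0]])

corr = cross_correlate2d(f, k)               # cross-correlation
conv = cross_correlate2d(f, k[::-1, ::-1])   # true convolution: flip both axes first
print(corr.shape)  # (3, 3): valid mode shrinks each side by m - 1
```

Because the learned kernel can absorb the flip, a network trained with either operation can represent the same functions.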

Definition

Feature Map

Applying a kernel $k$ to input $f$ produces a feature map (or activation map). A convolutional layer has $C_{\text{out}}$ kernels, each of size $C_{\text{in}} \times m \times m$, producing $C_{\text{out}}$ feature maps. Each kernel detects a different pattern (edge, texture, shape) across all spatial locations.

Weight Sharing and Parameter Counting

Definition

Weight Sharing

In a convolutional layer, the same kernel is applied at every spatial position. This is weight sharing: all positions share the same parameters.

For a layer with $C_{\text{in}}$ input channels, $C_{\text{out}}$ output channels, and kernel size $m \times m$, the parameter count is:

$$C_{\text{out}} \times (C_{\text{in}} \times m \times m + 1)$$

where the $+1$ accounts for the bias per output channel. Compare this to a fully connected layer mapping the same spatial dimensions, which would have $C_{\text{out}} \times H \times W \times C_{\text{in}} \times H \times W$ parameters: orders of magnitude more.
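The gap is easy to quantify. A small sketch of both counts (helper names are illustrative; the fully connected count here also includes one bias per output unit):

```python
def conv_params(c_in, c_out, m):
    """Conv layer: c_out kernels of size c_in * m * m, plus one bias per output channel."""
    return c_out * (c_in * m * m + 1)

def fc_params(c_in, c_out, h, w):
    """Fully connected layer mapping (c_in, h, w) to (c_out, h, w), with biases."""
    return (c_out * h * w) * (c_in * h * w) + c_out * h * w

# 3x3 conv, 64 -> 128 channels: under 100K parameters.
print(conv_params(64, 128, 3))       # 73856
# The equivalent fully connected map on 32x32 feature maps: billions.
print(fc_params(64, 128, 32, 32))
```

The conv count is independent of the spatial size $H \times W$; the fully connected count grows with its square.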

Translation Equivariance

Proposition

Translation Equivariance of Convolution

Statement

Let $T_\tau$ denote translation by $\tau$: $(T_\tau f)[i,j] = f[i - \tau_1, j - \tau_2]$. Then convolution commutes with translation:

$$T_\tau(f \star k) = (T_\tau f) \star k$$

That is, shifting the input and then convolving gives the same result as convolving and then shifting the output.

Intuition

If a cat detector fires at position $(i, j)$, moving the cat to $(i+5, j+3)$ moves the detection to $(i+5, j+3)$. The network does not need to relearn the cat detector for every position: one set of weights works everywhere.

Proof Sketch

$(T_\tau f \star k)[i,j] = \sum_{a,b} f[i+a-\tau_1, j+b-\tau_2] \cdot k[a,b] = (f \star k)[i - \tau_1, j - \tau_2] = T_\tau(f \star k)[i,j]$. The sum is over the kernel indices, and the shift passes through linearly.

Why It Matters

Equivariance is the key inductive bias of CNNs. It means the network automatically generalizes across spatial positions without needing to see examples at every location. This is why CNNs need far less data than fully connected networks for image tasks.

Failure Mode

Equivariance is exact for continuous convolution but only approximate for discrete convolution with striding or padding. Stride-2 convolutions are equivariant only to even shifts. Boundary effects from padding also break exact equivariance at the edges.
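The proposition can be checked numerically. This sketch uses circular (periodic-boundary) cross-correlation so that equivariance holds exactly, avoiding the edge effects noted above; `circ_corr2d` is an illustrative helper, and `np.roll` implements the translation $T_\tau$:

```python
import numpy as np

def circ_corr2d(f, k):
    """Circular 2D cross-correlation: periodic boundaries make equivariance exact."""
    H, W = f.shape
    m, n = k.shape
    out = np.zeros((H, W))
    for i in range(H):
        for j in range(W):
            for a in range(m):
                for b in range(n):
                    out[i, j] += f[(i + a) % H, (j + b) % W] * k[a, b]
    return out

rng = np.random.default_rng(0)
f = rng.standard_normal((8, 8))
k = rng.standard_normal((3, 3))
shift = (2, 5)

# T_tau(f * k) vs (T_tau f) * k -- shift-then-convolve equals convolve-then-shift.
lhs = np.roll(circ_corr2d(f, k), shift, axis=(0, 1))
rhs = circ_corr2d(np.roll(f, shift, axis=(0, 1)), k)
print(np.allclose(lhs, rhs))  # True
```

Replacing the circular indexing with zero padding makes the two sides differ near the border, which is the failure mode described above.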

Pooling

Definition

Pooling Layers

Pooling reduces spatial dimensions by summarizing local regions:

  • Max pooling: $\text{pool}(R) = \max_{(i,j) \in R} f[i,j]$ takes the maximum activation in each region
  • Average pooling: $\text{pool}(R) = \frac{1}{|R|}\sum_{(i,j) \in R} f[i,j]$ takes the mean

Pooling provides approximate translation invariance: small shifts in the input do not change the pooled output (as long as the maximum stays within the same pooling window). Note: pooling gives invariance, while convolution gives equivariance; these are different properties.
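A small demonstration of this "within the same window" condition, assuming non-overlapping 2x2 max pooling (the `max_pool2d` helper is illustrative):

```python
import numpy as np

def max_pool2d(f, size=2):
    """Non-overlapping max pooling over size x size windows."""
    H, W = f.shape
    return f[:H - H % size, :W - W % size].reshape(
        H // size, size, W // size, size).max(axis=(1, 3))

# One strong activation, then the same activation shifted by one pixel.
f = np.zeros((4, 4))
f[0, 0] = 1.0
g = np.zeros((4, 4))
g[1, 1] = 1.0  # shifted, but still inside the top-left 2x2 pooling window

print(np.array_equal(max_pool2d(f), max_pool2d(g)))  # True: pooled output unchanged
```

Shifting the activation one more pixel (into a neighboring window) would change the pooled output, which is why the invariance is only approximate and local.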

Receptive Field

Definition

Receptive Field

The receptive field of a neuron is the region of the input that can influence its value. In a CNN with $L$ layers of kernel size $m$ and stride 1:

$$\text{receptive field size} = L(m - 1) + 1$$

Stacking small kernels (e.g., two $3 \times 3$ layers have a receptive field of 5) is preferred over one large kernel ($5 \times 5$) because it uses fewer parameters and adds more nonlinearity.
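The trade-off can be checked directly. A sketch comparing receptive field and weight count for the two options, assuming equal input and output channels and ignoring biases (helper names are illustrative):

```python
def receptive_field(num_layers, kernel_size):
    """Receptive field of stacked stride-1 conv layers: L * (m - 1) + 1."""
    return num_layers * (kernel_size - 1) + 1

def stack_weights(num_layers, kernel_size, channels):
    """Weights in a stack of conv layers with `channels` in and out (biases ignored)."""
    return num_layers * channels * channels * kernel_size ** 2

# Two 3x3 layers match one 5x5 layer's receptive field...
print(receptive_field(2, 3), receptive_field(1, 5))    # 5 5
# ...with fewer weights at 64 channels, plus an extra nonlinearity in between.
print(stack_weights(2, 3, 64), stack_weights(1, 5, 64))  # 73728 102400
```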

The Conv-Pool-FC Architecture

The classical CNN architecture follows a pattern:

  1. Convolutional blocks: alternating conv layers (with ReLU) and pooling layers. Spatial dimensions decrease while channel depth increases.
  2. Flatten: reshape the final feature maps into a 1D vector.
  3. Fully connected layers: one or more dense layers for classification.

Modern architectures (ResNet, etc.) add skip connections and batch normalization, but the core conv-pool structure remains.
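To see how spatial size shrinks while channel depth grows, here is a shape-tracking sketch of a hypothetical three-block conv-pool network on a $32 \times 32 \times 3$ input (the channel counts 16/32/64 are illustrative, not from a specific architecture):

```python
def conv_out(size, kernel, stride=1, pad=0):
    """Spatial size after a convolution: (size + 2*pad - kernel) // stride + 1."""
    return (size + 2 * pad - kernel) // stride + 1

size, channels = 32, 3
for out_channels in (16, 32, 64):              # three conv-pool blocks
    size = conv_out(size, kernel=3, pad=1)     # 3x3 conv, 'same' padding: size unchanged
    size = conv_out(size, kernel=2, stride=2)  # 2x2 max pool: size halved
    channels = out_channels
    print(f"block -> {channels} x {size} x {size}")

flat = channels * size * size
print(f"flatten -> {flat} features")  # fed into the fully connected classifier
```

The pattern (spatial resolution down, channels up) trades positional precision for richer features, matching the hierarchy in the figure at the top.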

Why CNNs Work for Images

CNNs succeed on images because of two structural assumptions about natural images that are approximately true:

  1. Locality: nearby pixels are more related than distant ones. Small kernels exploit this by looking at local neighborhoods.
  2. Translation symmetry: the same pattern (edge, texture, object part) can appear anywhere. Weight sharing exploits this.

These are inductive biases: assumptions baked into the architecture. When the assumptions match the data (images, audio spectrograms), CNNs dominate. When they do not (tabular data, graphs), CNNs offer no advantage. For graph-structured data, graph neural networks generalize the convolution idea to irregular topologies.

Connection to Group Equivariant Convolutions

Standard CNNs are equivariant to translations but not to rotations or reflections. Group equivariant CNNs generalize by replacing the translation group with a larger symmetry group $G$:

$$(f \star_G k)(g) = \sum_{h \in G} f(h) \cdot k(g^{-1}h)$$

This gives equivariance to the full group $G$ (e.g., rotations, reflections) by construction. Standard CNNs are the special case where $G$ is the translation group $\mathbb{Z}^2$.
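The formula can be made concrete with the smallest interesting case: $G = \mathbb{Z}_n$, the cyclic group under addition mod $n$, where $g^{-1}h = (h - g) \bmod n$. This is a sketch, not a full group-equivariant layer (which would also lift the input to functions on $G$):

```python
import numpy as np

def group_corr(f, k, n):
    """Group cross-correlation on Z_n: (f *_G k)(g) = sum_h f(h) * k((h - g) mod n)."""
    return np.array([sum(f[h] * k[(h - g) % n] for h in range(n))
                     for g in range(n)])

n = 6
rng = np.random.default_rng(1)
f = rng.standard_normal(n)
k = rng.standard_normal(n)

# Equivariance by construction: acting on f by a group element shifts the output.
s = 2
lhs = np.roll(group_corr(f, k, n), s)
rhs = group_corr(np.roll(f, s), k, n)
print(np.allclose(lhs, rhs))  # True
```

Swapping $\mathbb{Z}_n$ for a rotation group gives rotation equivariance by the same argument.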

Canonical Examples

Example

Parameter count: conv vs. fully connected

Consider a $32 \times 32 \times 3$ input (RGB image). A conv layer with 16 filters of size $5 \times 5$ has $16 \times (3 \times 25 + 1) = 1{,}216$ parameters. A fully connected layer mapping the same input to 16 outputs would have $16 \times (32 \times 32 \times 3) + 16 = 49{,}168$ parameters: 40 times more, and without translation equivariance.

Common Confusions

Watch Out

Convolution in deep learning is actually cross-correlation

Mathematical convolution flips the kernel before sliding; deep learning convolution does not. Since the kernel is learned, this distinction is irrelevant in practice. But it matters if you compare CNN operations to signal processing formulas.

Watch Out

Equivariance is not invariance

Convolution is translation equivariant: the output shifts with the input. Pooling provides approximate translation invariance: the output stays the same under small shifts. A classifier needs invariance (the label should not change); the internal representations should be equivariant (preserving spatial information until the final layers).

Summary

  • CNN "convolution" is technically cross-correlation: the kernel is not flipped
  • Weight sharing means the same kernel is applied at every position, drastically reducing parameters
  • Convolution is translation equivariant: $T_\tau(f \star k) = (T_\tau f) \star k$
  • Pooling provides approximate translation invariance
  • Receptive field grows with depth: stacking small kernels is more efficient than large kernels
  • CNNs encode the inductive bias that images have local structure and translation symmetry

Exercises

ExerciseCore

Problem

A convolutional layer has 64 input channels, 128 output channels, and $3 \times 3$ kernels. How many parameters does it have (including biases)?

ExerciseAdvanced

Problem

Prove that a stride-2 convolution is equivariant to translations by even numbers but not by odd numbers. What does this imply about the information lost by striding?


References

Canonical:

  • LeCun et al., "Gradient-Based Learning Applied to Document Recognition" (1998)
  • Goodfellow, Bengio, Courville, Deep Learning (2016), Chapter 9

Current:

  • Cohen & Welling, "Group Equivariant Convolutional Networks" (2016)

  • Bishop, Pattern Recognition and Machine Learning (2006), Chapters 1-14

  • Murphy, Machine Learning: A Probabilistic Perspective (2012), Chapters 1-28

Next Topics

The natural next steps from CNNs:

  • Recurrent neural networks: handling sequential instead of spatial data
  • Transformers: attention-based architectures that replace convolution
  • Group equivariant networks: generalizing the equivariance principle

Last reviewed: April 2026
