ML Methods
Convolutional Neural Networks
How weight sharing and local connectivity exploit spatial structure: convolution as cross-correlation, translation equivariance, pooling for approximate invariance, and the conv-pool-fc architecture.
Why This Matters
CNNs were the first deep learning architecture to achieve superhuman performance on a major benchmark (ImageNet, 2015). Their design encodes a powerful inductive bias: spatial structure matters, and the same pattern can appear anywhere in the image. Understanding why CNNs work (weight sharing, equivariance, local connectivity) gives you the template for designing architectures that exploit other symmetries.
Mental Model
A fully connected layer treats each input pixel as an independent feature with its own weight. A convolutional layer instead slides a small filter across the input, applying the same weights everywhere. This has two consequences: far fewer parameters (weight sharing), and the output shifts when the input shifts (translation equivariance). Pooling then coarsens the representation, providing approximate translation invariance.
The Convolution Operation
Discrete Convolution (Cross-Correlation)
In deep learning, "convolution" is technically cross-correlation. For a 2D input $I$ and kernel $K$ of size $k \times k$:

$$(I \star K)(i, j) = \sum_{m=0}^{k-1} \sum_{n=0}^{k-1} I(i + m,\, j + n)\, K(m, n)$$

True convolution flips the kernel (indexing $I(i - m,\, j - n)$ instead), but since we learn the kernel, flipping is irrelevant: the optimizer simply learns the flipped version. The community uses "convolution" to mean cross-correlation. For large kernels, convolutions can be computed efficiently in the frequency domain using the fast Fourier transform.
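The definition above can be sketched in a few lines of pure Python. This is a minimal illustration with "valid" padding (no framework assumed; `conv2d` and the example arrays are names chosen here for illustration):

```python
# Minimal 2D cross-correlation ("convolution" in the deep learning sense).
# Note the kernel is NOT flipped; a true convolution would instead index
# kernel[kh - 1 - m][kw - 1 - n].
def conv2d(image, kernel):
    h, w = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    out_h, out_w = h - kh + 1, w - kw + 1  # "valid" padding: no border handling
    out = [[0] * out_w for _ in range(out_h)]
    for i in range(out_h):
        for j in range(out_w):
            out[i][j] = sum(
                image[i + m][j + n] * kernel[m][n]
                for m in range(kh) for n in range(kw)
            )
    return out

# A small vertical-edge detector applied to a tiny image:
img = [[0, 0, 1, 1],
       [0, 0, 1, 1],
       [0, 0, 1, 1]]
edge_kernel = [[-1, 1],
               [-1, 1]]
print(conv2d(img, edge_kernel))  # [[0, 2, 0], [0, 2, 0]] -- fires at the 0->1 edge
```

The kernel responds wherever the 0-to-1 transition occurs, and nowhere else: the same weights detect the pattern at every position.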
Feature Map
Applying a kernel $K$ to input $I$ produces a feature map (or activation map). A convolutional layer has $C_{\text{out}}$ kernels, each of size $k \times k \times C_{\text{in}}$, producing $C_{\text{out}}$ feature maps. Each kernel detects a different pattern (edge, texture, shape) across all spatial locations.
Weight Sharing and Parameter Counting
Weight Sharing
In a convolutional layer, the same kernel is applied at every spatial position. This is weight sharing: all positions share the same parameters.
For a layer with $C_{\text{in}}$ input channels, $C_{\text{out}}$ output channels, and kernel size $k \times k$, the parameter count is:

$$C_{\text{out}} \left( k \cdot k \cdot C_{\text{in}} + 1 \right)$$

where the $+1$ accounts for the bias per output channel. Compare this to a fully connected layer mapping the same spatial dimensions, which would have $(H \cdot W \cdot C_{\text{in}}) \cdot (H \cdot W \cdot C_{\text{out}})$ weights, orders of magnitude more.
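The two counting formulas are easy to compare directly. A short sketch (the function names and the example sizes are illustrative, not from a specific model):

```python
def conv_params(c_in, c_out, k):
    # one k x k x c_in filter plus a bias, per output channel
    return c_out * (k * k * c_in + 1)

def fc_params(n_in, n_out):
    # full weight matrix plus one bias per output
    return n_in * n_out + n_out

# Example: a 3-channel 32x32 input mapped to 16 channels with 3x3 kernels
print(conv_params(3, 16, 3))               # 448 parameters
print(fc_params(32 * 32 * 3, 32 * 32 * 16))  # 50,348,032 parameters
```

The convolutional count is independent of the spatial size of the input; the fully connected count grows with the square of it.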
Translation Equivariance
Translation Equivariance of Convolution
Statement
Let $T_t$ denote translation by $t$: $(T_t I)(x) = I(x - t)$. Then convolution commutes with translation:

$$(T_t I) \star K = T_t (I \star K)$$

That is, shifting the input and then convolving gives the same result as convolving and then shifting the output.
Intuition
If a cat detector fires at position $x$, moving the cat by $t$ moves the detection to $x + t$. The network does not need to relearn the cat detector for every position: one set of weights works everywhere.
Proof Sketch
$\big((T_t I) \star K\big)(x) = \sum_m (T_t I)(x + m)\, K(m) = \sum_m I(x + m - t)\, K(m) = (I \star K)(x - t) = \big(T_t (I \star K)\big)(x)$. The sum is over the kernel indices, and the shift passes through linearly.
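The identity can also be checked numerically. A 1D sketch using circular shifts (circular boundaries sidestep the padding caveat noted below; all names here are illustrative):

```python
# Numerical check of translation equivariance for 1D cross-correlation.
def shift(x, t):
    # circular translation by t positions
    return x[-t:] + x[:-t]

def conv1d_circ(x, k):
    # cross-correlation with circular (wrap-around) boundary
    n, m = len(x), len(k)
    return [sum(x[(i + j) % n] * k[j] for j in range(m)) for i in range(n)]

x = [0, 1, 3, 2, 0, 0, 1, 0]
k = [1, -2, 1]
t = 3
lhs = conv1d_circ(shift(x, t), k)  # shift, then convolve
rhs = shift(conv1d_circ(x, k), t)  # convolve, then shift
print(lhs == rhs)  # True: the two orders agree exactly
```

With stride 1 and circular boundaries the equality is exact; replacing the circular boundary with zero padding breaks it near the edges, as the failure mode below describes.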
Why It Matters
Equivariance is the key inductive bias of CNNs. It means the network automatically generalizes across spatial positions without needing to see examples at every location. This is why CNNs need far less data than fully connected networks for image tasks.
Failure Mode
Equivariance is exact for continuous convolution but only approximate for discrete convolution with striding or padding. Stride-2 convolutions are equivariant only to even shifts. Boundary effects from padding also break exact equivariance at the edges.
Pooling
Pooling Layers
Pooling reduces spatial dimensions by summarizing local regions:
- Max pooling: $y = \max_{(i,j) \in R}\, x_{ij}$, takes the maximum activation in each region $R$
- Average pooling: $y = \frac{1}{|R|} \sum_{(i,j) \in R} x_{ij}$, takes the mean
Pooling provides approximate translation invariance: small shifts in the input do not change the pooled output (as long as the maximum stays within the same pooling window). Note: pooling gives invariance, while convolution gives equivariance; these are different properties.
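The "maximum stays within the same window" condition is concrete enough to demonstrate. A 1D sketch (window size 2, stride 2; the names are illustrative):

```python
# Max pooling over non-overlapping windows, illustrating approximate
# translation invariance: a 1-position shift that keeps each peak inside
# its pooling window leaves the pooled output unchanged.
def max_pool1d(x, size=2):
    return [max(x[i:i + size]) for i in range(0, len(x), size)]

a = [0, 5, 0, 0, 3, 0, 0, 0]
b = [5, 0, 0, 0, 0, 3, 0, 0]  # each peak shifted by one position
print(max_pool1d(a))  # [5, 0, 3, 0]
print(max_pool1d(b))  # [5, 0, 3, 0] -- identical pooled output
```

Shift a peak across a window boundary, however, and the pooled output changes: the invariance is only approximate.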
Receptive Field
Receptive Field
The receptive field of a neuron is the region of the input that can influence its value. In a CNN with $L$ layers of kernel size $k \times k$ and stride 1:

$$\text{receptive field} = L(k - 1) + 1$$

Stacking small kernels (e.g., two $3 \times 3$ layers have receptive field 5) is preferred over one large $5 \times 5$ kernel because it uses fewer parameters and adds more nonlinearity.
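The growth formula reduces to a simple accumulation, shown here for mixed kernel sizes at stride 1 (the function name is illustrative):

```python
# Receptive field of stacked stride-1 convolutions: each k x k layer
# extends the field by (k - 1), so L layers of size k give L*(k - 1) + 1.
def receptive_field(kernel_sizes):
    rf = 1
    for k in kernel_sizes:
        rf += k - 1
    return rf

print(receptive_field([3, 3]))     # 5: two 3x3 layers see as far as one 5x5
print(receptive_field([3, 3, 3]))  # 7: three 3x3 layers match one 7x7
```

Two $3 \times 3$ layers use $2 \cdot 9 = 18$ weights per channel pair against 25 for a single $5 \times 5$ kernel, while inserting an extra nonlinearity between them.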
The Conv-Pool-FC Architecture
The classical CNN architecture follows a pattern:
- Convolutional blocks: alternating conv layers (with ReLU) and pooling layers. Spatial dimensions decrease while channel depth increases.
- Flatten: reshape the final feature maps into a 1D vector.
- Fully connected layers: one or more dense layers for classification.
Modern architectures (ResNet, etc.) add skip connections and batch normalization, but the core conv-pool structure remains.
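The shape bookkeeping through such a pipeline can be traced by hand. A sketch for a hypothetical small network on a $32 \times 32 \times 3$ input, assuming "same" padding for convolutions and $2 \times 2$ stride-2 pooling (a common textbook configuration, not any specific published model):

```python
# Trace (height, width, channels) through a conv-pool-conv-pool-flatten stack.
def conv_same(h, w, c_out):
    return h, w, c_out        # "same" padding preserves spatial size

def pool2x2(h, w, c):
    return h // 2, w // 2, c  # 2x2 pooling, stride 2: halves each spatial dim

shape = (32, 32, 3)
shape = conv_same(*shape[:2], 16); shape = pool2x2(*shape)  # -> (16, 16, 16)
shape = conv_same(*shape[:2], 32); shape = pool2x2(*shape)  # -> (8, 8, 32)
flat = shape[0] * shape[1] * shape[2]
print(shape, flat)  # (8, 8, 32) 2048 -- the vector fed to the dense layers
```

Note the characteristic trade: spatial resolution falls by 4x per block while channel depth doubles, concentrating information before the fully connected classifier.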
Why CNNs Work for Images
CNNs succeed on images because of two structural assumptions about natural images that are approximately true:
- Locality: nearby pixels are more related than distant ones. Small kernels exploit this by looking at local neighborhoods.
- Translation symmetry: the same pattern (edge, texture, object part) can appear anywhere. Weight sharing exploits this.
These are inductive biases: assumptions baked into the architecture. When the assumptions match the data (images, audio spectrograms), CNNs dominate. When they do not (tabular data, graphs), CNNs offer no advantage. For graph-structured data, graph neural networks generalize the convolution idea to irregular topologies.
Connection to Group Equivariant Convolutions
Standard CNNs are equivariant to translations but not to rotations or reflections. Group equivariant CNNs generalize by replacing the translation group with a larger symmetry group $G$:

$$(f \star \psi)(g) = \sum_{h \in G} f(h)\, \psi(g^{-1} h)$$

This gives equivariance to the full group (e.g., rotations, reflections) by construction. Standard CNNs are the special case where $G$ is the translation group $\mathbb{Z}^2$.
Canonical Examples
Parameter count: conv vs. fully connected
Consider a $32 \times 32 \times 3$ input (an RGB image). A conv layer with 16 filters of size $5 \times 5$ has $16 \times (5 \times 5 \times 3 + 1) = 1{,}216$ parameters. A fully connected layer mapping the same input to 16 outputs would have $3072 \times 16 + 16 = 49{,}168$ parameters, 40 times more, and without translation equivariance.
Common Confusions
Convolution in deep learning is actually cross-correlation
Mathematical convolution flips the kernel before sliding; deep learning convolution does not. Since the kernel is learned, this distinction is irrelevant in practice. But it matters if you compare CNN operations to signal processing formulas.
Equivariance is not invariance
Convolution is translation equivariant: the output shifts with the input. Pooling provides approximate translation invariance: the output stays the same under small shifts. A classifier needs invariance (the label should not change); the internal representations should be equivariant (preserving spatial information until the final layers).
Summary
- CNN "convolution" is technically cross-correlation: the kernel is not flipped
- Weight sharing means the same kernel is applied at every position, drastically reducing parameters
- Convolution is translation equivariant: $(T_t I) \star K = T_t (I \star K)$
- Pooling provides approximate translation invariance
- Receptive field grows with depth: stacking small kernels is more efficient than large kernels
- CNNs encode the inductive bias that images have local structure and translation symmetry
Exercises
Problem
A convolutional layer has 64 input channels, 128 output channels, and $3 \times 3$ kernels. How many parameters does it have (including biases)?
Problem
Prove that a stride-2 convolution is equivariant to translations by even numbers but not by odd numbers. What does this imply about the information lost by striding?
References
Canonical:
- LeCun et al., "Gradient-Based Learning Applied to Document Recognition" (1998)
- Goodfellow, Bengio, Courville, Deep Learning (2016), Chapter 9
Current:
- Cohen & Welling, "Group Equivariant Convolutional Networks" (2016)
- Bishop, Pattern Recognition and Machine Learning (2006)
- Murphy, Machine Learning: A Probabilistic Perspective (2012)
Next Topics
The natural next steps from CNNs:
- Recurrent neural networks: handling sequential instead of spatial data
- Transformers: attention-based architectures that replace convolution
- Group equivariant networks: generalizing the equivariance principle
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Feedforward Networks and Backpropagation (Layer 2)
- Differentiation in Rn (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
- Matrix Calculus (Layer 1)
- The Jacobian Matrix (Layer 0A)
- The Hessian Matrix (Layer 0A)
- Activation Functions (Layer 1)
- Convex Optimization Basics (Layer 1)
- Matrix Operations and Properties (Layer 0A)
- Vectors, Matrices, and Linear Maps (Layer 0A)
Builds on This
- AlexNet and Deep Learning History (Layer 3)
- Equivariant Deep Learning (Layer 4)
- Graph Neural Networks (Layer 3)
- Object Detection and Segmentation (Layer 3)
- Vision Transformer Lineage (Layer 4)