AlexNet and Deep Learning History
AlexNet (2012) proved deep CNNs work at scale on real vision tasks, reigniting deep learning. Key innovations: GPU training, ReLU, dropout, data augmentation. Traces the path from AlexNet through VGGNet, GoogLeNet, and ResNet to vision transformers.
Why This Matters
Before 2012, the computer vision community was dominated by hand-engineered features (SIFT, HOG) fed into shallow classifiers (SVMs). Neural networks existed but were considered impractical for large-scale vision. AlexNet changed this by winning the ImageNet Large Scale Visual Recognition Challenge (ILSVRC 2012) with a top-5 error rate of 15.3%, compared to 26.2% for the second-place entry using traditional methods. This 10.8 percentage point gap was unprecedented.
Understanding AlexNet matters not for the specific architecture (which is now obsolete) but for the design principles it validated: depth, scale, GPU computation, and specific regularization techniques.
The ImageNet Context
ImageNet (Deng et al. 2009) contains 1.2 million training images across 1000 classes. Before AlexNet, the annual ILSVRC winners improved by roughly 1-2 percentage points per year using feature engineering. AlexNet showed that a single architecture change could deliver a decade of incremental progress in one step.
AlexNet Architecture (Krizhevsky, Sutskever, Hinton, 2012)
The network has 5 convolutional layers followed by 3 fully connected layers, totaling approximately 60 million parameters.
AlexNet Design Choices
The key departures from prior CNNs:
- ReLU activation instead of sigmoid or tanh: trains roughly 6x faster because the gradient does not saturate for positive inputs.
- GPU training: split across two GTX 580 GPUs (3GB each). This was an engineering innovation as much as a scientific one.
- Dropout with p = 0.5 in the fully connected layers.
- Data augmentation: random crops, horizontal flips, PCA-based color perturbation.
- Local response normalization (LRN): lateral inhibition across feature maps. Later shown to be unnecessary.
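The crop-and-flip part of the augmentation pipeline can be sketched in a few lines. This is an illustrative sketch, not AlexNet's original code; the PCA-based color perturbation is omitted for brevity, and the 256→224 sizes follow the paper's input pipeline.

```python
import numpy as np

def augment(image, crop_size=224, rng=None):
    """AlexNet-style augmentation sketch: random crop plus random
    horizontal flip (PCA color jitter omitted for brevity)."""
    if rng is None:
        rng = np.random.default_rng(0)
    h, w, _ = image.shape
    top = rng.integers(0, h - crop_size + 1)
    left = rng.integers(0, w - crop_size + 1)
    crop = image[top:top + crop_size, left:left + crop_size]
    if rng.random() < 0.5:           # flip with probability 1/2
        crop = crop[:, ::-1]
    return crop

img = np.zeros((256, 256, 3))        # AlexNet rescaled images to 256x256
out = augment(img)
print(out.shape)                     # (224, 224, 3)
```

Because a 256x256 image admits 32x32 distinct crop positions plus a flip, this cheap trick multiplies the effective dataset size by roughly 2048.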
Why ReLU Mattered
ReLU Enables Deeper Training
Statement
For a sigmoid network with L layers, the gradient of the loss with respect to weights in layer k scales as a product of roughly L − k sigmoid derivatives, each at most 1/4, since σ'(z) = σ(z)(1 − σ(z)) ≤ 1/4 for the logistic sigmoid. For L = 10 and k = 1, this gives a factor of roughly (1/4)^9 ≈ 4 × 10⁻⁶.
For ReLU, f'(z) = 1 when z > 0, so gradients do not shrink multiplicatively through layers (though they can vanish if many units are in the inactive z < 0 regime).
Intuition
Sigmoid squashes all inputs to (0, 1), and its derivative peaks at 0.25. Stacking many sigmoid layers multiplies many numbers less than 1, shrinking gradients exponentially. ReLU passes gradients through unchanged for positive inputs, removing this multiplicative decay.
Proof Sketch
By the chain rule, the gradient at layer k contains the product f'(z_k) · f'(z_{k+1}) ⋯ f'(z_L) of activation derivatives from layer k to the output. For sigmoid, each factor is at most 1/4, giving geometric decay. For ReLU, each factor is exactly 1 on active units, so active paths have no decay.
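The geometric decay is easy to see numerically. A toy calculation under the best-case assumption for sigmoid (every pre-activation sits at the point of maximum slope, 0.25) and for ReLU (every unit active, slope 1):

```python
# Gradient attenuation through L = 10 layers. Best case for sigmoid:
# every derivative equals its maximum value 0.25. Best case for ReLU:
# every unit is active, so every derivative equals 1. Toy numbers,
# not a trained network.
L = 10
sigmoid_factor = 0.25 ** (L - 1)   # one derivative per layer crossed
relu_factor = 1.0 ** (L - 1)

print(f"sigmoid: {sigmoid_factor:.2e}")   # 3.81e-06
print(f"relu:    {relu_factor}")          # 1.0
```

Even in this best case, a 10-layer sigmoid network attenuates gradients by a factor of about 4 × 10⁻⁶; in practice most units sit away from the maximum-slope point, so the real attenuation is worse.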
Why It Matters
This gradient flow analysis explains why networks before AlexNet were limited to 2-3 layers in practice. ReLU (along with careful initialization) enabled training 8-layer networks. Later innovations like batch normalization and skip connections pushed this further to hundreds of layers.
Failure Mode
ReLU has the "dying ReLU" problem: if a unit's pre-activation is always negative, it outputs zero and receives zero gradient, permanently deactivating. Variants like Leaky ReLU (f(z) = max(αz, z) with small α > 0) and GELU address this.
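The failure mode is visible directly in the gradients. A minimal sketch comparing ReLU and Leaky ReLU via a numerical derivative (α = 0.01 is a common default, not prescribed by the text):

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def leaky_relu(z, alpha=0.01):      # alpha = 0.01 is a common default
    return np.where(z > 0, z, alpha * z)

def grad(f, z, eps=1e-6):           # central-difference numerical derivative
    return (f(z + eps) - f(z - eps)) / (2 * eps)

z = np.array([-2.0, -0.5, 0.5, 2.0])
print(grad(relu, z))        # ≈ [0, 0, 1, 1]: negative inputs get no gradient
print(grad(leaky_relu, z))  # ≈ [0.01, 0.01, 1, 1]: a small gradient survives
```

A ReLU unit whose inputs are all negative sits in the zero-gradient region and can never recover; the leaky variant always passes at least α of the gradient through.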
What Followed AlexNet
Each subsequent ImageNet winner addressed a specific limitation:
VGGNet (Simonyan and Zisserman, 2014): Showed that depth matters. Used only 3×3 convolutions, stacked to 16-19 layers. A stack of two 3×3 convolutions has the same receptive field as one 5×5 convolution but fewer parameters and more nonlinearities.
GoogLeNet/Inception (Szegedy et al. 2015): Introduced inception modules: parallel convolutions at multiple scales (1×1, 3×3, 5×5) concatenated together. Used 1×1 convolutions for dimensionality reduction. 22 layers, but only 5M parameters (vs. VGG's 138M).
ResNet (He et al. 2016): Skip connections allow training 152+ layer networks. The key insight: learning a residual F(x) = H(x) − x is easier than learning H(x) directly when the optimal mapping H is close to the identity. Won ILSVRC 2015 with 3.57% top-5 error, surpassing human performance (estimated at 5.1%).
What AlexNet Taught the Field
The lasting impact of AlexNet was not any single architectural innovation but the demonstration of a general principle: scale and compute trump clever feature engineering. This principle, later articulated as the Bitter Lesson (Sutton, 2019), drove the next decade of deep learning research.
The pre-AlexNet paradigm
The 2011 ILSVRC winner used the following pipeline:
- Extract dense SIFT descriptors at multiple scales
- Encode descriptors using Fisher vectors (a higher-order extension of bag-of-visual-words)
- Apply spatial pyramids to capture layout information
- Classify with a linear SVM trained on the encoded features
Each step represented years of computer vision research. The entire pipeline was hand-designed and required deep domain expertise. AlexNet replaced all of steps 1-3 with learned convolutional features and still won by a 10.8 percentage point margin. The message was clear: end-to-end learning from raw pixels outperforms hand-crafted features when data and compute are sufficient.
Three specific lessons from AlexNet persisted:
Activation functions matter for depth. ReLU was not novel (Nair and Hinton, 2010), but AlexNet showed its practical importance at scale. The shift from saturating activations (sigmoid, tanh) to non-saturating activations (ReLU) enabled training networks deeper than 5 layers. Later work on activation functions (PReLU, ELU, GELU, SiLU) continued this line, but the core insight was established by AlexNet.
Regularization is not optional. AlexNet used dropout (p = 0.5) in the fully connected layers, random crops, horizontal flips, and PCA-based color jittering. Without these, the 60M-parameter model would have memorized the 1.2M training images. Every subsequent architecture paper includes a regularization section that traces back to AlexNet's choices.
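A dropout layer is a few lines of masking. This sketch uses inverted dropout, the modern convention that scales at training time; AlexNet itself instead multiplied activations by 0.5 at test time, which is mathematically equivalent in expectation:

```python
import numpy as np

def dropout(a, p=0.5, train=True, rng=None):
    """Inverted-dropout sketch with AlexNet's p = 0.5. Each unit is
    zeroed with probability p at training time; scaling survivors by
    1/(1-p) keeps the expected activation unchanged, so inference
    simply returns a unmodified."""
    if not train:
        return a
    if rng is None:
        rng = np.random.default_rng(0)
    mask = rng.random(a.shape) >= p
    return a * mask / (1.0 - p)

a = np.ones(8)
print(dropout(a))                # roughly half the units zeroed, rest 2.0
print(dropout(a, train=False))   # inference: unchanged
```

Because each unit can vanish at any step, no unit can rely on specific co-activated partners, which is the co-adaptation argument Krizhevsky et al. give for the technique.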
Hardware constrains architecture. AlexNet was split across two GTX 580 GPUs because no single GPU had enough memory. The inter-GPU communication pattern (groups of feature maps processed independently on each GPU, with cross-GPU connections only at certain layers) was driven by hardware limits, not by any principled architectural choice. This pragmatic approach to hardware constraints became standard: modern architectures are co-designed with the available hardware (tensor cores, high-bandwidth memory, inter-node communication).
The Path to Vision Transformers
The progression from AlexNet to ViT (Dosovitskiy et al. 2020) follows a clear arc: each step reduced the inductive bias hardcoded into the architecture.
- AlexNet/VGGNet: strong locality and translation equivariance via convolutions
- GoogLeNet: multi-scale processing within each layer
- ResNet: identity shortcuts allow information to skip layers
- ViT: remove convolutions entirely, treat image patches as tokens
ViT showed that with sufficient data (300M+ images), a transformer with minimal vision-specific inductive bias matches or exceeds CNNs. This raises a theoretical question: is the CNN inductive bias helpful, or merely a useful prior when data is scarce?
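The "patches as tokens" step is a pure reshaping operation. A minimal sketch with ViT's standard 16×16 patch size and a 224×224 input (common defaults; linear projection and position embeddings omitted):

```python
import numpy as np

def patchify(image, patch=16):
    """Split an H x W x C image into ViT-style non-overlapping patches,
    each flattened to a vector -- the '16x16 words' of the title."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0
    return (image.reshape(h // patch, patch, w // patch, patch, c)
                 .transpose(0, 2, 1, 3, 4)       # group the two patch axes
                 .reshape(-1, patch * patch * c))

img = np.zeros((224, 224, 3))
print(patchify(img).shape)   # (196, 768): 14*14 tokens of dim 16*16*3
```

Everything after this step is a standard transformer; the only vision-specific inductive bias left is the choice of patch grid itself.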
Canonical Examples
Parameter count comparison
AlexNet: ~60M parameters for 8 layers. VGGNet-16: ~138M parameters for 16 layers. GoogLeNet: ~5M parameters for 22 layers. ResNet-152: ~60M parameters for 152 layers.
GoogLeNet achieves better accuracy than VGG with 28x fewer parameters. This demonstrates that architecture design (inception modules, 1×1 convolutions) can be far more efficient than simply stacking layers.
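The savings from 1×1 bottlenecks can be checked with back-of-envelope arithmetic. The channel sizes below are illustrative, not GoogLeNet's exact configuration:

```python
def conv_params(k, c_in, c_out):
    """Parameters of a k x k convolution, ignoring biases."""
    return k * k * c_in * c_out

# Direct 5x5 convolution on 256 channels vs. an Inception-style 1x1
# bottleneck down to 64 channels followed by the 5x5 convolution.
direct = conv_params(5, 256, 256)                              # 1,638,400
bottleneck = conv_params(1, 256, 64) + conv_params(5, 64, 256) # 425,984
print(direct, bottleneck, round(direct / bottleneck, 2))
```

The bottleneck version costs roughly a quarter of the parameters (and FLOPs) for the same spatial extent, which is how GoogLeNet reaches 22 layers on a 5M-parameter budget.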
Common Confusions
AlexNet did not invent CNNs or GPU training
LeNet-5 (LeCun et al. 1998) used CNNs for digit recognition. Ciresan et al. (2011) trained CNNs on GPUs before AlexNet. AlexNet's contribution was demonstrating that these ideas work at ImageNet scale with specific architectural choices (ReLU, dropout, data augmentation) that together produced a qualitative leap in performance.
Deeper is not always better without architectural support
Simply adding layers to a VGG-style network degrades performance due to optimization difficulty (not overfitting). ResNet's skip connections solved this specific problem. Depth helps only when the optimization landscape permits gradient flow.
Exercises
Problem
A sigmoid network has 10 layers. Estimate the magnitude of the gradient at layer 1 relative to layer 10, assuming the sigmoid derivative is at most 0.25 everywhere.
Problem
Two 3×3 convolutions applied sequentially have the same receptive field as a single 5×5 convolution. Compare the parameter counts (ignoring bias) for C input and C output channels in both cases.
References
Canonical:
- Krizhevsky, Sutskever, Hinton, "ImageNet Classification with Deep Convolutional Neural Networks", NeurIPS 2012
- He, Zhang, Ren, Sun, "Deep Residual Learning for Image Recognition", CVPR 2016
Current:
- Dosovitskiy et al., "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale", ICLR 2021
- Simonyan and Zisserman, "Very Deep Convolutional Networks for Large-Scale Image Recognition", ICLR 2015
- Szegedy et al., "Going Deeper with Convolutions", CVPR 2015
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Convolutional Neural Networks (Layer 3)
- Feedforward Networks and Backpropagation (Layer 2)
- Differentiation in R^n (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
- Matrix Calculus (Layer 1)
- The Jacobian Matrix (Layer 0A)
- The Hessian Matrix (Layer 0A)
- Activation Functions (Layer 1)
- Convex Optimization Basics (Layer 1)
- Matrix Operations and Properties (Layer 0A)
- Vectors, Matrices, and Linear Maps (Layer 0A)