Knowledge Distillation
Training a small student model to mimic a large teacher: soft targets, temperature scaling, dark knowledge, and why the teacher's mistakes carry useful information about class structure.
Why This Matters
Large models (billions of parameters) achieve strong performance but are expensive to serve. A 175B parameter model requires multiple GPUs for inference, making it impractical for edge and on-device ML deployment. Knowledge distillation trains a small "student" model to approximate a large "teacher" model, often recovering most of the teacher's accuracy at a fraction of the cost.
Distillation is not just compression. The student trained on teacher outputs often outperforms an identical architecture trained only on hard labels. The teacher's soft probability distribution over classes contains information that hard labels discard.
Formal Setup
Let $f_T$ be a teacher network and $f_S$ a student network. Both map inputs $x$ to logit vectors $z \in \mathbb{R}^K$ for $K$ classes.
Temperature-Scaled Softmax
For logits $z = (z_1, \ldots, z_K)$ and temperature $T > 0$:

$$p_i(z; T) = \frac{\exp(z_i / T)}{\sum_{j=1}^{K} \exp(z_j / T)}$$

At $T = 1$, this is the standard softmax. As $T \to \infty$, the distribution approaches uniform. As $T \to 0^+$, it approaches a one-hot vector at the argmax.
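A minimal NumPy sketch of the temperature-scaled softmax (the logit values below are illustrative):

```python
import numpy as np

def softmax_T(z, T=1.0):
    """Temperature-scaled softmax: exp(z_i / T), normalized over classes."""
    a = np.asarray(z, dtype=float) / T
    a -= a.max()          # subtract max for numerical stability
    e = np.exp(a)
    return e / e.sum()

z = [4.0, 1.0, 0.0]
print(softmax_T(z, T=1.0))   # sharp, close to one-hot
print(softmax_T(z, T=10.0))  # flatter, closer to uniform
```

Raising $T$ keeps the ranking of the classes but compresses the gaps between their probabilities.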
Distillation Loss
The distillation loss combines two terms:

$$\mathcal{L} = \alpha \, T^2 \, \mathrm{KL}\!\left(p^{(t)}_T \,\big\|\, p^{(s)}_T\right) + (1 - \alpha) \, \mathrm{CE}\!\left(y, \, p^{(s)}_1\right)$$

where $p^{(t)}_T$ and $p^{(s)}_T$ are the teacher and student softmax outputs at temperature $T$, $y$ is the hard label, $\mathrm{KL}$ is the KL divergence, and $\alpha$ balances the two terms. The factor $T^2$ compensates for the reduced gradient magnitudes at high temperature.
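The combined loss can be sketched directly from this definition; the $\alpha$ and $T$ defaults below are illustrative choices, not tuned values:

```python
import numpy as np

def softmax(z, T=1.0):
    a = np.asarray(z, dtype=float) / T
    a -= a.max()
    e = np.exp(a)
    return e / e.sum()

def distill_loss(z_t, z_s, y, T=4.0, alpha=0.7):
    """alpha * T^2 * KL(teacher_T || student_T) + (1 - alpha) * CE(y, student_1).

    z_t, z_s: teacher / student logits; y: hard label index.
    The T^2 factor keeps the KL-term gradients at a useful scale.
    """
    p_t, p_s = softmax(z_t, T), softmax(z_s, T)
    kl = np.sum(p_t * (np.log(p_t) - np.log(p_s)))
    ce = -np.log(softmax(z_s, 1.0)[y])
    return alpha * T**2 * kl + (1 - alpha) * ce

z_teacher = [3.0, 1.0, -1.0]
z_student = [2.0, 1.5, -0.5]
print(distill_loss(z_teacher, z_student, y=0))
```

When the student logits equal the teacher logits, the KL term vanishes and only the hard-label cross-entropy remains.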
Why Soft Targets Help: Dark Knowledge
Consider a teacher trained on CIFAR-10. For an image of a car, the teacher might output: car 0.9, truck 0.05, airplane 0.03, ship 0.02. The hard label says only "car." But the teacher's soft distribution encodes that trucks look more like cars than airplanes do, and airplanes look more like cars than ships do.
This relational information between classes is what Hinton et al. (2015) call dark knowledge. It is invisible in the hard labels but present in the teacher's soft outputs. High temperature amplifies these small probabilities, making the dark knowledge more accessible to the student.
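The amplification can be seen numerically. Treating the log-probabilities of the car example as logits (an assumption for illustration; the true teacher logits could differ by a constant shift), reheating at $T = 4$ lifts the non-car classes to a visible share while preserving their ranking:

```python
import numpy as np

# Teacher probabilities for the car example in the text.
p = np.array([0.90, 0.05, 0.03, 0.02])   # car, truck, airplane, ship
z = np.log(p)                            # logits consistent with these probabilities

def softmax_T(z, T):
    a = z / T
    e = np.exp(a - a.max())
    return e / e.sum()

for T in (1.0, 4.0):
    print(T, softmax_T(z, T).round(3))
```

At $T = 1$ the non-car classes are nearly invisible; at $T = 4$ the student receives a clear signal that truck is closer to car than airplane is, and airplane closer than ship.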
Distillation Gradient at High Temperature
Statement
At high temperature ($T \gg |z_i|$ for all logits), the gradient of the KL divergence term with respect to the student logits approximates:

$$\frac{\partial \mathcal{L}_{\mathrm{KL}}}{\partial z^{(s)}_i} \approx \frac{1}{K T^2} \left(z^{(s)}_i - z^{(t)}_i\right)$$

That is, the distillation loss encourages the student logits to match the teacher logits, weighted by $1/(K T^2)$. The $T^2$ prefactor in the distillation loss cancels this, giving gradients of order $1/K$.
Intuition
At high temperature, softmax becomes nearly linear: $p_i \approx \frac{1}{K}\left(1 + \frac{z_i - \bar{z}}{T}\right)$, where $\bar{z}$ is the mean logit. The KL divergence between two nearly-uniform distributions reduces to a squared difference between the logit vectors. Distillation at high temperature is approximately matching teacher logits, which is a regression problem. This is gentler than matching sharp distributions and provides a richer gradient signal.
Proof Sketch
Taylor expand $\exp(z_i / T)$ for large $T$: $e^{z_i / T} \approx 1 + z_i / T$. The softmax becomes $p_i \approx \frac{1}{K}\left(1 + \frac{z_i - \bar{z}}{T}\right)$, where $\bar{z}$ is the mean logit. Substitute into the KL formula and differentiate with respect to $z^{(s)}_i$.
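The approximation can be checked numerically: at a temperature much larger than any logit, the exact KL gradient $(p^{(s)} - p^{(t)})/T$ closely tracks $(z^{(s)} - z^{(t)})/(K T^2)$ after centering. The logit values below are arbitrary:

```python
import numpy as np

def softmax(z, T):
    a = z / T
    e = np.exp(a - a.max())
    return e / e.sum()

z_t = np.array([2.0, 0.5, -1.0, -1.5])   # teacher logits (arbitrary)
z_s = np.array([1.0, 1.0, -0.5, -1.5])   # student logits (arbitrary)
K, T = len(z_t), 50.0                    # T much larger than any |logit|

# Exact gradient of KL(p_t || p_s) with respect to the student logits.
p_t, p_s = softmax(z_t, T), softmax(z_s, T)
exact = (p_s - p_t) / T

# High-temperature approximation: (z_s - z_t) / (K T^2), after centering.
approx = ((z_s - z_s.mean()) - (z_t - z_t.mean())) / (K * T**2)

print(np.max(np.abs(exact - approx)))   # small relative to the gradient scale
```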
Why It Matters
This explains the $T^2$ factor in the loss and why moderate temperatures (commonly around $2$ to $5$ in practice) work best. Too low: the student just sees near-one-hot targets. Too high: all class information is washed out. The sweet spot preserves the ranking and relative magnitudes of the teacher logits.
Failure Mode
The high-temperature approximation breaks down when some logits are much larger than $T$. For confident teacher predictions (one logit dominating), the soft distribution is still nearly one-hot even at moderate $T$, and the student recovers little dark knowledge for that example.
Logit Distillation vs. Feature Distillation
Logit distillation (Hinton et al., 2015) matches the teacher's output distribution. This is the standard approach described above.
Feature distillation (Romero et al., 2015, "FitNets") matches intermediate representations. The student is trained so that its hidden layers approximate the teacher's hidden layers. This requires choosing which layers to align (often the middle layer) and a projection to handle dimension mismatches.
Feature distillation can be combined with logit distillation. The advantage: it provides more supervision signal, especially when the student architecture is very different from the teacher. The disadvantage: choosing which layers to align and how to project between them introduces design choices with no clear optimal answer.
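A minimal sketch of the FitNets-style hint loss, assuming a student hidden layer narrower than the teacher's; all shapes and the random projection here are hypothetical placeholders for learned activations and a trainable projection:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical hidden activations: the teacher is wider than the student.
h_teacher = rng.normal(size=(32, 512))   # batch of 32, teacher width 512
h_student = rng.normal(size=(32, 128))   # same batch, student width 128

# A projection maps student features into the teacher's space to handle
# the dimension mismatch; in practice it is learned jointly with the student.
W_proj = rng.normal(size=(128, 512)) / np.sqrt(128)

def hint_loss(h_s, h_t, W):
    """FitNets-style hint loss: MSE between projected student and teacher features."""
    return np.mean((h_s @ W - h_t) ** 2)

print(hint_loss(h_student, h_teacher, W_proj))
```

Gradients of this loss flow into both the student's lower layers and the projection, giving the student layer-level supervision in addition to the logit-matching signal.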
When Distillation Fails
Distillation does not always improve over training from scratch.
Capacity gap. If the student is too small to represent the teacher's function, distillation cannot help. The student will learn a biased approximation. Mirzadeh et al. (2020) showed that extremely large teacher-student gaps can actually hurt, and proposed "teacher assistant" distillation (distill teacher to medium model, then medium model to small model).
Task mismatch. If the teacher was trained on a different distribution than the student's target, the soft labels may be misleading. Domain shift between teacher training data and student deployment data degrades distillation.
Label noise. A teacher trained on noisy labels will produce soft targets that encode noise patterns. The student faithfully learns these errors.
Common Confusions
Distillation is not the same as model pruning
Pruning removes weights from an existing model. Distillation trains a new (smaller) model to mimic a larger one. The student architecture can be completely different from the teacher. Pruning preserves the architecture but makes it sparse; distillation allows architectural changes.
The $T^2$ factor is not optional
Without the $T^2$ scaling in the loss, the gradients from the KL term shrink as $1/T^2$ when the temperature increases. The $T^2$ factor compensates for this, ensuring the distillation gradient remains at a useful magnitude regardless of temperature.
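The shrinkage is easy to verify: the norm of the raw KL gradient falls roughly as $1/T^2$, while the $T^2$-scaled version stays at a stable magnitude (logit values below are arbitrary):

```python
import numpy as np

def softmax(z, T):
    a = z / T
    e = np.exp(a - a.max())
    return e / e.sum()

def kl_grad_norm(z_t, z_s, T):
    """Norm of d KL(p_t || p_s) / d z_s at temperature T."""
    g = (softmax(z_s, T) - softmax(z_t, T)) / T
    return np.linalg.norm(g)

z_t = np.array([3.0, 1.0, -2.0])
z_s = np.array([1.0, 2.0, -1.0])

for T in (1.0, 5.0, 25.0):
    raw = kl_grad_norm(z_t, z_s, T)
    print(T, raw, T**2 * raw)   # raw shrinks with T; T^2 * raw stays comparable
```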
Canonical Examples
MNIST teacher-student
Teacher: 3-layer MLP with 1200 hidden units per layer (3.6M parameters, 1.2% test error). Student: 2-layer MLP with 800 hidden units (0.96M parameters). Trained on hard labels: 1.9% test error. Trained with distillation: 1.4% test error. With 3.7x fewer parameters, the distilled student nearly matches the teacher, cutting test error from 1.9% to 1.4% relative to training from scratch.
Exercises
Problem
For a 3-class problem with teacher logits of, say, $z = (4, 2, 1)$, compute the teacher's softmax output at $T = 1$ and $T = 10$. Which temperature reveals more about the relationship between classes 2 and 3?
Problem
Show that as $T \to \infty$, the KL divergence approaches a scaled squared Euclidean distance between the logit vectors (after centering). What is the scaling factor?
References
Canonical:
- Hinton, Vinyals, Dean, "Distilling the Knowledge in a Neural Network" (2015), Sections 1-3
- Romero et al., "FitNets: Hints for Thin Deep Nets" (ICLR 2015)
Current:
- Gou et al., "Knowledge Distillation: A Survey" (IJCV 2021), Sections 2-4
- Mirzadeh et al., "Improved Knowledge Distillation via Teacher Assistant" (AAAI 2020)
Next Topics
Distillation connects to model compression, deployment optimization, and the broader question of what knowledge neural networks actually learn.
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Feedforward Networks and Backpropagation (Layer 2)
- Differentiation in Rn (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
- Matrix Calculus (Layer 1)
- The Jacobian Matrix (Layer 0A)
- The Hessian Matrix (Layer 0A)
- Activation Functions (Layer 1)
- Convex Optimization Basics (Layer 1)
- Matrix Operations and Properties (Layer 0A)