Knowledge Distillation
Training a small student model to mimic a large teacher: soft targets, temperature scaling, dark knowledge, and why the teacher's mistakes carry useful information about class structure.
Why This Matters
Large models (billions of parameters) achieve strong performance but are expensive to serve. A 175B parameter model requires multiple GPUs for inference, making it impractical for edge and on-device ML deployment. Knowledge distillation trains a small "student" model to approximate a large "teacher" model, often recovering most of the teacher's accuracy at a fraction of the cost.
Distillation is not just compression. The student trained on teacher outputs often outperforms an identical architecture trained only on hard labels. The teacher's soft probability distribution over classes contains information that hard labels discard.
Formal Setup
Let $f_T$ be a teacher network and $f_S$ a student network. Both map inputs $x$ to logit vectors $z \in \mathbb{R}^K$ for $K$ classes.
Temperature-Scaled Softmax
For logits $z = (z_1, \ldots, z_K)$ and temperature $T > 0$:

$$p_i(z; T) = \frac{\exp(z_i / T)}{\sum_{j=1}^{K} \exp(z_j / T)}$$

At $T = 1$, this is the standard softmax. As $T \to \infty$, the distribution approaches uniform. As $T \to 0^+$, it approaches a one-hot vector at the argmax.
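A minimal NumPy sketch of the temperature-scaled softmax (the logit values below are illustrative):

```python
import numpy as np

def softmax_T(z, T=1.0):
    """Temperature-scaled softmax: exp(z_i / T), normalized over classes."""
    a = np.asarray(z, dtype=float) / T
    a -= a.max()          # subtract max for numerical stability
    e = np.exp(a)
    return e / e.sum()

z = [4.0, 1.0, 0.0]
print(softmax_T(z, T=1.0))   # sharp, close to one-hot
print(softmax_T(z, T=10.0))  # flatter, closer to uniform
```

Raising $T$ keeps the ranking of the classes but compresses the gaps between their probabilities.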
Distillation Loss
The distillation loss combines two terms:

$$\mathcal{L} = \alpha \, T^2 \, \mathrm{KL}\!\left(p^{(t)}_T \,\big\|\, p^{(s)}_T\right) + (1 - \alpha) \, \mathrm{CE}\!\left(y, \, p^{(s)}_1\right)$$

where $p^{(t)}_T$ and $p^{(s)}_T$ are the teacher and student softmax outputs at temperature $T$, $y$ is the hard label, $\mathrm{KL}$ is the KL divergence, and $\alpha$ balances the two terms. The factor $T^2$ compensates for the reduced gradient magnitudes at high temperature.
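The combined loss can be sketched directly from this definition; the $\alpha$ and $T$ defaults below are illustrative choices, not tuned values:

```python
import numpy as np

def softmax(z, T=1.0):
    a = np.asarray(z, dtype=float) / T
    a -= a.max()
    e = np.exp(a)
    return e / e.sum()

def distill_loss(z_t, z_s, y, T=4.0, alpha=0.7):
    """alpha * T^2 * KL(teacher_T || student_T) + (1 - alpha) * CE(y, student_1).

    z_t, z_s: teacher / student logits; y: hard label index.
    The T^2 factor keeps the KL-term gradients at a useful scale.
    """
    p_t, p_s = softmax(z_t, T), softmax(z_s, T)
    kl = np.sum(p_t * (np.log(p_t) - np.log(p_s)))
    ce = -np.log(softmax(z_s, 1.0)[y])
    return alpha * T**2 * kl + (1 - alpha) * ce

z_teacher = [3.0, 1.0, -1.0]
z_student = [2.0, 1.5, -0.5]
print(distill_loss(z_teacher, z_student, y=0))
```

When the student logits equal the teacher logits, the KL term vanishes and only the hard-label cross-entropy remains.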
Why Soft Targets Help: Dark Knowledge
Consider a teacher trained on CIFAR-10. For an image of a car, the teacher might output: car 0.9, truck 0.05, airplane 0.03, ship 0.02. The hard label says only "car." But the teacher's soft distribution encodes that trucks look more like cars than airplanes do, and airplanes look more like cars than ships do.
This relational information between classes is what Hinton et al. (2015) call dark knowledge. It is invisible in the hard labels but present in the teacher's soft outputs. High temperature amplifies these small probabilities, making the dark knowledge more accessible to the student.
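The amplification can be seen numerically. Treating the log-probabilities of the car example as logits (an assumption for illustration; the true teacher logits could differ by a constant shift), reheating at $T = 4$ lifts the non-car classes to a visible share while preserving their ranking:

```python
import numpy as np

# Teacher probabilities for the car example in the text.
p = np.array([0.90, 0.05, 0.03, 0.02])   # car, truck, airplane, ship
z = np.log(p)                            # logits consistent with these probabilities

def softmax_T(z, T):
    a = z / T
    e = np.exp(a - a.max())
    return e / e.sum()

for T in (1.0, 4.0):
    print(T, softmax_T(z, T).round(3))
```

At $T = 1$ the non-car classes are nearly invisible; at $T = 4$ the student receives a clear signal that truck is closer to car than airplane is, and airplane closer than ship.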
Distillation Gradient at High Temperature
Statement
At high temperature ($T \gg |z_i|$ for all logits), the gradient of the KL divergence term with respect to the student logits approximates:

$$\frac{\partial \mathcal{L}_{\mathrm{KL}}}{\partial z^{(s)}_i} \approx \frac{1}{K T^2} \left(z^{(s)}_i - z^{(t)}_i\right)$$

That is, the distillation loss encourages the student logits to match the teacher logits, weighted by $1/(K T^2)$. The $T^2$ prefactor in the distillation loss cancels this, giving gradients of order $1/K$.
Intuition
At high temperature, softmax becomes nearly linear: $p_i \approx \frac{1}{K}\left(1 + \frac{z_i - \bar{z}}{T}\right)$, where $\bar{z}$ is the mean logit. The KL divergence between two nearly-uniform distributions reduces to a squared difference between the logit vectors. Distillation at high temperature is approximately matching teacher logits, which is a regression problem. This is gentler than matching sharp distributions and provides a richer gradient signal.
Proof Sketch
Taylor expand $\exp(z_i / T)$ for large $T$: $e^{z_i / T} \approx 1 + z_i / T$. The softmax becomes $p_i \approx \frac{1}{K}\left(1 + \frac{z_i - \bar{z}}{T}\right)$, where $\bar{z}$ is the mean logit. Substitute into the KL formula and differentiate with respect to $z^{(s)}_i$.
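The approximation can be checked numerically: at a temperature much larger than any logit, the exact KL gradient $(p^{(s)} - p^{(t)})/T$ closely tracks $(z^{(s)} - z^{(t)})/(K T^2)$ after centering. The logit values below are arbitrary:

```python
import numpy as np

def softmax(z, T):
    a = z / T
    e = np.exp(a - a.max())
    return e / e.sum()

z_t = np.array([2.0, 0.5, -1.0, -1.5])   # teacher logits (arbitrary)
z_s = np.array([1.0, 1.0, -0.5, -1.5])   # student logits (arbitrary)
K, T = len(z_t), 50.0                    # T much larger than any |logit|

# Exact gradient of KL(p_t || p_s) with respect to the student logits.
p_t, p_s = softmax(z_t, T), softmax(z_s, T)
exact = (p_s - p_t) / T

# High-temperature approximation: (z_s - z_t) / (K T^2), after centering.
approx = ((z_s - z_s.mean()) - (z_t - z_t.mean())) / (K * T**2)

print(np.max(np.abs(exact - approx)))   # small relative to the gradient scale
```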
Why It Matters
This explains the $T^2$ factor in the loss and why moderate temperatures (commonly around $2$ to $5$ in practice) work best. Too low: the student just sees near-one-hot targets. Too high: all class information is washed out. The sweet spot preserves the ranking and relative magnitudes of the teacher logits.
Failure Mode
The high-temperature approximation breaks down when some logits are much larger than $T$. For confident teacher predictions (one logit dominating), the soft distribution is still nearly one-hot even at moderate $T$, and the student recovers little dark knowledge for that example.
Logit Distillation vs. Feature Distillation
Logit distillation (Hinton et al., 2015) matches the teacher's output distribution. This is the standard approach described above.
Feature distillation (Romero et al., 2015, "FitNets") matches intermediate representations. The student is trained so that its hidden layers approximate the teacher's hidden layers. This requires choosing which layers to align (often the middle layer) and a projection to handle dimension mismatches.
Feature distillation can be combined with logit distillation. The advantage: it provides more supervision signal, especially when the student architecture is very different from the teacher. The disadvantage: choosing which layers to align and how to project between them introduces design choices with no clear optimal answer.
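A minimal sketch of the FitNets-style hint loss, assuming a student hidden layer narrower than the teacher's; all shapes and the random projection here are hypothetical placeholders for learned activations and a trainable projection:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical hidden activations: the teacher is wider than the student.
h_teacher = rng.normal(size=(32, 512))   # batch of 32, teacher width 512
h_student = rng.normal(size=(32, 128))   # same batch, student width 128

# A projection maps student features into the teacher's space to handle
# the dimension mismatch; in practice it is learned jointly with the student.
W_proj = rng.normal(size=(128, 512)) / np.sqrt(128)

def hint_loss(h_s, h_t, W):
    """FitNets-style hint loss: MSE between projected student and teacher features."""
    return np.mean((h_s @ W - h_t) ** 2)

print(hint_loss(h_student, h_teacher, W_proj))
```

Gradients of this loss flow into both the student's lower layers and the projection, giving the student layer-level supervision in addition to the logit-matching signal.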
When Distillation Fails
Distillation does not always improve over training from scratch.
Capacity gap. If the student is too small to represent the teacher's function, distillation cannot help. The student will learn a biased approximation. Mirzadeh et al. (2020) showed that extremely large teacher-student gaps can actually hurt, and proposed "teacher assistant" distillation (distill teacher to medium model, then medium model to small model).
Task mismatch. If the teacher was trained on a different distribution than the student's target, the soft labels may be misleading. Domain shift between teacher training data and student deployment data degrades distillation.
Label noise. A teacher trained on noisy labels will produce soft targets that encode noise patterns. The student faithfully learns these errors.
Common Confusions
Distillation is not the same as model pruning
Pruning removes weights from an existing model. Distillation trains a new (smaller) model to mimic a larger one. The student architecture can be completely different from the teacher. Pruning preserves the architecture but makes it sparse; distillation allows architectural changes.
The $T^2$ factor is not optional
Without the $T^2$ scaling in the loss, the gradients from the KL term shrink as $1/T^2$ when the temperature increases. The $T^2$ factor compensates for this, ensuring the distillation gradient remains at a useful magnitude regardless of temperature.
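The shrinkage is easy to verify: the norm of the raw KL gradient falls roughly as $1/T^2$, while the $T^2$-scaled version stays at a stable magnitude (logit values below are arbitrary):

```python
import numpy as np

def softmax(z, T):
    a = z / T
    e = np.exp(a - a.max())
    return e / e.sum()

def kl_grad_norm(z_t, z_s, T):
    """Norm of d KL(p_t || p_s) / d z_s at temperature T."""
    g = (softmax(z_s, T) - softmax(z_t, T)) / T
    return np.linalg.norm(g)

z_t = np.array([3.0, 1.0, -2.0])
z_s = np.array([1.0, 2.0, -1.0])

for T in (1.0, 5.0, 25.0):
    raw = kl_grad_norm(z_t, z_s, T)
    print(T, raw, T**2 * raw)   # raw shrinks with T; T^2 * raw stays comparable
```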
Canonical Examples
MNIST teacher-student
Teacher: 3-layer MLP with 1200 hidden units per layer (3.6M parameters, 1.2% test error). Student: 2-layer MLP with 800 hidden units (0.96M parameters). Trained on hard labels: 1.9% test error. Trained with distillation: 1.4% test error. With 3.7x fewer parameters, the distilled student nearly matches the teacher, cutting test error from 1.9% to 1.4% relative to training from scratch.
Exercises
Problem
For a 3-class problem with teacher logits of, say, $z = (4, 2, 1)$, compute the teacher's softmax output at $T = 1$ and $T = 10$. Which temperature reveals more about the relationship between classes 2 and 3?
Problem
Show that as $T \to \infty$, the KL divergence approaches a scaled squared Euclidean distance between the logit vectors (after centering). What is the scaling factor?
References
Canonical:
- Hinton, Vinyals, Dean, "Distilling the Knowledge in a Neural Network" (2015), Sections 1-3
- Romero et al., "FitNets: Hints for Thin Deep Nets" (ICLR 2015)
Current:
- Gou et al., "Knowledge Distillation: A Survey" (IJCV 2021), Sections 2-4
- Mirzadeh et al., "Improved Knowledge Distillation via Teacher Assistant" (AAAI 2020)
Next Topics
Distillation connects to model compression, deployment optimization, and the broader question of what knowledge neural networks actually learn.
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Feedforward Networks and Backpropagation (Layer 2)
- Differentiation in Rn (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
- Matrix Calculus (Layer 1)
- The Jacobian Matrix (Layer 0A)
- The Hessian Matrix (Layer 0A)
- Activation Functions (Layer 1)
- Convex Optimization Basics (Layer 1)
- Matrix Operations and Properties (Layer 0A)