
ML Methods

Transfer Learning

Pretrain on a large dataset, fine-tune on a smaller target: why lower layers learn transferable features, feature extraction vs fine-tuning, domain adaptation, negative transfer, and the foundation model paradigm.

Advanced · Tier 2 · Stable · ~45 min

Why This Matters

Most practical ML systems do not train from scratch. A vision model starts from ImageNet-pretrained convolutional network weights. A language model starts from weights pretrained on a large text corpus. Transfer learning is the reason modern ML achieves strong results with small labeled datasets: the pretrained model already encodes useful representations.

The theoretical question is precise: under what conditions does training on source data $\mathcal{D}_S$ improve performance on a different target distribution $\mathcal{D}_T$? The answer involves both the similarity of the distributions and the structure of the learned representations.

Formal Setup

Definition

Source and Target Domains

A source domain $\mathcal{D}_S$ provides abundant labeled data $(x, y) \sim \mathcal{D}_S$. A target domain $\mathcal{D}_T$ is the distribution we want to perform well on, typically with limited labeled data. Transfer learning trains on $\mathcal{D}_S$ and adapts to $\mathcal{D}_T$.

Definition

Feature Extraction

Feature extraction uses the pretrained model as a fixed feature extractor. Freeze all layers except the final classification head. Train only the head on target data. This is appropriate when target data is very limited and source features are already informative.
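
In code, feature extraction amounts to freezing the backbone and training only the head. A minimal NumPy sketch on synthetic data, with a fixed random projection standing in for a pretrained backbone (the data, weights, and dimensions here are illustrative assumptions, not a real checkpoint):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "pretrained" backbone: a fixed random projection standing in
# for frozen pretrained features (an illustrative assumption, not a checkpoint).
W_backbone = 0.1 * rng.normal(size=(20, 8))

def extract_features(x):
    # Frozen feature extractor: W_backbone is never updated.
    return np.tanh(x @ W_backbone)

# Small synthetic target set: 100 examples, binary labels.
X = rng.normal(size=(100, 20))
y = (X[:, 0] + 0.1 * rng.normal(size=100) > 0).astype(float)

# Train only the linear classification head on the frozen features.
Z = extract_features(X)
w_head, b_head, lr = np.zeros(8), 0.0, 0.5
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(Z @ w_head + b_head)))  # sigmoid
    w_head -= lr * Z.T @ (p - y) / len(y)             # logistic-loss gradients
    b_head -= lr * np.mean(p - y)

acc = np.mean(((Z @ w_head + b_head) > 0) == (y == 1))
print(f"head-only training accuracy: {acc:.2f}")
```

Only `w_head` and `b_head` receive gradient updates; the backbone never changes, which is what makes this cheap and hard to overfit with few labels.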

Definition

Fine-Tuning

Fine-tuning initializes from pretrained weights and continues training all (or some) layers on target data, typically with a smaller learning rate. This allows the model to adapt its internal representations to the target domain.
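
Fine-tuning differs only in that gradients also flow into the backbone, usually with a smaller step size there. A toy NumPy sketch under the same illustrative assumptions (synthetic data, a small matrix standing in for pretrained weights):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy stand-ins for a pretrained checkpoint (illustrative assumptions).
W = 0.1 * rng.normal(size=(20, 8))   # "pretrained" backbone weights
w = np.zeros(8)                      # freshly initialized head

X = rng.normal(size=(100, 20))
y = (X[:, 1] > 0).astype(float)

lr_head, lr_backbone = 0.5, 0.05     # backbone lr ~10x smaller than the head's
for _ in range(300):
    H = np.tanh(X @ W)                        # forward through the backbone
    p = 1.0 / (1.0 + np.exp(-(H @ w)))        # head + sigmoid
    d_logit = (p - y) / len(y)                # logistic-loss gradient at the logit
    grad_w = H.T @ d_logit                    # head gradient
    d_H = np.outer(d_logit, w) * (1 - H**2)   # backprop through tanh
    grad_W = X.T @ d_H                        # backbone gradient
    w -= lr_head * grad_w
    W -= lr_backbone * grad_W                 # small step: adapt, don't destroy

acc = np.mean((np.tanh(X @ W) @ w > 0) == (y == 1))
print(f"fine-tuned training accuracy: {acc:.2f}")
```

The small backbone step size is the point: the representation is adapted gently rather than overwritten by the limited target data.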

Why Transfer Works: Feature Hierarchy

Zeiler and Fergus (2014) visualized CNN features across layers. The pattern is consistent:

  • Layer 1: edges, color gradients. These features are universal across visual tasks.
  • Layers 2-3: textures, simple patterns. Still largely task-independent.
  • Layers 4+: object parts, task-specific combinations.

Yosinski et al. (2014) measured feature transferability quantitatively by training on one half of ImageNet classes and transferring to the other half. Transferring the first $k$ layers and retraining layers $k+1, \ldots, L$ showed that early layers transfer well, middle layers transfer moderately, and late layers are task-specific.

This gradient of generality explains when feature extraction (using early/middle layers) suffices and when fine-tuning (adapting late layers) is necessary.

Domain Adaptation Theory

Theorem

Ben-David Domain Adaptation Bound

Statement

For any hypothesis $h \in \mathcal{H}$, the target risk is bounded by:

$$R_T(h) \leq R_S(h) + d_{\mathcal{H}\Delta\mathcal{H}}(\mathcal{D}_S, \mathcal{D}_T) + \lambda^*$$

where $R_S(h)$ is the source risk, $d_{\mathcal{H}\Delta\mathcal{H}}(\mathcal{D}_S, \mathcal{D}_T)$ is the $\mathcal{H}\Delta\mathcal{H}$-divergence between the source and target distributions, and $\lambda^* = \min_{h \in \mathcal{H}}[R_S(h) + R_T(h)]$ is the combined error of the ideal joint hypothesis.

Intuition

Target error is bounded by three terms: (1) how well you do on the source, (2) how different the domains are (measured by the best classifier that can distinguish between them), and (3) whether any single hypothesis can do well on both domains simultaneously. If $\lambda^*$ is large, no hypothesis can do well on both domains, and transfer is hopeless regardless of the algorithm.

Proof Sketch

Write $R_T(h) = R_S(h) + (R_T(h) - R_S(h))$. The difference $|R_T(h) - R_S(h)|$ is bounded by the $\mathcal{H}\Delta\mathcal{H}$-divergence (which measures the maximum discrepancy between distributions over hypothesis pairs) plus the irreducible error $\lambda^*$. The divergence term uses the triangle inequality for the classification error metric.
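
Filling in the sketch: with $h^*$ the ideal joint hypothesis achieving $\lambda^* = R_S(h^*) + R_T(h^*)$, one standard chain of triangle inequalities gives (up to the normalization convention used for the divergence term):

```latex
\begin{aligned}
R_T(h) &\leq R_T(h^*) + \Pr_{\mathcal{D}_T}\!\left[h \neq h^*\right] \\
&\leq R_T(h^*) + \Pr_{\mathcal{D}_S}\!\left[h \neq h^*\right]
   + \bigl|\Pr_{\mathcal{D}_T}[h \neq h^*] - \Pr_{\mathcal{D}_S}[h \neq h^*]\bigr| \\
&\leq R_T(h^*) + R_S(h) + R_S(h^*)
   + d_{\mathcal{H}\Delta\mathcal{H}}(\mathcal{D}_S, \mathcal{D}_T) \\
&= R_S(h) + d_{\mathcal{H}\Delta\mathcal{H}}(\mathcal{D}_S, \mathcal{D}_T) + \lambda^*
\end{aligned}
```

The first and third lines use the triangle inequality for the disagreement metric; the middle absolute difference is exactly what the $\mathcal{H}\Delta\mathcal{H}$-divergence bounds.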

Why It Matters

This bound makes precise when transfer learning can and cannot work. It says: (1) you need low source error (the pretrained model must be good), (2) the domains must be similar in a precise sense (low divergence), and (3) the task must be compatible (low $\lambda^*$). Condition (3) is often ignored but is the most important: if no single model can do well on both domains, more data from the source cannot help.

Failure Mode

The $\mathcal{H}\Delta\mathcal{H}$-divergence is generally intractable to compute exactly. The empirical proxy (train a classifier to distinguish source from target, measure its accuracy) is a useful heuristic but can underestimate the true divergence. The bound also assumes the same label space, which fails for cross-task transfer.
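
The empirical proxy mentioned above is often reported as the proxy A-distance $\hat{d}_A = 2(1 - 2\,\mathrm{err})$, where err is the domain classifier's error. A self-contained NumPy sketch on synthetic Gaussian domains (the data, shift size, and the plain logistic classifier are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)

def domain_classifier_error(Xs, Xt, lr=0.1, steps=500):
    """Train a logistic classifier to tell source from target; return its error.
    (Training error for brevity; a held-out split would be more honest.)"""
    X = np.vstack([Xs, Xt])
    d = np.concatenate([np.zeros(len(Xs)), np.ones(len(Xt))])  # domain labels
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        w -= lr * X.T @ (p - d) / len(d)
        b -= lr * np.mean(p - d)
    return np.mean(((X @ w + b) > 0) != (d == 1))

Xs = rng.normal(size=(200, 5))               # source samples
Xt_near = rng.normal(size=(200, 5))          # target from the same distribution
Xt_far = rng.normal(size=(200, 5)) + 2.0     # mean-shifted target domain

err_similar = domain_classifier_error(Xs, Xt_near)
err_shifted = domain_classifier_error(Xs, Xt_far)
for name, err in [("similar", err_similar), ("shifted", err_shifted)]:
    proxy = 2.0 * (1.0 - 2.0 * err)          # proxy A-distance
    print(f"{name} domains: classifier error {err:.2f}, proxy distance {proxy:.2f}")
```

Near-chance classifier error (proxy near 0) suggests low divergence; near-zero error (proxy near 2) flags a large domain gap, where the bound warns transfer may fail.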

When Transfer Fails: Negative Transfer

Negative transfer occurs when pretraining on the source domain hurts target performance compared to training from scratch. This happens when:

  1. Domain mismatch: source and target are too dissimilar. A medical X-ray model pretrained on ImageNet can suffer because natural-image features (textures, color) are largely irrelevant to radiology (where contrast and structure matter).

  2. Label space conflict: source categories interfere with target categories. A model pretrained to distinguish dog breeds may develop features that are too specific to transfer to a general object recognition task.

  3. Overfitting to source: extensive pretraining on a narrow source domain embeds strong inductive biases that are hard to override with limited target data.

Detecting negative transfer in advance is difficult. The empirical practice is to compare against a from-scratch baseline.

Foundation Models

Foundation models (BERT, GPT, ViT) represent the extreme of transfer learning: pretrain a very large model on a very large dataset, then adapt to many downstream tasks. The scaling laws literature studies how model size and data quantity jointly determine downstream performance. The bet is that scale and diversity of pretraining data produce representations that are broadly useful.

This works in practice but has no strong theoretical justification. The Ben-David bound requires low domain divergence, but a foundation model pretrained on internet text is applied to tasks (code generation, medical QA, legal analysis) with very different distributions. The gap between theory and practice here is large, and honest analysis requires acknowledging this.

Practical Choices

When to use feature extraction: target data is very small (hundreds of examples), source and target domains are similar, or compute budget is limited.

When to fine-tune: target data is moderate (thousands of examples), domains differ enough that pretrained features need adaptation, or task requires nuanced output (fine-grained classification).

Learning rate for fine-tuning: typically 10x to 100x smaller than training from scratch. Too large a learning rate destroys pretrained features ("catastrophic forgetting"). Discriminative learning rates (smaller for earlier layers, larger for later layers) often help.
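
Discriminative learning rates are often implemented as a geometric decay from the head backwards. A tiny sketch with hypothetical layer names; the decay factor 2.6 follows a ULMFiT-style heuristic, and both it and the base rate are assumptions to tune:

```python
# Hypothetical layer names, ordered earliest (most general) to latest
# (most task-specific); names and values are illustrative assumptions.
layers = ["conv1", "block1", "block2", "block3", "block4", "head"]

base_lr = 1e-3   # learning rate for the head
decay = 2.6      # per-layer decay factor (ULMFiT-style heuristic)

# Each earlier layer gets a geometrically smaller learning rate.
lrs = {name: base_lr / decay ** (len(layers) - 1 - i)
       for i, name in enumerate(layers)}

for name in layers:
    print(f"{name}: lr = {lrs[name]:.2e}")
```

The same schedule maps directly onto per-layer parameter groups in most deep learning frameworks; early layers barely move, protecting their general features.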

Common Confusions

Watch Out

Transfer learning is not just about shared features

Transfer also works through shared structure in the loss landscape. Pretrained weights provide an initialization in a region of weight space where gradient descent converges to good solutions faster. Even when the learned features are not directly useful, the pretrained initialization can help optimization.

Watch Out

More pretraining data does not always help transfer

If the additional pretraining data increases domain divergence (e.g., adding noisy web data to a curated medical corpus), it can hurt transfer performance. Quality and relevance of pretraining data matter more than raw quantity for downstream performance.

Canonical Examples

Example

ImageNet to medical imaging

A ResNet-50 pretrained on ImageNet (1.2M natural images, 1000 classes) is transferred to a chest X-ray classification task (10,000 labeled X-rays, 14 pathologies). Feature extraction (freeze ResNet, train linear head): AUC 0.82. Fine-tuning (unfreeze last 2 blocks, lr=1e-4): AUC 0.87. Training from scratch: AUC 0.79 (underfits due to limited data). Despite the large domain gap between natural images and X-rays, transfer provides a meaningful boost because low-level features (edges, textures) are still useful.

Exercises

ExerciseCore

Problem

You have a model pretrained on ImageNet and want to classify 500 images of 5 species of butterflies. Should you use feature extraction or fine-tuning? Justify your choice using the domain adaptation bound.

ExerciseAdvanced

Problem

Explain negative transfer using the Ben-David bound. Give a concrete scenario where each of the three terms in the bound ($R_S(h)$, $d_{\mathcal{H}\Delta\mathcal{H}}$, $\lambda^*$) causes transfer to fail.

References

Canonical:

  • Yosinski et al., "How Transferable Are Features in Deep Neural Networks?" (NeurIPS 2014)
  • Ben-David et al., "A Theory of Learning from Different Domains" (Machine Learning, 2010), Sections 2-4

Current:

  • Zhuang et al., "A Comprehensive Survey on Transfer Learning" (Proc. IEEE 2021), Sections 2-5

  • Bommasani et al., "On the Opportunities and Risks of Foundation Models" (2021), Section 4

  • Bishop, Pattern Recognition and Machine Learning (2006), Chapters 1-14

Next Topics

Transfer learning connects to fine-tuning strategies, domain adaptation methods, and the broader study of representation learning.

Last reviewed: April 2026
