Meta-Learning
Learning to learn: find model initializations or embedding spaces that enable fast adaptation to new tasks from few examples. MAML, prototypical networks, and the connection to few-shot learning and in-context learning in LLMs.
Why This Matters
Standard supervised learning assumes abundant labeled data for each task. In many real problems, you have very few labeled examples: 5 examples of a new disease, 10 examples of a new language, 1 example of a user's preferences. Meta-learning addresses this by training across many tasks so that the model can adapt to a new task from very few examples. Unlike standard transfer learning, meta-learning explicitly optimizes for fast adaptation.
The meta-learning framework also provides a theoretical lens for understanding in-context learning in LLMs: the model "learns" a new task from the examples in its prompt without any weight updates.
Problem Formulation
Meta-Learning Problem
A meta-learning setup consists of:
- A distribution over tasks
- Each task has a support set (few labeled examples) and a query set (test examples)
- The meta-learner sees many tasks during training and must generalize to new tasks at test time
In $N$-way $K$-shot classification: each task has $N$ classes with $K$ labeled examples per class in the support set.
The meta-objective is:

$$\min_\theta \; \mathbb{E}_{\mathcal{T} \sim p(\mathcal{T})}\left[\mathcal{L}_{\mathcal{T}}\big(\mathrm{Adapt}(\theta, S_{\mathcal{T}}),\, Q_{\mathcal{T}}\big)\right]$$

where $\mathcal{L}_{\mathcal{T}}$ measures performance on the query set $Q_{\mathcal{T}}$ after adapting the parameters $\theta$ using the support set $S_{\mathcal{T}}$.
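The episodic setup above can be sketched in code. Below is a minimal NumPy sampler for $N$-way $K$-shot episodes; the `data` dictionary and all function names are illustrative assumptions, not part of any specific library:

```python
import numpy as np

def sample_task(data, n_way=5, k_shot=1, q_queries=3, rng=None):
    """Sample one N-way K-shot episode: a support set and a query set.

    data: dict mapping class label -> array of examples for that class
          (a hypothetical labeled pool, assumed for this sketch).
    Returns (support_x, support_y, query_x, query_y) with labels relabeled 0..n_way-1.
    """
    if rng is None:
        rng = np.random.default_rng()
    # Pick which N classes appear in this task.
    classes = rng.choice(list(data.keys()), size=n_way, replace=False)
    support_x, support_y, query_x, query_y = [], [], [], []
    for new_label, cls in enumerate(classes):
        # K support + q_queries query examples per class, without overlap.
        idx = rng.choice(len(data[cls]), size=k_shot + q_queries, replace=False)
        examples = data[cls][idx]
        support_x.extend(examples[:k_shot])
        support_y.extend([new_label] * k_shot)
        query_x.extend(examples[k_shot:])
        query_y.extend([new_label] * q_queries)
    return (np.array(support_x), np.array(support_y),
            np.array(query_x), np.array(query_y))
```

Meta-training then loops: sample a task, adapt on its support set, evaluate on its query set, and update the meta-parameters.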
Three Families of Meta-Learning
Optimization-based: learn an initialization that can be fine-tuned quickly (MAML, Reptile).
Metric-based: learn an embedding space where classification reduces to nearest neighbor (prototypical networks, matching networks, siamese networks).
Model-based: use a model (e.g., an RNN or transformer) that reads the support set and directly outputs predictions on the query set (SNAIL, neural processes).
MAML (Model-Agnostic Meta-Learning)
Finn, Abbeel, and Levine (2017). The idea: find an initialization such that one or a few gradient steps on any task's support set produces a good classifier for that task.
MAML Meta-Gradient
Statement
Let $\theta$ be the meta-parameters. For task $\mathcal{T}_i$ with support set $S_i$, the adapted parameters after one gradient step with inner learning rate $\alpha$ are:

$$\theta_i' = \theta - \alpha \nabla_\theta \mathcal{L}_{S_i}(\theta)$$

The meta-objective is:

$$\min_\theta \sum_i \mathcal{L}_{Q_i}(\theta_i')$$

The meta-gradient is:

$$\nabla_\theta \mathcal{L}_{Q_i}(\theta_i') = \left(I - \alpha \nabla^2_\theta \mathcal{L}_{S_i}(\theta)\right) \nabla_{\theta'} \mathcal{L}_{Q_i}(\theta_i')$$

This requires computing the Hessian $\nabla^2_\theta \mathcal{L}_{S_i}(\theta)$ or approximating it. The inner loop uses standard gradient descent, while the outer loop differentiates through it.
Intuition
MAML optimizes not for good performance at $\theta$ itself, but for good performance after one gradient step from $\theta$. The meta-gradient asks: how should I change $\theta$ so that a single gradient step on the support set leads to low loss on the query set? This involves differentiating through the inner optimization step.
Proof Sketch
Apply the chain rule. The outer loss is $\mathcal{L}_{Q_i}(\theta_i')$ where $\theta_i' = \theta - \alpha \nabla_\theta \mathcal{L}_{S_i}(\theta)$. By the chain rule, $\nabla_\theta \mathcal{L}_{Q_i}(\theta_i') = \left(\frac{\partial \theta_i'}{\partial \theta}\right)^{\top} \nabla_{\theta'} \mathcal{L}_{Q_i}(\theta_i')$, and $\frac{\partial \theta_i'}{\partial \theta} = I - \alpha \nabla^2_\theta \mathcal{L}_{S_i}(\theta)$. Since the Hessian is symmetric, the transpose can be dropped, giving the stated meta-gradient.
Why It Matters
MAML is model-agnostic: it works with any differentiable model. The same algorithm applies to classification, regression, and reinforcement learning. The key insight is that a good initialization is worth more than a good optimizer when data is scarce.
Failure Mode
(1) Computing the Hessian is expensive. First-order MAML (FOMAML) drops this term, using $\nabla_{\theta'} \mathcal{L}_{Q_i}(\theta_i')$ as the meta-gradient, which works surprisingly well in practice. (2) MAML assumes all tasks share a common structure that can be captured by a shared initialization. If tasks are highly diverse, a single initialization may not suffice. (3) Inner-loop optimization with very few steps can underfit on complex tasks.
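The relationship between the full and first-order meta-gradients can be checked on a toy problem. For a quadratic task loss $\mathcal{L}(\theta) = \frac{1}{2}\|\theta - t\|^2$ the gradient and Hessian are analytic (the Hessian is the identity), so the exact MAML meta-gradient and its FOMAML approximation can be computed and compared directly. This is a minimal numerical sketch, not a training implementation:

```python
import numpy as np

def maml_meta_grads(theta, t_support, t_query, alpha):
    """Exact MAML and FOMAML meta-gradients for quadratic task losses
    L_S(th) = 0.5||th - t_support||^2 and L_Q(th) = 0.5||th - t_query||^2."""
    grad_s = theta - t_support               # inner-loop gradient of L_S
    theta_prime = theta - alpha * grad_s     # one inner gradient step
    grad_q = theta_prime - t_query           # gradient of L_Q at adapted params
    hessian_s = np.eye(len(theta))           # Hessian of the quadratic L_S
    full = (np.eye(len(theta)) - alpha * hessian_s) @ grad_q  # exact meta-gradient
    fomaml = grad_q                          # first-order: drop the Hessian term
    return full, fomaml

theta = np.array([1.0, -2.0])
full, fo = maml_meta_grads(theta,
                           t_support=np.array([0.5, 0.0]),
                           t_query=np.array([1.5, -1.0]),
                           alpha=0.1)
# With an identity Hessian, full = (1 - alpha) * fomaml exactly:
# FOMAML points in the right direction but is rescaled.
```

Here the two gradients differ only by the scalar factor $(1 - \alpha)$; with a general Hessian the correction also rotates the gradient, which is when dropping it matters most.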
Prototypical Networks
Snell, Swersky, and Zemel (2017). Instead of learning an initialization, learn an embedding function $f_\phi$ such that classification in the embedding space is simple.
Prototypical Network
For each class $k$, compute the prototype (mean embedding) of its support examples $S_k$:

$$c_k = \frac{1}{|S_k|} \sum_{(x_i, y_i) \in S_k} f_\phi(x_i)$$

Classify a query point $x$ by softmax over negative distances:

$$p(y = k \mid x) = \frac{\exp\!\big(-d(f_\phi(x), c_k)\big)}{\sum_{k'} \exp\!\big(-d(f_\phi(x), c_{k'})\big)}$$

where $d$ is typically the squared Euclidean distance.
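The prototype computation and softmax-over-distances rule above can be sketched in a few lines of NumPy. This sketch assumes the embeddings have already been produced by some $f_\phi$; function names are illustrative:

```python
import numpy as np

def prototypes(support_emb, support_y, n_way):
    """Class prototypes: mean embedding of each class's support examples."""
    return np.stack([support_emb[support_y == k].mean(axis=0)
                     for k in range(n_way)])

def proto_predict(query_emb, protos):
    """Class probabilities: softmax over negative squared Euclidean
    distances from each query embedding to each prototype."""
    # (n_query, 1, dim) - (1, n_way, dim) -> pairwise squared distances
    d2 = ((query_emb[:, None, :] - protos[None, :, :]) ** 2).sum(axis=-1)
    logits = -d2
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    exp = np.exp(logits)
    return exp / exp.sum(axis=1, keepdims=True)
```

At meta-training time, the cross-entropy of these probabilities on the query set is backpropagated through $f_\phi$; the classifier head itself has no parameters.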
Prototypical Networks as Linear Classifiers
Statement
With squared Euclidean distance, the prototypical network decision rule is equivalent to a nearest-centroid classifier in the embedding space. This is a linear classifier: the decision boundary between classes $k$ and $k'$ is the perpendicular bisector of the segment joining $c_k$ and $c_{k'}$.
Intuition
The softmax over negative squared distances is a linear softmax classifier where the "weights" are the prototypes and the "bias" terms are the negative squared norms of the prototypes. The embedding network does the hard work; the classifier itself is as simple as possible.
Proof Sketch
$-\|f_\phi(x) - c_k\|^2 = -\|f_\phi(x)\|^2 + 2\,c_k^\top f_\phi(x) - \|c_k\|^2$. The first term is constant across classes and cancels in the softmax. The remaining $2\,c_k^\top f_\phi(x) - \|c_k\|^2$ is a linear function of $f_\phi(x)$ with weight $2 c_k$ and bias $-\|c_k\|^2$.
Why It Matters
Prototypical networks are simpler and faster than MAML (no inner-loop optimization, no second-order gradients). They work well when a good embedding can be learned and tasks differ mainly in which classes appear.
Failure Mode
The mean embedding (prototype) is sensitive to outliers in the support set. With $K = 1$ (one-shot), the prototype is a single example, which may not be representative. For tasks requiring complex decision boundaries, the linear classifier in embedding space may be insufficient.
Connection to In-Context Learning
LLMs perform something resembling meta-learning at test time. When given a prompt with input-output examples, the model adapts its behavior to the demonstrated pattern without weight updates. This in-context learning (ICL) can be viewed through the meta-learning lens:
- Training on diverse web text is the meta-training phase (exposing the model to many "tasks")
- The prompt examples are the support set
- The model's completion is the prediction on the query set
Garg et al. (2022) showed that transformers trained on random linear regression tasks learn to implement ridge regression in context. This provides evidence that transformers can learn optimization algorithms implicitly within their forward pass.
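The in-context baseline in that setting can be written down directly: given a prompt's $(x, y)$ pairs drawn from a random linear task, ridge regression fit on those pairs is (under Gaussian assumptions) the predictor the trained transformer appears to match. A minimal NumPy sketch of that baseline, with illustrative names:

```python
import numpy as np

def ridge_predict(xs, ys, x_query, lam=0.1):
    """Fit ridge regression on the in-context examples (the 'support set')
    and evaluate on the query input -- the algorithm a transformer trained
    on random linear tasks appears to implement within its forward pass."""
    d = xs.shape[1]
    w = np.linalg.solve(xs.T @ xs + lam * np.eye(d), xs.T @ ys)
    return x_query @ w

rng = np.random.default_rng(1)
w_true = rng.normal(size=4)
xs = rng.normal(size=(10, 4))   # in-context (prompt) inputs
ys = xs @ w_true                # noiseless linear labels
x_query = rng.normal(size=4)
pred = ridge_predict(xs, ys, x_query, lam=1e-6)
# With noiseless labels and tiny regularization, pred ~ x_query @ w_true.
```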
Common Confusions
Meta-learning is not transfer learning
Transfer learning pre-trains on one task and fine-tunes on another. Meta-learning trains on a distribution of tasks and learns to adapt quickly. The distinction: transfer learning requires fine-tuning on the new task (many gradient steps, some data); meta-learning aims for adaptation from very few examples, sometimes without gradient steps.
FOMAML is not just ignoring the Hessian for no reason
First-order MAML drops the Hessian term from the meta-gradient. This is not laziness: Nichol, Achiam, and Schulman (2018) showed that FOMAML and Reptile (a related first-order method) approximate the full MAML gradient in expectation over tasks. The Hessian term matters most when the inner learning rate is large or the loss landscape is highly curved.
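Reptile's update is simple enough to state in full: run a few plain SGD steps on a sampled task, then move the meta-parameters toward the adapted parameters. A toy sketch on quadratic task losses (all names and the task distribution are illustrative assumptions):

```python
import numpy as np

def reptile_step(theta, task_target, inner_lr=0.1, inner_steps=5, meta_lr=0.5):
    """One Reptile meta-update on a toy task with loss 0.5||th - target||^2.
    Inner loop: a few plain SGD steps. Meta-update: interpolate theta toward
    the adapted parameters -- no second-order gradients required."""
    th = theta.copy()
    for _ in range(inner_steps):
        th -= inner_lr * (th - task_target)   # gradient of the quadratic loss
    return theta + meta_lr * (th - theta)

rng = np.random.default_rng(2)
targets = rng.normal(size=(200, 3))           # task distribution: random targets
theta = np.zeros(3)
for t in targets:
    theta = reptile_step(theta, t)
# theta drifts toward the center of the task distribution -- a shared
# initialization from which each task is reachable in a few inner steps.
```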
Exercises
Problem
In a 5-way 1-shot classification problem using prototypical networks, each prototype is a single embedded example. Given a query embedding $q$ and prototypes $c_1, \dots, c_5$, express the predicted class under squared Euclidean distance, and explain why applying the softmax cannot change which class is predicted.
Problem
Derive the FOMAML meta-gradient by dropping the Hessian term from the full MAML meta-gradient. Under what conditions does the dropped term vanish exactly (making FOMAML exact)?
References
Canonical:
- Finn, Abbeel, Levine, "Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks", ICML 2017
- Snell, Swersky, Zemel, "Prototypical Networks for Few-shot Learning", NeurIPS 2017
Current:
- Nichol, Achiam, Schulman, "On First-Order Meta-Learning Algorithms" (Reptile), 2018
- Garg et al., "What Can Transformers Learn In-Context? A Case Study of Simple Function Classes", NeurIPS 2022
- Hospedales et al., "Meta-Learning in Neural Networks: A Survey", IEEE TPAMI 2022
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Feedforward Networks and Backpropagation (Layer 2)
- Differentiation in $\mathbb{R}^n$ (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
- Matrix Calculus (Layer 1)
- The Jacobian Matrix (Layer 0A)
- The Hessian Matrix (Layer 0A)
- Activation Functions (Layer 1)
- Convex Optimization Basics (Layer 1)
- Matrix Operations and Properties (Layer 0A)