Beta. Content is under active construction and has not been peer-reviewed.


Meta-Learning

Learning to learn: find model initializations or embedding spaces that enable fast adaptation to new tasks from few examples. MAML, prototypical networks, and the connection to few-shot learning and in-context learning in LLMs.

Advanced · Tier 2 · ~55 min

Why This Matters

Standard supervised learning assumes abundant labeled data for each task. In many real problems, you have very few labeled examples: 5 examples of a new disease, 10 examples of a new language, 1 example of a user's preferences. Meta-learning addresses this by training across many tasks so that the model can adapt to a new task from very few examples. Unlike standard transfer learning, meta-learning explicitly optimizes for fast adaptation.

The meta-learning framework also provides a theoretical lens for understanding in-context learning in LLMs: the model "learns" a new task from the examples in its prompt without any weight updates.

Problem Formulation

Definition

Meta-Learning Problem

A meta-learning setup consists of:

  1. A distribution over tasks $p(\mathcal{T})$
  2. Each task $\mathcal{T}_i$ has a support set $S_i$ (few labeled examples) and a query set $Q_i$ (test examples)
  3. The meta-learner sees many tasks during training and must generalize to new tasks at test time

In $N$-way $K$-shot classification: each task has $N$ classes with $K$ labeled examples per class in the support set.

The meta-objective is:

$$\min_\theta \mathbb{E}_{\mathcal{T} \sim p(\mathcal{T})} \left[ L(\theta; S_\mathcal{T}, Q_\mathcal{T}) \right]$$

where $L$ measures performance on the query set after adapting using the support set.
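Concretely, episode construction for $N$-way $K$-shot training can be sketched as follows (a minimal NumPy sampler; the dataset layout and all names are illustrative assumptions, not a reference implementation):

```python
import numpy as np

def sample_episode(dataset, n_way=5, k_shot=1, q_queries=3, rng=None):
    """Sample one N-way K-shot task: a support set and a query set.

    dataset: dict mapping class label -> array of shape (num_examples, dim).
    Returns (support_x, support_y, query_x, query_y) with episode-local
    labels 0..n_way-1.
    """
    rng = rng or np.random.default_rng()
    classes = rng.choice(list(dataset.keys()), size=n_way, replace=False)
    support_x, support_y, query_x, query_y = [], [], [], []
    for new_label, c in enumerate(classes):
        idx = rng.choice(len(dataset[c]), size=k_shot + q_queries, replace=False)
        examples = dataset[c][idx]
        support_x.append(examples[:k_shot])          # K shots per class
        support_y += [new_label] * k_shot
        query_x.append(examples[k_shot:])            # held-out queries
        query_y += [new_label] * q_queries
    return (np.concatenate(support_x), np.array(support_y),
            np.concatenate(query_x), np.array(query_y))

# Toy dataset: 10 classes, 20 examples each, 8-dim features
rng = np.random.default_rng(0)
data = {c: rng.normal(size=(20, 8)) for c in range(10)}
sx, sy, qx, qy = sample_episode(data, n_way=5, k_shot=1, q_queries=3, rng=rng)
print(sx.shape, qx.shape)  # (5, 8) (15, 8)
```

The labels are remapped to $0, \dots, N-1$ within each episode, which is what forces the learner to rely on the support set rather than memorizing global class identities.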

Three Families of Meta-Learning

Optimization-based: learn an initialization that can be fine-tuned quickly (MAML, Reptile).

Metric-based: learn an embedding space where classification reduces to nearest neighbor (prototypical networks, matching networks, siamese networks).

Model-based: use a model (e.g., an RNN or transformer) that reads the support set and directly outputs predictions on the query set (SNAIL, neural processes).

MAML (Model-Agnostic Meta-Learning)

Finn, Abbeel, and Levine (2017). The idea: find an initialization $\theta$ such that one or a few gradient steps on any task's support set produce a good classifier for that task.

Theorem

MAML Meta-Gradient

Statement

Let $\theta$ be the meta-parameters. For task $\mathcal{T}_i$ with support set $S_i$, the adapted parameters after one gradient step are:

$$\phi_i = \theta - \alpha \nabla_\theta L(S_i; \theta)$$

The meta-objective is:

$$L_{\text{meta}}(\theta) = \sum_i L(Q_i; \phi_i) = \sum_i L(Q_i; \theta - \alpha \nabla_\theta L(S_i; \theta))$$

The meta-gradient is:

$$\nabla_\theta L_{\text{meta}} = \sum_i \left(I - \alpha \nabla^2_\theta L(S_i; \theta)\right) \nabla_{\phi_i} L(Q_i; \phi_i)$$

This requires computing the Hessian $\nabla^2_\theta L(S_i; \theta)$ or approximating it. The inner loop uses standard gradient descent, while the outer loop differentiates through it.

Intuition

MAML optimizes not for good performance at $\theta$, but for good performance after one gradient step from $\theta$. The meta-gradient asks: how should I change $\theta$ so that a single gradient step on the support set leads to low loss on the query set? This involves differentiating through the inner optimization step.

Proof Sketch

Apply the chain rule. The outer loss is $L(Q_i; \phi_i)$ where $\phi_i = \theta - \alpha \nabla_\theta L(S_i; \theta)$. By the chain rule: $\frac{\partial L(Q_i; \phi_i)}{\partial \theta} = \frac{\partial L}{\partial \phi_i} \cdot \frac{\partial \phi_i}{\partial \theta} = \nabla_{\phi_i} L(Q_i; \phi_i) \cdot \left(I - \alpha \nabla^2_\theta L(S_i; \theta)\right)$.
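The chain rule can be sanity-checked numerically on a single toy task (the quadratic support loss, query loss, step size, and all constants below are arbitrary choices for illustration): the analytic meta-gradient $(I - \alpha \nabla^2_\theta L_S)\,\nabla_\phi L_Q$ matches finite differences of the meta-loss.

```python
import numpy as np

# Support loss L_S(θ) = ½ θᵀAθ − bᵀθ   (gradient Aθ − b, Hessian A)
# Query   loss L_Q(φ) = ½ ‖φ − c‖²     (gradient φ − c)
A = np.array([[2.0, 0.5], [0.5, 1.0]])
b = np.array([1.0, -1.0])
c = np.array([0.5, 2.0])
alpha = 0.1
theta = np.array([0.3, -0.7])

def meta_loss(th):
    phi = th - alpha * (A @ th - b)    # one inner-loop step on the support loss
    return 0.5 * np.sum((phi - c) ** 2)

phi = theta - alpha * (A @ theta - b)
# Analytic meta-gradient: (I − α ∇²L_S) ∇_φ L_Q
maml_grad = (np.eye(2) - alpha * A) @ (phi - c)

# Central finite differences of the meta-loss agree with it
eps = 1e-6
fd = np.array([(meta_loss(theta + eps * e) - meta_loss(theta - eps * e)) / (2 * eps)
               for e in np.eye(2)])
print(np.max(np.abs(maml_grad - fd)) < 1e-6)  # True
```

In a real implementation the Hessian-vector product is obtained by differentiating through the inner step with autodiff (e.g. `create_graph=True` in PyTorch) rather than forming the Hessian explicitly.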

Why It Matters

MAML is model-agnostic: it works with any differentiable model. The same algorithm applies to classification, regression, and reinforcement learning. The key insight is that a good initialization is worth more than a good optimizer when data is scarce.

Failure Mode

Three caveats:

  1. Computing the Hessian $\nabla^2_\theta L$ is expensive. First-order MAML (FOMAML) drops this term, using $\nabla_\theta L_{\text{meta}} \approx \sum_i \nabla_{\phi_i} L(Q_i; \phi_i)$, which works surprisingly well in practice.
  2. MAML assumes all tasks share a common structure that can be captured by a shared initialization. If tasks are highly diverse, a single initialization may not suffice.
  3. Inner-loop optimization with very few steps can underfit on complex tasks.

Prototypical Networks

Snell, Swersky, and Zemel (2017). Instead of learning an initialization, learn an embedding function $f_\theta: \mathcal{X} \to \mathbb{R}^d$ such that classification in the embedding space is simple.

Definition

Prototypical Network

For each class $c$ in the support set, compute the prototype (mean embedding):

$$\mu_c = \frac{1}{|S_c|} \sum_{(x, y) \in S_c} f_\theta(x)$$

Classify a query point $x^*$ by softmax over negative distances:

$$p(y = c \mid x^*) = \frac{\exp(-d(f_\theta(x^*), \mu_c))}{\sum_{c'} \exp(-d(f_\theta(x^*), \mu_{c'}))}$$

where $d$ is typically the squared Euclidean distance.
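The full classification rule is only a few lines (a sketch assuming embeddings $f_\theta(x)$ have already been computed; here raw 2-d points stand in for embeddings, and the toy cluster centers are an arbitrary choice):

```python
import numpy as np

def prototypes(support_emb, support_y, n_way):
    """Class prototypes: mean embedding per class."""
    return np.stack([support_emb[support_y == c].mean(axis=0)
                     for c in range(n_way)])

def proto_predict(query_emb, protos):
    """Softmax over negative squared Euclidean distances to each prototype."""
    d2 = ((query_emb[:, None, :] - protos[None, :, :]) ** 2).sum(axis=-1)
    logits = -d2
    logits -= logits.max(axis=1, keepdims=True)     # numerical stability
    p = np.exp(logits)
    return p / p.sum(axis=1, keepdims=True)

# Toy 3-way 2-shot episode in a 2-d "embedding" space
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 4.0]])
sx = np.repeat(centers, 2, axis=0) + 0.1 * rng.normal(size=(6, 2))
sy = np.repeat(np.arange(3), 2)
protos = prototypes(sx, sy, n_way=3)
qx = centers + 0.1 * rng.normal(size=(3, 2))
probs = proto_predict(qx, protos)
print(probs.argmax(axis=1))  # [0 1 2]
```

At training time the episode loss is the cross-entropy of `probs` against the query labels, backpropagated through $f_\theta$; the prototypes themselves have no parameters.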

Proposition

Prototypical Networks as Linear Classifiers

Statement

With squared Euclidean distance, the prototypical network decision rule is equivalent to a nearest-centroid classifier in the embedding space, and hence to a linear softmax classifier: the decision boundary between classes $c$ and $c'$ is the perpendicular bisector of the segment joining $\mu_c$ and $\mu_{c'}$.

Intuition

The softmax over negative squared distances is a linear softmax classifier whose "weights" are (twice) the prototypes and whose "bias" terms are the negative squared prototype norms. The embedding network does the hard work; the classifier itself is as simple as possible.

Proof Sketch

$-\|f_\theta(x^*) - \mu_c\|^2 = -\|f_\theta(x^*)\|^2 + 2\mu_c^T f_\theta(x^*) - \|\mu_c\|^2$. The first term is constant across classes and cancels in the softmax. The remaining $2\mu_c^T f_\theta(x^*) - \|\mu_c\|^2$ is a linear function of $f_\theta(x^*)$ with weight $2\mu_c$ and bias $-\|\mu_c\|^2$.
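This identity is easy to verify numerically (random query and prototypes; the two logit vectors differ only by a per-query constant, so the softmax outputs coincide):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=4)              # one query embedding
mus = rng.normal(size=(5, 4))       # five class prototypes

def softmax(z):
    z = z - z.max()                 # shift-invariant, numerically stable
    e = np.exp(z)
    return e / e.sum()

# Softmax over negative squared distances ...
dist_logits = -np.sum((x - mus) ** 2, axis=1)
# ... equals a linear softmax with weights 2μ_c and bias −‖μ_c‖²
lin_logits = 2 * mus @ x - np.sum(mus ** 2, axis=1)

print(np.allclose(softmax(dist_logits), softmax(lin_logits)))  # True
```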

Why It Matters

Prototypical networks are simpler and faster than MAML (no inner-loop optimization, no second-order gradients). They work well when a good embedding can be learned and tasks differ mainly in which classes appear.

Failure Mode

The mean embedding (prototype) is sensitive to outliers in the support set. With $K = 1$ (one-shot), the prototype is a single example, which may not be representative. For tasks requiring complex decision boundaries, the linear classifier in embedding space may be insufficient.

Connection to In-Context Learning

LLMs perform something resembling meta-learning at test time. When given a prompt with input-output examples, the model adapts its behavior to the demonstrated pattern without weight updates. This in-context learning (ICL) can be viewed through the meta-learning lens:

  • Training on diverse web text is the meta-training phase (exposing the model to many "tasks")
  • The prompt examples are the support set
  • The model's completion is the prediction on the query set

Garg et al. (2022) showed that transformers trained on random linear regression tasks learn to implement ridge regression in context. This provides evidence that transformers can learn optimization algorithms implicitly within their forward pass.
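The target algorithm in that study can be written down directly: ridge regression fit on the in-context (support) examples predicts the query. The sketch below shows what the transformer's forward pass is implicitly approximating; dimensions, noise level, and the penalty $\lambda$ are arbitrary choices here.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_prompt = 5, 20
w_true = rng.normal(size=d)                    # one task drawn from p(T)
X = rng.normal(size=(n_prompt, d))             # in-context examples (support set)
y = X @ w_true + 0.01 * rng.normal(size=n_prompt)

lam = 0.1                                      # ridge penalty (illustrative)
# Closed-form ridge solution: (XᵀX + λI)⁻¹ Xᵀy
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

x_query = rng.normal(size=d)                   # the "query set" is one point
pred, target = x_query @ w_ridge, x_query @ w_true
print(pred, target)                            # nearly agree when noise is small
```

In the meta-learning vocabulary: the distribution over random $w$ is $p(\mathcal{T})$, the prompt pairs are $S_\mathcal{T}$, and the final completion is the prediction on $Q_\mathcal{T}$.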

Common Confusions

Watch Out

Meta-learning is not transfer learning

Transfer learning pre-trains on one task and fine-tunes on another. Meta-learning trains on a distribution of tasks and learns to adapt quickly. The distinction: transfer learning requires fine-tuning on the new task (many gradient steps, some data); meta-learning aims for adaptation from very few examples, sometimes without gradient steps.

Watch Out

FOMAML is not just ignoring the Hessian for no reason

First-order MAML drops the Hessian term from the meta-gradient. This is not laziness: Nichol, Achiam, and Schulman (2018) showed that FOMAML and Reptile (a related first-order method) approximate the full MAML gradient in expectation over tasks. The Hessian term matters most when the inner learning rate is large or the loss landscape is highly curved.
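The last point can be illustrated on a toy quadratic task (curvature matrix, targets, and initialization are arbitrary choices; the inner loss is $\frac{1}{2}\theta^T A \theta$ and the query loss is $\frac{1}{2}\|\phi - c\|^2$): the relative gap between the full and first-order meta-gradients grows with the inner learning rate $\alpha$.

```python
import numpy as np

A = np.array([[3.0, 1.0], [1.0, 2.0]])    # Hessian (curvature) of the inner loss
c = np.array([1.0, -1.0])                 # query-loss target
theta = np.array([0.5, 0.5])

gaps = []
for alpha in [0.01, 0.1, 0.3]:
    phi = theta - alpha * (A @ theta)               # one inner gradient step
    full = (np.eye(2) - alpha * A) @ (phi - c)      # full MAML meta-gradient
    first = phi - c                                 # FOMAML: Hessian term dropped
    gaps.append(np.linalg.norm(full - first) / np.linalg.norm(full))

print([round(g, 3) for g in gaps])  # gap grows with the inner learning rate
```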

Exercises

ExerciseCore

Problem

In a 5-way 1-shot classification problem using prototypical networks, each prototype is a single embedded example. The query point $x^*$ has embedding $f_\theta(x^*) = [1, 0]$ and the five prototypes are $\mu_1 = [0.8, 0.2]$, $\mu_2 = [3.0, 1.0]$, $\mu_3 = [-1.0, 0.5]$, $\mu_4 = [0.5, -0.5]$, $\mu_5 = [2.0, 2.0]$. Which class is predicted using squared Euclidean distance?

ExerciseAdvanced

Problem

Derive the FOMAML meta-gradient by dropping the Hessian term from the full MAML meta-gradient. Under what conditions does the dropped term vanish exactly (making FOMAML exact)?

References

Canonical:

  • Finn, Abbeel, Levine, "Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks", ICML 2017
  • Snell, Swersky, Zemel, "Prototypical Networks for Few-shot Learning", NeurIPS 2017

Current:

  • Nichol, Achiam, Schulman, "On First-Order Meta-Learning Algorithms" (Reptile), 2018
  • Garg et al., "What Can Transformers Learn In-Context? A Case Study of Simple Function Classes", NeurIPS 2022
  • Hospedales et al., "Meta-Learning in Neural Networks: A Survey", IEEE TPAMI 2022

Last reviewed: April 2026
