Beta. Content is under active construction and has not been peer-reviewed.


Meta-Learning

Learning to learn: find model initializations or embedding spaces that enable fast adaptation to new tasks from few examples. MAML, prototypical networks, and the connection to few-shot learning and in-context learning in LLMs.

Advanced · Tier 2 · ~55 min

Why This Matters

Standard supervised learning assumes abundant labeled data for each task. In many real problems, you have very few labeled examples: 5 examples of a new disease, 10 examples of a new language, 1 example of a user's preferences. Meta-learning addresses this by training across many tasks so that the model can adapt to a new task from very few examples. Unlike standard transfer learning, meta-learning explicitly optimizes for fast adaptation.

The meta-learning framework also provides a theoretical lens for understanding in-context learning in LLMs: the model "learns" a new task from the examples in its prompt without any weight updates.

Problem Formulation

Definition

Meta-Learning Problem

A meta-learning setup consists of:

  1. A distribution over tasks $p(\mathcal{T})$
  2. Each task $\mathcal{T}_i$ has a support set $S_i$ (few labeled examples) and a query set $Q_i$ (test examples)
  3. The meta-learner sees many tasks during training and must generalize to new tasks at test time

In $N$-way $K$-shot classification: each task has $N$ classes with $K$ labeled examples per class in the support set.

The meta-objective is:

$$\min_\theta \mathbb{E}_{\mathcal{T} \sim p(\mathcal{T})} \left[ L(\theta; S_\mathcal{T}, Q_\mathcal{T}) \right]$$

where $L$ measures performance on the query set after adapting using the support set.
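Concretely, episode construction for $N$-way $K$-shot training can be sketched as follows (a minimal NumPy sampler; the dataset layout and all names are illustrative assumptions, not a reference implementation):

```python
import numpy as np

def sample_episode(dataset, n_way=5, k_shot=1, q_queries=3, rng=None):
    """Sample one N-way K-shot task: a support set and a query set.

    dataset: dict mapping class label -> array of shape (num_examples, dim).
    Returns (support_x, support_y, query_x, query_y) with episode-local
    labels 0..n_way-1.
    """
    rng = rng or np.random.default_rng()
    classes = rng.choice(list(dataset.keys()), size=n_way, replace=False)
    support_x, support_y, query_x, query_y = [], [], [], []
    for new_label, c in enumerate(classes):
        idx = rng.choice(len(dataset[c]), size=k_shot + q_queries, replace=False)
        examples = dataset[c][idx]
        support_x.append(examples[:k_shot])          # K shots per class
        support_y += [new_label] * k_shot
        query_x.append(examples[k_shot:])            # held-out queries
        query_y += [new_label] * q_queries
    return (np.concatenate(support_x), np.array(support_y),
            np.concatenate(query_x), np.array(query_y))

# Toy dataset: 10 classes, 20 examples each, 8-dim features
rng = np.random.default_rng(0)
data = {c: rng.normal(size=(20, 8)) for c in range(10)}
sx, sy, qx, qy = sample_episode(data, n_way=5, k_shot=1, q_queries=3, rng=rng)
print(sx.shape, qx.shape)  # (5, 8) (15, 8)
```

The labels are remapped to $0, \dots, N-1$ within each episode, which is what forces the learner to rely on the support set rather than memorizing global class identities.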

Three Families of Meta-Learning

Optimization-based: learn an initialization that can be fine-tuned quickly (MAML, Reptile).

Metric-based: learn an embedding space where classification reduces to nearest neighbor (prototypical networks, matching networks, siamese networks).

Model-based: use a model (e.g., an RNN or transformer) that reads the support set and directly outputs predictions on the query set (SNAIL, neural processes).

MAML (Model-Agnostic Meta-Learning)

Finn, Abbeel, and Levine (2017). The idea: find an initialization $\theta$ such that one or a few gradient steps on any task's support set produce a good classifier for that task.

Theorem

MAML Meta-Gradient

Statement

Let $\theta$ be the meta-parameters. For task $\mathcal{T}_i$ with support set $S_i$, the adapted parameters after one gradient step are:

$$\phi_i = \theta - \alpha \nabla_\theta L(S_i; \theta)$$

The meta-objective is:

$$L_{\text{meta}}(\theta) = \sum_i L(Q_i; \phi_i) = \sum_i L(Q_i; \theta - \alpha \nabla_\theta L(S_i; \theta))$$

The meta-gradient is:

$$\nabla_\theta L_{\text{meta}} = \sum_i \left(I - \alpha \nabla^2_\theta L(S_i; \theta)\right) \nabla_{\phi_i} L(Q_i; \phi_i)$$

This requires computing the Hessian $\nabla^2_\theta L(S_i; \theta)$ or approximating it. The inner loop uses standard gradient descent, while the outer loop differentiates through it.

Intuition

MAML optimizes not for good performance at $\theta$, but for good performance after one gradient step from $\theta$. The meta-gradient asks: how should I change $\theta$ so that a single gradient step on the support set leads to low loss on the query set? This involves differentiating through the inner optimization step.

Proof Sketch

Apply the chain rule. The outer loss is $L(Q_i; \phi_i)$ where $\phi_i = \theta - \alpha \nabla_\theta L(S_i; \theta)$. By the chain rule: $\frac{\partial L(Q_i; \phi_i)}{\partial \theta} = \frac{\partial L}{\partial \phi_i} \cdot \frac{\partial \phi_i}{\partial \theta} = \nabla_{\phi_i} L(Q_i; \phi_i) \cdot \left(I - \alpha \nabla^2_\theta L(S_i; \theta)\right)$.
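The chain rule can be sanity-checked numerically on a single toy task (the quadratic support loss, query loss, step size, and all constants below are arbitrary choices for illustration): the analytic meta-gradient $(I - \alpha \nabla^2_\theta L_S)\,\nabla_\phi L_Q$ matches finite differences of the meta-loss.

```python
import numpy as np

# Support loss L_S(θ) = ½ θᵀAθ − bᵀθ   (gradient Aθ − b, Hessian A)
# Query   loss L_Q(φ) = ½ ‖φ − c‖²     (gradient φ − c)
A = np.array([[2.0, 0.5], [0.5, 1.0]])
b = np.array([1.0, -1.0])
c = np.array([0.5, 2.0])
alpha = 0.1
theta = np.array([0.3, -0.7])

def meta_loss(th):
    phi = th - alpha * (A @ th - b)    # one inner-loop step on the support loss
    return 0.5 * np.sum((phi - c) ** 2)

phi = theta - alpha * (A @ theta - b)
# Analytic meta-gradient: (I − α ∇²L_S) ∇_φ L_Q
maml_grad = (np.eye(2) - alpha * A) @ (phi - c)

# Central finite differences of the meta-loss agree with it
eps = 1e-6
fd = np.array([(meta_loss(theta + eps * e) - meta_loss(theta - eps * e)) / (2 * eps)
               for e in np.eye(2)])
print(np.max(np.abs(maml_grad - fd)) < 1e-6)  # True
```

In a real implementation the Hessian-vector product is obtained by differentiating through the inner step with autodiff (e.g. `create_graph=True` in PyTorch) rather than forming the Hessian explicitly.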

Why It Matters

MAML is model-agnostic: it works with any differentiable model. The same algorithm applies to classification, regression, and reinforcement learning. The key insight is that a good initialization is worth more than a good optimizer when data is scarce.

Failure Mode

Three caveats:

  1. Computing the Hessian $\nabla^2_\theta L$ is expensive. First-order MAML (FOMAML) drops this term, using $\nabla_\theta L_{\text{meta}} \approx \sum_i \nabla_{\phi_i} L(Q_i; \phi_i)$, which works surprisingly well in practice.
  2. MAML assumes all tasks share a common structure that can be captured by a shared initialization. If tasks are highly diverse, a single initialization may not suffice.
  3. Inner-loop optimization with very few steps can underfit on complex tasks.

Prototypical Networks

Snell, Swersky, and Zemel (2017). Instead of learning an initialization, learn an embedding function $f_\theta: \mathcal{X} \to \mathbb{R}^d$ such that classification in the embedding space is simple.

Definition

Prototypical Network

For each class $c$ in the support set, compute the prototype (mean embedding):

$$\mu_c = \frac{1}{|S_c|} \sum_{(x, y) \in S_c} f_\theta(x)$$

Classify a query point $x^*$ by softmax over negative distances:

$$p(y = c \mid x^*) = \frac{\exp(-d(f_\theta(x^*), \mu_c))}{\sum_{c'} \exp(-d(f_\theta(x^*), \mu_{c'}))}$$

where $d$ is typically the squared Euclidean distance.
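The full classification rule is only a few lines (a sketch assuming embeddings $f_\theta(x)$ have already been computed; here raw 2-d points stand in for embeddings, and the toy cluster centers are an arbitrary choice):

```python
import numpy as np

def prototypes(support_emb, support_y, n_way):
    """Class prototypes: mean embedding per class."""
    return np.stack([support_emb[support_y == c].mean(axis=0)
                     for c in range(n_way)])

def proto_predict(query_emb, protos):
    """Softmax over negative squared Euclidean distances to each prototype."""
    d2 = ((query_emb[:, None, :] - protos[None, :, :]) ** 2).sum(axis=-1)
    logits = -d2
    logits -= logits.max(axis=1, keepdims=True)     # numerical stability
    p = np.exp(logits)
    return p / p.sum(axis=1, keepdims=True)

# Toy 3-way 2-shot episode in a 2-d "embedding" space
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 4.0]])
sx = np.repeat(centers, 2, axis=0) + 0.1 * rng.normal(size=(6, 2))
sy = np.repeat(np.arange(3), 2)
protos = prototypes(sx, sy, n_way=3)
qx = centers + 0.1 * rng.normal(size=(3, 2))
probs = proto_predict(qx, protos)
print(probs.argmax(axis=1))  # [0 1 2]
```

At training time the episode loss is the cross-entropy of `probs` against the query labels, backpropagated through $f_\theta$; the prototypes themselves have no parameters.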

Proposition

Prototypical Networks as Linear Classifiers

Statement

With squared Euclidean distance, the prototypical network decision rule is equivalent to a nearest-centroid classifier in the embedding space, and hence to a linear softmax classifier: the decision boundary between classes $c$ and $c'$ is the perpendicular bisector of the segment joining $\mu_c$ and $\mu_{c'}$.

Intuition

The softmax over negative squared distances is a linear softmax classifier whose "weights" are (twice) the prototypes and whose "bias" terms are the negative squared prototype norms. The embedding network does the hard work; the classifier itself is as simple as possible.

Proof Sketch

$-\|f_\theta(x^*) - \mu_c\|^2 = -\|f_\theta(x^*)\|^2 + 2\mu_c^T f_\theta(x^*) - \|\mu_c\|^2$. The first term is constant across classes and cancels in the softmax. The remaining $2\mu_c^T f_\theta(x^*) - \|\mu_c\|^2$ is a linear function of $f_\theta(x^*)$ with weight $2\mu_c$ and bias $-\|\mu_c\|^2$.
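This identity is easy to verify numerically (random query and prototypes; the two logit vectors differ only by a per-query constant, so the softmax outputs coincide):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=4)              # one query embedding
mus = rng.normal(size=(5, 4))       # five class prototypes

def softmax(z):
    z = z - z.max()                 # shift-invariant, numerically stable
    e = np.exp(z)
    return e / e.sum()

# Softmax over negative squared distances ...
dist_logits = -np.sum((x - mus) ** 2, axis=1)
# ... equals a linear softmax with weights 2μ_c and bias −‖μ_c‖²
lin_logits = 2 * mus @ x - np.sum(mus ** 2, axis=1)

print(np.allclose(softmax(dist_logits), softmax(lin_logits)))  # True
```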

Why It Matters

Prototypical networks are simpler and faster than MAML (no inner-loop optimization, no second-order gradients). They work well when a good embedding can be learned and tasks differ mainly in which classes appear.

Failure Mode

The mean embedding (prototype) is sensitive to outliers in the support set. With $K = 1$ (one-shot), the prototype is a single example, which may not be representative. For tasks requiring complex decision boundaries, the linear classifier in embedding space may be insufficient.

Connection to In-Context Learning

LLMs perform something resembling meta-learning at test time. When given a prompt with input-output examples, the model adapts its behavior to the demonstrated pattern without weight updates. This in-context learning (ICL) can be viewed through the meta-learning lens:

  • Training on diverse web text is the meta-training phase (exposing the model to many "tasks")
  • The prompt examples are the support set
  • The model's completion is the prediction on the query set

Garg et al. (2022) showed that transformers trained on random linear regression tasks learn to implement ridge regression in context. This provides evidence that transformers can learn optimization algorithms implicitly within their forward pass.
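The target algorithm in that study can be written down directly: ridge regression fit on the in-context (support) examples predicts the query. The sketch below shows what the transformer's forward pass is implicitly approximating; dimensions, noise level, and the penalty $\lambda$ are arbitrary choices here.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_prompt = 5, 20
w_true = rng.normal(size=d)                    # one task drawn from p(T)
X = rng.normal(size=(n_prompt, d))             # in-context examples (support set)
y = X @ w_true + 0.01 * rng.normal(size=n_prompt)

lam = 0.1                                      # ridge penalty (illustrative)
# Closed-form ridge solution: (XᵀX + λI)⁻¹ Xᵀy
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

x_query = rng.normal(size=d)                   # the "query set" is one point
pred, target = x_query @ w_ridge, x_query @ w_true
print(pred, target)                            # nearly agree when noise is small
```

In the meta-learning vocabulary: the distribution over random $w$ is $p(\mathcal{T})$, the prompt pairs are $S_\mathcal{T}$, and the final completion is the prediction on $Q_\mathcal{T}$.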

Common Confusions

Watch Out

Meta-learning is not transfer learning

Transfer learning pre-trains on one task and fine-tunes on another. Meta-learning trains on a distribution of tasks and learns to adapt quickly. The distinction: transfer learning requires fine-tuning on the new task (many gradient steps, some data); meta-learning aims for adaptation from very few examples, sometimes without gradient steps.

Watch Out

FOMAML is not just ignoring the Hessian for no reason

First-order MAML drops the Hessian term from the meta-gradient. This is not laziness: Nichol, Achiam, and Schulman (2018) showed that FOMAML and Reptile (a related first-order method) approximate the full MAML gradient in expectation over tasks. The Hessian term matters most when the inner learning rate is large or the loss landscape is highly curved.
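The last point can be illustrated on a toy quadratic task (curvature matrix, targets, and initialization are arbitrary choices; the inner loss is $\frac{1}{2}\theta^T A \theta$ and the query loss is $\frac{1}{2}\|\phi - c\|^2$): the relative gap between the full and first-order meta-gradients grows with the inner learning rate $\alpha$.

```python
import numpy as np

A = np.array([[3.0, 1.0], [1.0, 2.0]])    # Hessian (curvature) of the inner loss
c = np.array([1.0, -1.0])                 # query-loss target
theta = np.array([0.5, 0.5])

gaps = []
for alpha in [0.01, 0.1, 0.3]:
    phi = theta - alpha * (A @ theta)               # one inner gradient step
    full = (np.eye(2) - alpha * A) @ (phi - c)      # full MAML meta-gradient
    first = phi - c                                 # FOMAML: Hessian term dropped
    gaps.append(np.linalg.norm(full - first) / np.linalg.norm(full))

print([round(g, 3) for g in gaps])  # gap grows with the inner learning rate
```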

Exercises

ExerciseCore

Problem

In a 5-way 1-shot classification problem using prototypical networks, each prototype is a single embedded example. The query point $x^*$ has embedding $f_\theta(x^*) = [1, 0]$ and the five prototypes are $\mu_1 = [0.8, 0.2]$, $\mu_2 = [3.0, 1.0]$, $\mu_3 = [-1.0, 0.5]$, $\mu_4 = [0.5, -0.5]$, $\mu_5 = [2.0, 2.0]$. Which class is predicted using squared Euclidean distance?

ExerciseAdvanced

Problem

Derive the FOMAML meta-gradient by dropping the Hessian term from the full MAML meta-gradient. Under what conditions does the dropped term vanish exactly (making FOMAML exact)?

References

Canonical:

  • Finn, Abbeel, Levine, "Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks", ICML 2017
  • Snell, Swersky, Zemel, "Prototypical Networks for Few-shot Learning", NeurIPS 2017

Current:

  • Nichol, Achiam, Schulman, "On First-Order Meta-Learning Algorithms" (Reptile), 2018
  • Garg et al., "What Can Transformers Learn In-Context? A Case Study of Simple Function Classes", NeurIPS 2022
  • Hospedales et al., "Meta-Learning in Neural Networks: A Survey", IEEE TPAMI 2022

Last reviewed: April 2026
