
LLM Construction

Prompt Engineering and In-Context Learning

In-context learning allows LLMs to adapt to new tasks from examples in the prompt without weight updates. Covers theories for why it works, prompting strategies, and why prompt engineering amounts to configuring inference-time computation.


Why This Matters

Large language models can solve tasks they were never explicitly trained on, simply by conditioning on examples in the prompt. This is in-context learning (ICL). No gradient updates, no fine-tuning. The model reads a few input-output pairs and generalizes.

This capability is surprising. A frozen model with fixed weights should not be able to "learn" at inference time. Understanding ICL is necessary for understanding why prompting works and when it will fail.

Mental Model

Think of a pretrained LLM as a system that has already seen millions of tasks during pretraining. Each task appeared as a sequence: some context followed by a completion. When you provide few-shot examples in a prompt, you are not teaching the model a new skill. You are helping it locate a skill it already has, by giving it enough context to identify which task distribution you want it to perform.

Core Definitions

Definition

In-Context Learning

Given a pretrained language model $p_\theta$ with fixed parameters $\theta$, in-context learning is the ability to produce correct outputs for a task by conditioning on a prompt containing $k$ input-output demonstrations $(x_1, y_1), \ldots, (x_k, y_k)$ followed by a query $x_{k+1}$, without any parameter update:

$$\hat{y}_{k+1} = \arg\max_y \; p_\theta(y \mid x_1, y_1, \ldots, x_k, y_k, x_{k+1})$$

Definition

Few-Shot, One-Shot, Zero-Shot

$k$-shot prompting provides $k$ demonstrations before the query. Zero-shot provides only a task instruction. One-shot provides exactly one example. Performance typically improves with $k$ up to a saturation point determined by context window size and task complexity.
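The assembly of a $k$-shot prompt is just string concatenation of demonstrations followed by the query. A minimal sketch; the helper name, field labels, and demonstrations are illustrative, not a fixed convention:

```python
# Sketch: assembling a k-shot prompt from (input, output) demonstrations,
# then the query. Labels like "Input:"/"Output:" are one common choice.

def build_prompt(demos, query, instruction=""):
    """Concatenate k demonstrations, then the query with an empty output slot."""
    parts = [instruction] if instruction else []
    for x, y in demos:
        parts.append(f"Input: {x}\nOutput: {y}")
    parts.append(f"Input: {query}\nOutput:")  # model completes after "Output:"
    return "\n\n".join(parts)

demos = [("2 + 2", "4"), ("7 + 5", "12")]      # k = 2 demonstrations
prompt = build_prompt(demos, "3 + 9", instruction="Add the numbers.")
print(prompt)
```

The trailing `Output:` with no value is what turns the task into next-token prediction: the model's continuation is the answer.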

Definition

Chain-of-Thought Prompting

Chain-of-thought (CoT) prompting includes intermediate reasoning steps in the demonstrations. Instead of $(x_i, y_i)$, each example has the form $(x_i, r_i, y_i)$ where $r_i$ is a reasoning trace. This improves performance on tasks requiring multi-step reasoning: arithmetic, logic, and word problems.
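The $(x_i, r_i, y_i)$ triple can be made concrete as a formatting function. A sketch; the `Q:`/`Reasoning:`/`A:` labels and the example text are illustrative choices, not the canonical CoT format:

```python
# Sketch: a chain-of-thought demonstration is a triple (x, r, y),
# where r is the intermediate reasoning trace the model conditions on.

def format_cot_example(x, r, y):
    return f"Q: {x}\nReasoning: {r}\nA: {y}"

example = format_cot_example(
    "A farm has 3 pens with 4 hens each. How many hens?",
    "3 pens times 4 hens per pen is 12 hens.",
    "12",
)
print(example)
```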

Main Theorems

Theorem

ICL as Implicit Gradient Descent

Statement

For a single-layer linear self-attention transformer trained on linear regression tasks, the forward pass on $k$ in-context examples implements an algorithm equivalent to one step of gradient descent on the least-squares loss over the demonstrations. Concretely, if the demonstrations define a regression problem $y_i = w^T x_i$, the transformer output approximates:

$$\hat{w} = W_0 - \eta \sum_{i=1}^{k} (W_0^T x_i - y_i) x_i^T$$

where $W_0$ and $\eta$ are implicitly determined by the trained attention weights.
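The update in the theorem is an ordinary gradient step and can be written out numerically. A minimal sketch, assuming synthetic data, $W_0 = 0$, and an arbitrary small learning rate; all names and values are illustrative:

```python
import numpy as np

# Sketch: one explicit gradient-descent step on the in-context least-squares
# loss, i.e. the update the theorem says linear attention reproduces.
rng = np.random.default_rng(0)
d, k = 4, 8
w_true = rng.normal(size=d)
X = rng.normal(size=(k, d))
y = X @ w_true                       # demonstrations: y_i = w^T x_i

W0 = np.zeros(d)                     # initialization (implicit in the weights)
eta = 0.01                           # learning rate (implicit in the weights)

grad = X.T @ (X @ W0 - y)            # gradient of 0.5 * sum_i (W0^T x_i - y_i)^2
w_hat = W0 - eta * grad              # the update from the theorem

# Sanity check: one step reduces the least-squares loss on the demonstrations.
loss = lambda w: 0.5 * np.sum((X @ w - y) ** 2)
assert loss(w_hat) < loss(W0)
```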

Intuition

The attention mechanism computes a weighted sum over past tokens. For linear attention, this weighted sum has the same algebraic form as a gradient descent update on a least-squares objective. The model does not literally run gradient descent. Its forward pass produces the same result.

Proof Sketch

Write out the single-layer attention operation: $\text{output} = W_V X \cdot \text{softmax}(X^T W_K^T W_Q x_{\text{query}})$. For the linear case, the softmax reduces to a linear operation. Expand and rearrange to match the form of a single gradient step on $\sum_i \|w^T x_i - y_i\|^2$. Akyurek et al. (2023) and von Oswald et al. (2023) provide the full derivation.
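For the simplest instance of the proof sketch ($W_0 = 0$, keys and queries set to the $x_i$, values set to the $y_i$, no softmax), the equivalence can be checked numerically: the one-step gradient-descent predictor on the query equals $\eta$ times the linear-attention readout. A sketch under exactly those simplifying assumptions:

```python
import numpy as np

# Numerical check: with W0 = 0, the one-step GD predictor w_hat^T x_q equals
# eta * sum_i y_i <x_i, x_q>, which is a linear-attention readout with
# keys = queries = x_i and values = y_i. Setup is illustrative.
rng = np.random.default_rng(1)
d, k = 3, 5
X = rng.normal(size=(k, d))          # in-context inputs x_1..x_k
y = rng.normal(size=k)               # in-context targets y_1..y_k
x_q = rng.normal(size=d)             # query x_{k+1}
eta = 0.05

# One GD step from W0 = 0 on 0.5 * sum_i (w^T x_i - y_i)^2:
w_hat = eta * X.T @ y                # = 0 - eta * sum_i (0 - y_i) x_i

# Linear attention readout over the in-context tokens (no softmax):
attn_out = sum(y[i] * (X[i] @ x_q) for i in range(k))

assert np.isclose(w_hat @ x_q, eta * attn_out)
```

The full derivations relax these assumptions (learned projections, nonzero $W_0$); this sketch only verifies the algebraic identity at the heart of the argument.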

Why It Matters

This gives a mechanistic explanation for ICL: transformers can implement learning algorithms in their forward pass. The trained weights encode an optimizer, not just a function. This connects ICL to meta-learning: the pretraining phase trains the optimizer, and the prompt runs it.

Failure Mode

This result is proven for single-layer linear attention on linear regression tasks. Multi-layer nonlinear transformers on natural language tasks are far more complex. The gradient descent analogy may not extend cleanly to realistic settings. Empirical evidence suggests multi-layer transformers implement more sophisticated algorithms than single gradient steps.

Theories for Why ICL Works

Three main theories have been proposed; they are not mutually exclusive:

1. Bayesian inference in the forward pass. Xie et al. (2022) model pretraining data as generated by a mixture of latent concepts. The prompt examples specify a concept (task), and the transformer performs approximate Bayesian inference to identify which concept generated the data. Under this view, ICL is posterior predictive inference, not learning.

2. Implicit gradient descent. As formalized above. The transformer forward pass implements optimization steps on the in-context examples. This view is strongest for simple architectures and tasks.

3. Task location in the pretraining distribution. The model has seen similar tasks during pretraining. The prompt helps the model identify which pretraining task distribution to emulate. Under this view, ICL performance is bounded by the diversity of the pretraining data.

Prompt Engineering is Inference-Time Configuration

Prompt engineering is not "asking the model nicely." It is configuring the input to a function to control its output. Specific techniques:

System prompts set the behavioral frame. They persist across turns and constrain the model's output distribution. A system prompt saying "You are a JSON API" does not change the model. It changes the conditional distribution $p(y \mid \text{system}, x)$.

Structured output formatting. Providing a schema or example output format constrains the model to produce parseable outputs. This works because the autoregressive generation process conditions on all previously generated tokens, including formatting tokens.

Role assignment. Saying "You are an expert in X" activates different regions of the pretraining distribution. This is not anthropomorphism. It is conditional generation from a different part of the learned distribution.
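The three techniques above are all just constructions of the conditioning context. A sketch using the common role/content chat-message convention; no API call is made, and the schema, role text, and review are illustrative:

```python
import json

# Sketch: system prompt (behavioral frame), role assignment, and a schema
# for structured output, all encoded as conditioning text. The role/content
# message format follows the common chat-completion convention.
schema = {"sentiment": "Positive | Negative | Mixed", "confidence": "number"}

messages = [
    # System prompt + role assignment: changes p(y | system, x), not the model.
    {"role": "system",
     "content": "You are an expert film critic. Reply only with JSON "
                "matching this schema: " + json.dumps(schema)},
    # User turn: the query x.
    {"role": "user",
     "content": "Review: The acting was superb but the plot dragged."},
]
print(json.dumps(messages, indent=2))
```

Showing the schema in the prompt works because generation conditions on those formatting tokens; it narrows the output distribution toward parseable completions.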

Common Confusions

Watch Out

ICL is not fine-tuning

In fine-tuning, the model weights $\theta$ change. In ICL, the weights are frozen. The model's behavior changes because the input changes, not the model. This distinction matters: ICL cannot permanently alter the model, and its effects disappear when the prompt changes.

Watch Out

More examples do not always help

ICL performance can degrade with too many examples. This happens when examples overflow the effective context window, when examples are noisy or contradictory, or when the model attends to surface patterns (label frequency) rather than input-output mappings. Min et al. (2022) showed that for some tasks, the format of demonstrations matters more than the correctness of the labels.

Watch Out

Chain-of-thought helps reasoning, not retrieval

CoT improves tasks requiring multi-step computation (math, logic) but provides little benefit for tasks that are primarily recall or pattern matching. CoT works by giving the model intermediate tokens to condition on, effectively expanding the computation budget. It does not improve the model's knowledge base.

Canonical Examples

Example

Few-shot sentiment classification

Prompt: "Review: Great film. Sentiment: Positive. Review: Terrible waste of time. Sentiment: Negative. Review: The acting was superb but the plot dragged. Sentiment:" The model outputs "Positive" or "Mixed" depending on how it weighs the demonstrations. The few-shot examples define the label space and the mapping from surface features to labels.
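The prompt in this example can be assembled programmatically, which makes explicit how the demonstrations fix both the label space and the output format. A sketch reproducing the prompt above:

```python
# Assemble the few-shot sentiment prompt from the example. The demonstrations
# define the label space ({Positive, Negative}) and the surface format.
demos = [
    ("Great film.", "Positive"),
    ("Terrible waste of time.", "Negative"),
]
query = "The acting was superb but the plot dragged."

prompt = " ".join(
    f"Review: {x} Sentiment: {y}." for x, y in demos
) + f" Review: {query} Sentiment:"
print(prompt)
```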

Summary

  • ICL lets frozen models adapt to tasks at inference time via prompt conditioning
  • Theoretical explanations include Bayesian inference, implicit gradient descent, and task location
  • Few-shot examples help the model identify the task, not learn it from scratch
  • Chain-of-thought prompting expands the model's effective computation budget
  • Prompt engineering is configuring $p(y \mid \text{prompt}, x)$, not persuasion
  • ICL performance is bounded by what the model learned during pretraining

Exercises

ExerciseCore

Problem

A model achieves 85% accuracy on a classification task with 5-shot prompting. You add 5 more examples (10-shot) and accuracy drops to 78%. Propose two hypotheses for why this happened.

ExerciseAdvanced

Problem

The ICL-as-gradient-descent result holds for single-layer linear attention. Name two specific ways in which a multi-layer transformer with softmax attention could implement a more powerful learning algorithm than a single gradient step.

References

Canonical:

  • Brown et al., "Language Models are Few-Shot Learners" (GPT-3), NeurIPS 2020; arXiv:2005.14165. Sections 1-3 introduce the few-shot framing.
  • Wei et al., "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models", NeurIPS 2022; arXiv:2201.11903. Sections 2-4.

Theoretical analyses of ICL (scope limited; results are architecture- and task-specific):

  • Akyurek, Schuurmans, Andreas, Ma, Zhou, "What Learning Algorithm Is In-Context Learning? Investigations with Linear Models", ICLR 2023; arXiv:2211.15661.
  • von Oswald, Niklasson, Randazzo, Sacramento, Mordvintsev, Zhmoginov, Vladymyrov, "Transformers Learn In-Context by Gradient Descent", ICML 2023; arXiv:2212.07677.
  • Xie, Raghunathan, Liang, Ma, "An Explanation of In-Context Learning as Implicit Bayesian Inference", ICLR 2022; arXiv:2111.02080.
  • Bai, Chen, Wang, Xiong, Mei, "Transformers as Statisticians: Provable In-Context Learning with In-Context Algorithm Selection", NeurIPS 2023; arXiv:2306.04637. Multi-step and algorithm-selection extensions.

Empirical surprises:

  • Min, Lyu, Holtzman, Artetxe, Lewis, Hajishirzi, Zettlemoyer, "Rethinking the Role of Demonstrations: What Makes In-Context Learning Work?", EMNLP 2022; arXiv:2202.12837. Correct labels matter less than format for several tasks.
  • Lu, Bartolo, Moore, Riedel, Stenetorp, "Fantastically Ordered Prompts and Where to Find Them", ACL 2022; arXiv:2104.08786. ICL performance is highly sensitive to demonstration order.


Last reviewed: April 2026
