
LLM Construction

Prompt Engineering and In-Context Learning

In-context learning allows LLMs to adapt to new tasks from examples in the prompt without weight updates. Covers theories for why it works, prompting strategies, and why prompt engineering amounts to configuring inference-time computation.


Why This Matters

Large language models can solve tasks they were never explicitly trained on, simply by conditioning on examples in the prompt. This is in-context learning (ICL). No gradient updates, no fine-tuning. The model reads a few input-output pairs and generalizes.

This capability is surprising. A frozen model with fixed weights should not be able to "learn" at inference time. Understanding ICL is necessary for understanding why prompting works and when it will fail.

Mental Model

Think of a pretrained LLM as a system that has already seen millions of tasks during pretraining. Each task appeared as a sequence: some context followed by a completion. When you provide few-shot examples in a prompt, you are not teaching the model a new skill. You are helping it locate a skill it already has, by giving it enough context to identify which task distribution you want it to perform.

Core Definitions

Definition

In-Context Learning

Given a pretrained language model $p_\theta$ with fixed parameters $\theta$, in-context learning is the ability to produce correct outputs for a task by conditioning on a prompt containing $k$ input-output demonstrations $(x_1, y_1), \ldots, (x_k, y_k)$ followed by a query $x_{k+1}$, without any parameter update:

$$\hat{y}_{k+1} = \arg\max_y \; p_\theta(y \mid x_1, y_1, \ldots, x_k, y_k, x_{k+1})$$

Definition

Few-Shot, One-Shot, Zero-Shot

$k$-shot prompting provides $k$ demonstrations before the query. Zero-shot provides only a task instruction. One-shot provides exactly one example. Performance typically improves with $k$ up to a saturation point determined by context window size and task complexity.
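The assembly of a $k$-shot prompt is just string concatenation of demonstrations followed by the query. A minimal sketch; the helper name, field labels, and demonstrations are illustrative, not a fixed convention:

```python
# Sketch: assembling a k-shot prompt from (input, output) demonstrations,
# then the query. Labels like "Input:"/"Output:" are one common choice.

def build_prompt(demos, query, instruction=""):
    """Concatenate k demonstrations, then the query with an empty output slot."""
    parts = [instruction] if instruction else []
    for x, y in demos:
        parts.append(f"Input: {x}\nOutput: {y}")
    parts.append(f"Input: {query}\nOutput:")  # model completes after "Output:"
    return "\n\n".join(parts)

demos = [("2 + 2", "4"), ("7 + 5", "12")]      # k = 2 demonstrations
prompt = build_prompt(demos, "3 + 9", instruction="Add the numbers.")
print(prompt)
```

The trailing `Output:` with no value is what turns the task into next-token prediction: the model's continuation is the answer.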

Definition

Chain-of-Thought Prompting

Chain-of-thought (CoT) prompting includes intermediate reasoning steps in the demonstrations. Instead of $(x_i, y_i)$, each example has the form $(x_i, r_i, y_i)$ where $r_i$ is a reasoning trace. This improves performance on tasks requiring multi-step reasoning: arithmetic, logic, and word problems.
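The $(x_i, r_i, y_i)$ triple can be made concrete as a formatting function. A sketch; the `Q:`/`Reasoning:`/`A:` labels and the example text are illustrative choices, not the canonical CoT format:

```python
# Sketch: a chain-of-thought demonstration is a triple (x, r, y),
# where r is the intermediate reasoning trace the model conditions on.

def format_cot_example(x, r, y):
    return f"Q: {x}\nReasoning: {r}\nA: {y}"

example = format_cot_example(
    "A farm has 3 pens with 4 hens each. How many hens?",
    "3 pens times 4 hens per pen is 12 hens.",
    "12",
)
print(example)
```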

Main Theorems

Theorem

ICL as Implicit Gradient Descent

Statement

For a single-layer linear self-attention transformer trained on linear regression tasks, the forward pass on $k$ in-context examples implements an algorithm equivalent to one step of gradient descent on the least-squares loss over the demonstrations. Concretely, if the demonstrations define a regression problem $y_i = w^T x_i$, the transformer output approximates:

$$\hat{w} = W_0 - \eta \sum_{i=1}^{k} (W_0^T x_i - y_i) x_i^T$$

where $W_0$ and $\eta$ are implicitly determined by the trained attention weights.
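The update in the theorem is an ordinary gradient step and can be written out numerically. A minimal sketch, assuming synthetic data, $W_0 = 0$, and an arbitrary small learning rate; all names and values are illustrative:

```python
import numpy as np

# Sketch: one explicit gradient-descent step on the in-context least-squares
# loss, i.e. the update the theorem says linear attention reproduces.
rng = np.random.default_rng(0)
d, k = 4, 8
w_true = rng.normal(size=d)
X = rng.normal(size=(k, d))
y = X @ w_true                       # demonstrations: y_i = w^T x_i

W0 = np.zeros(d)                     # initialization (implicit in the weights)
eta = 0.01                           # learning rate (implicit in the weights)

grad = X.T @ (X @ W0 - y)            # gradient of 0.5 * sum_i (W0^T x_i - y_i)^2
w_hat = W0 - eta * grad              # the update from the theorem

# Sanity check: one step reduces the least-squares loss on the demonstrations.
loss = lambda w: 0.5 * np.sum((X @ w - y) ** 2)
assert loss(w_hat) < loss(W0)
```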

Intuition

The attention mechanism computes a weighted sum over past tokens. For linear attention, this weighted sum has the same algebraic form as a gradient descent update on a least-squares objective. The model does not literally run gradient descent. Its forward pass produces the same result.

Proof Sketch

Write out the single-layer attention operation: $\text{output} = W_V X \cdot \text{softmax}(X^T W_K^T W_Q x_{\text{query}})$. For the linear case, the softmax reduces to a linear operation. Expand and rearrange to match the form of a single gradient step on $\sum_i \|w^T x_i - y_i\|^2$. Akyurek et al. (2023) and von Oswald et al. (2023) provide the full derivation.
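For the simplest instance of the proof sketch ($W_0 = 0$, keys and queries set to the $x_i$, values set to the $y_i$, no softmax), the equivalence can be checked numerically: the one-step gradient-descent predictor on the query equals $\eta$ times the linear-attention readout. A sketch under exactly those simplifying assumptions:

```python
import numpy as np

# Numerical check: with W0 = 0, the one-step GD predictor w_hat^T x_q equals
# eta * sum_i y_i <x_i, x_q>, which is a linear-attention readout with
# keys = queries = x_i and values = y_i. Setup is illustrative.
rng = np.random.default_rng(1)
d, k = 3, 5
X = rng.normal(size=(k, d))          # in-context inputs x_1..x_k
y = rng.normal(size=k)               # in-context targets y_1..y_k
x_q = rng.normal(size=d)             # query x_{k+1}
eta = 0.05

# One GD step from W0 = 0 on 0.5 * sum_i (w^T x_i - y_i)^2:
w_hat = eta * X.T @ y                # = 0 - eta * sum_i (0 - y_i) x_i

# Linear attention readout over the in-context tokens (no softmax):
attn_out = sum(y[i] * (X[i] @ x_q) for i in range(k))

assert np.isclose(w_hat @ x_q, eta * attn_out)
```

The full derivations relax these assumptions (learned projections, nonzero $W_0$); this sketch only verifies the algebraic identity at the heart of the argument.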

Why It Matters

This gives a mechanistic explanation for ICL: transformers can implement learning algorithms in their forward pass. The trained weights encode an optimizer, not just a function. This connects ICL to meta-learning: the pretraining phase trains the optimizer, and the prompt runs it.

Failure Mode

This result is proven for single-layer linear attention on linear regression tasks. Multi-layer nonlinear transformers on natural language tasks are far more complex. The gradient descent analogy may not extend cleanly to realistic settings. Empirical evidence suggests multi-layer transformers implement more sophisticated algorithms than single gradient steps.

Theories for Why ICL Works

Three main theories have been proposed; they are not mutually exclusive:

1. Bayesian inference in the forward pass. Xie et al. (2022) model pretraining data as generated by a mixture of latent concepts. The prompt examples specify a concept (task), and the transformer performs approximate Bayesian inference to identify which concept generated the data. Under this view, ICL is posterior predictive inference, not learning.

2. Implicit gradient descent. As formalized above. The transformer forward pass implements optimization steps on the in-context examples. This view is strongest for simple architectures and tasks.

3. Task location in the pretraining distribution. The model has seen similar tasks during pretraining. The prompt helps the model identify which pretraining task distribution to emulate. Under this view, ICL performance is bounded by the diversity of the pretraining data.

Prompt Engineering is Inference-Time Configuration

Prompt engineering is not "asking the model nicely." It is configuring the input to a function to control its output. Specific techniques:

System prompts set the behavioral frame. They persist across turns and constrain the model's output distribution. A system prompt saying "You are a JSON API" does not change the model. It changes the conditional distribution $p(y \mid \text{system}, x)$.

Structured output formatting. Providing a schema or example output format constrains the model to produce parseable outputs. This works because the autoregressive generation process conditions on all previously generated tokens, including formatting tokens.

Role assignment. Saying "You are an expert in X" activates different regions of the pretraining distribution. This is not anthropomorphism. It is conditional generation from a different part of the learned distribution.
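The three techniques above are all just constructions of the conditioning context. A sketch using the common role/content chat-message convention; no API call is made, and the schema, role text, and review are illustrative:

```python
import json

# Sketch: system prompt (behavioral frame), role assignment, and a schema
# for structured output, all encoded as conditioning text. The role/content
# message format follows the common chat-completion convention.
schema = {"sentiment": "Positive | Negative | Mixed", "confidence": "number"}

messages = [
    # System prompt + role assignment: changes p(y | system, x), not the model.
    {"role": "system",
     "content": "You are an expert film critic. Reply only with JSON "
                "matching this schema: " + json.dumps(schema)},
    # User turn: the query x.
    {"role": "user",
     "content": "Review: The acting was superb but the plot dragged."},
]
print(json.dumps(messages, indent=2))
```

Showing the schema in the prompt works because generation conditions on those formatting tokens; it narrows the output distribution toward parseable completions.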

Common Confusions

Watch Out

ICL is not fine-tuning

In fine-tuning, the model weights $\theta$ change. In ICL, the weights are frozen. The model's behavior changes because the input changes, not the model. This distinction matters: ICL cannot permanently alter the model, and its effects disappear when the prompt changes.

Watch Out

More examples do not always help

ICL performance can degrade with too many examples. This happens when examples overflow the effective context window, when examples are noisy or contradictory, or when the model attends to surface patterns (label frequency) rather than input-output mappings. Min et al. (2022) showed that for some tasks, the format of demonstrations matters more than the correctness of the labels.

Watch Out

Chain-of-thought helps reasoning, not retrieval

CoT improves tasks requiring multi-step computation (math, logic) but provides little benefit for tasks that are primarily recall or pattern matching. CoT works by giving the model intermediate tokens to condition on, effectively expanding the computation budget. It does not improve the model's knowledge base.

Canonical Examples

Example

Few-shot sentiment classification

Prompt: "Review: Great film. Sentiment: Positive. Review: Terrible waste of time. Sentiment: Negative. Review: The acting was superb but the plot dragged. Sentiment:" The model outputs "Positive" or "Mixed" depending on how it weighs the demonstrations. The few-shot examples define the label space and the mapping from surface features to labels.
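The prompt in this example can be assembled programmatically, which makes explicit how the demonstrations fix both the label space and the output format. A sketch reproducing the prompt above:

```python
# Assemble the few-shot sentiment prompt from the example. The demonstrations
# define the label space ({Positive, Negative}) and the surface format.
demos = [
    ("Great film.", "Positive"),
    ("Terrible waste of time.", "Negative"),
]
query = "The acting was superb but the plot dragged."

prompt = " ".join(
    f"Review: {x} Sentiment: {y}." for x, y in demos
) + f" Review: {query} Sentiment:"
print(prompt)
```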

Summary

  • ICL lets frozen models adapt to tasks at inference time via prompt conditioning
  • Theoretical explanations include Bayesian inference, implicit gradient descent, and task location
  • Few-shot examples help the model identify the task, not learn it from scratch
  • Chain-of-thought prompting expands the model's effective computation budget
  • Prompt engineering is configuring $p(y \mid \text{prompt}, x)$, not persuasion
  • ICL performance is bounded by what the model learned during pretraining

Exercises

ExerciseCore

Problem

A model achieves 85% accuracy on a classification task with 5-shot prompting. You add 5 more examples (10-shot) and accuracy drops to 78%. Propose two hypotheses for why this happened.

ExerciseAdvanced

Problem

The ICL-as-gradient-descent result holds for single-layer linear attention. Name two specific ways in which a multi-layer transformer with softmax attention could implement a more powerful learning algorithm than a single gradient step.

References

Canonical:

  • Brown et al., "Language Models are Few-Shot Learners" (GPT-3), NeurIPS 2020; arXiv:2005.14165. Sections 1-3 introduce the few-shot framing.
  • Wei et al., "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models", NeurIPS 2022; arXiv:2201.11903. Sections 2-4.

Theoretical analyses of ICL (scope limited; results are architecture- and task-specific):

  • Akyurek, Schuurmans, Andreas, Ma, Zhou, "What Learning Algorithm Is In-Context Learning? Investigations with Linear Models", ICLR 2023; arXiv:2211.15661.
  • von Oswald, Niklasson, Randazzo, Sacramento, Mordvintsev, Zhmoginov, Vladymyrov, "Transformers Learn In-Context by Gradient Descent", ICML 2023; arXiv:2212.07677.
  • Xie, Raghunathan, Liang, Ma, "An Explanation of In-Context Learning as Implicit Bayesian Inference", ICLR 2022; arXiv:2111.02080.
  • Bai, Chen, Wang, Xiong, Mei, "Transformers as Statisticians: Provable In-Context Learning with In-Context Algorithm Selection", NeurIPS 2023; arXiv:2306.04637. Multi-step and algorithm-selection extensions.

Empirical surprises:

  • Min, Lyu, Holtzman, Artetxe, Lewis, Hajishirzi, Zettlemoyer, "Rethinking the Role of Demonstrations: What Makes In-Context Learning Work?", EMNLP 2022; arXiv:2202.12837. Correct labels matter less than format for several tasks.
  • Lu, Bartolo, Moore, Riedel, Stenetorp, "Fantastically Ordered Prompts and Where to Find Them", ACL 2022; arXiv:2104.08786. ICL performance is highly sensitive to demonstration order.


Last reviewed: April 2026
