Context Engineering

Sneiderman, Robby

LLM Construction

Context Engineering

The discipline of building, routing, compressing, retrieving, and persisting context for LLMs: beyond prompt design into systems engineering for what the model sees.

AdvancedTier 2FrontierFrontier watch~55 min

Prerequisites

KV Cache Attention Mechanism Theory Agent Protocols Mcp A2a Attention Sinks and Retrieval Decay

Start 8-question practice · 3 available Prereq Map

Learning position

Read this page in the graph.

llm-construction | layer 5 | tier 2. This page has 10 direct prerequisites and 2 published dependents.

Open Atlas Prerequisites Leads to

What next

Speculative Decoding and Quantization

This is the first curated or graph-derived continuation from the current page.

Evidence badge

Claim status

This page has no public Lean mapping yet. Use the evidence page to inspect how claim status labels work.

Show the backing system

AtlasOpen the full prerequisite graph and run grounding traces.EvidenceInspect source support, claim labels, and public trust status.LeanReview the checked declaration list, scopes, and axiom profile.

Why This Matters

Every production LLM system is a context engineering system. The model itself is frozen at inference time. The only lever you have is what you put in the context window. System prompts, few-shot examples, retrieved documents, tool outputs, conversation history, structured metadata: the arrangement and selection of these components determines whether the model gives a useful answer or hallucinates confidently.

Prompt engineering is a subset of context engineering. Prompt engineering asks "how do I phrase my question?" Context engineering asks "what information should the model see, in what order, in what format, and how do I get it there efficiently?"

Mental Model

Think of the context window as a fixed-size workbench. The model can only reason about what is on the workbench. Context engineering is the discipline of deciding what goes on the workbench, how it is organized, and what gets removed when space runs out.

The workbench has a hard size limit (the context window length), but the effective capacity is much smaller than the nominal limit due to attention degradation over long sequences.

Current Checkpoint

As of May 2026, context engineering is no longer just a production nickname for prompt templates. The research framing has converged on three separable layers:

Context retrieval and generation: search, tool calls, query rewriting, synthetic context, and multi-hop retrieval.
Context processing: ranking, compression, deduplication, summarization, citation packing, and conflict resolution.
Context management: memory, state, permissions, freshness, and routing across agents or tools.

The important product lesson is that context engineering has become a control surface. A system that retrieves the right evidence but packs it badly can still fail. A system that has a long context window but no memory policy will be expensive and brittle. A system that stores memories without permission and freshness metadata will eventually inject stale or unsafe state.

Build It This Way by Default

Treat every context item as a typed record, not a loose text blob. Track: source, timestamp, permission boundary, retrieval score, intended use, compression status, and whether it is evidence, instruction, tool output, or user memory. This makes debugging and audit possible when the model gives a wrong answer.

Core Concepts

Definition

Context Window

The context window is the maximum number of tokens a model can process in a single forward pass. For modern LLMs this ranges from 4K to 1M+ tokens. The context window includes all input: system prompt, conversation history, retrieved documents, and the current query.

Definition

Context Engineering

Context engineering is the practice of designing systems that construct, select, compress, route, and manage the content of the context window to maximize task performance. It encompasses retrieval-augmented generation (RAG), prompt design, memory systems, tool-use orchestration, and context compression.

Definition

Retrieval-Augmented Generation (RAG)

RAG retrieves relevant documents from an external knowledge base and injects them into the context window before generation. This decouples the model's knowledge from its parameters. The model reasons over retrieved evidence rather than relying on memorized training data.

The Components of Context

A production context window typically contains these layers, in order:

System prompt: persistent instructions defining the model's role, constraints, and output format
Retrieved documents: evidence fetched via semantic search, keyword search, or structured queries
Tool outputs: results from function calls, API responses, database queries
Conversation history: prior turns, potentially summarized or truncated
Current query: the user's immediate request

The ordering matters. Attention is not uniform across the context.

Main Theorems

Proposition

Lost in the Middle

Statement

When relevant information is placed in the middle of a long context, LLM performance degrades significantly compared to when the same information appears at the beginning or end. Performance follows a U-shaped curve as a function of the position of the relevant document.

Intuition

Decoder-only transformers attend most strongly to the first tokens (due to the attention sink phenomenon) and the most recent tokens (due to recency in causal attention). Information in the middle competes with more tokens for attention weight and is effectively "lost". The model processes it but fails to use it for downstream reasoning.

Why It Matters

This result undermines the naive strategy of "just stuff everything into a long context window." Placement and ordering of information matter as much as its presence. Context engineering must account for positional attention bias when constructing the context.

Failure Mode

The effect varies across models and is mitigated by some architectures (e.g., models trained with position-aware objectives). Do not assume all models exhibit identical U-shaped degradation. test your specific model.

report a correction →

RAG vs. Long Context

Watch Out

Long context does not replace retrieval

With 1M-token context windows, a common reaction is "just put everything in the context. no need for RAG." This is wrong for multiple reasons:

Lost in the middle: the model cannot attend equally to all positions
Cost: attention is $O(n^2)$ or $O(n \log n)$ ; longer context is linearly or super-linearly more expensive per query
Noise dilution: irrelevant documents in context degrade performance even if the relevant ones are present
Latency: time-to-first-token scales with context length

RAG with a shorter, focused context typically outperforms raw long-context stuffing on knowledge-intensive tasks.

Chunking Strategies

How you split documents for retrieval affects what the model sees.

Fixed-size chunking: Split every $k$ tokens. Simple but breaks semantic boundaries. Overlapping windows (stride $< k$ ) partially mitigate this.

Semantic chunking: Split at paragraph or section boundaries. Preserves meaning but produces variable-length chunks.

Hierarchical chunking: Store documents at multiple granularities (sentence, paragraph, section). Retrieve at the granularity appropriate to the query.

Agentic chunking: Let a model decide chunk boundaries based on content structure. More expensive, more accurate.

The right strategy depends on the retrieval model, the document type, and the downstream task. There is no universal best chunking method.

Context Compression

When the context window fills up, you need compression strategies:

Summarization: condense conversation history or retrieved documents
Selective retention: keep only the most relevant prior turns
KV-cache pruning: drop attention states for less important tokens
Embedding-based memory: store compressed representations of past context and retrieve them as needed

Each introduces information loss. The engineering question is: which information can you afford to lose?

Build It This Way by Default

For most production systems, structured context with retrieval beats raw long-context stuffing. Start with: (1) a well-designed system prompt, (2) RAG with semantic search over chunked documents, (3) conversation history limited to the last $k$ turns plus a running summary, (4) tool outputs injected at the point of use. Only move to full long-context stuffing when you have verified it outperforms this pipeline on your specific task.

Common Confusions

Watch Out

Context engineering is not prompt engineering

Prompt engineering optimizes the phrasing of the user-facing query. Context engineering designs the entire pipeline that determines what the model sees: retrieval systems, memory management, context assembly, compression, tool orchestration. A great prompt in a badly engineered context still fails.

Watch Out

More context is not always better

Adding marginally relevant documents to the context often hurts performance. The model attends to noise, gets confused by contradictory passages, or simply loses track of the relevant information among the clutter. Precision in retrieval matters more than recall.

Summary

The context window is the only interface between your system and the frozen model. everything the model knows at inference time comes through context
Positional bias in attention means placement order matters
RAG with focused retrieval typically outperforms naive long-context stuffing
Chunking strategy, compression policy, and context assembly order are engineering decisions with large impact on performance
Context engineering is a systems discipline, not a copywriting exercise

Exercises

ExerciseCore

Problem

A model has a 128K-token context window. Your system prompt uses 2K tokens, conversation history uses 8K tokens, and the user query is 500 tokens. You retrieve 10 document chunks of 1K tokens each. What fraction of the context window are you using, and where should you place the most relevant chunk?

ExerciseAdvanced

Problem

You have a RAG pipeline where each query retrieves the top-20 chunks from a corpus. You notice that increasing to top-50 decreases answer accuracy on your eval set. Explain why, and propose two mitigation strategies.

ExerciseResearch

Problem

Design a context compression scheme for multi-turn conversations that preserves the ability to reference specific earlier statements while keeping total context under a fixed budget $B$ tokens. What information-theoretic tradeoff are you making?

References

Canonical:

Liu et al., "Lost in the Middle: How Language Models Use Long Contexts," TACL 2024 (arXiv:2307.03172). The positional-attention study cited throughout this page; see also attention sinks and retrieval decay for the mechanism.
Lewis et al., "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks," NeurIPS 2020 (arXiv:2005.11401) — the original RAG paper.
Karpukhin et al., "Dense Passage Retrieval for Open-Domain Question Answering," EMNLP 2020 (arXiv:2004.04906). DPR, the canonical dense-retrieval reference.
Khattab & Zaharia, "ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT," SIGIR 2020 (arXiv:2004.12832). Late-interaction retrieval.

RAG ecosystem:

Mei et al., "A Survey of Context Engineering for Large Language Models" (2025, arXiv:2507.13334) — current taxonomy for retrieval/generation, processing, management, and system-level integrations.
Gao et al., "Retrieval-Augmented Generation for Large Language Models: A Survey" (2023, arXiv:2312.10997) — taxonomy of naive, advanced, and modular RAG.
Gao et al., "Precise Zero-Shot Dense Retrieval without Relevance Labels" (HyDE, 2022, arXiv:2212.10496). Hypothetical document embedding.
Asai et al., "Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection," ICLR 2024 (arXiv:2310.11511)
Packer et al., "MemGPT: Towards LLMs as Operating Systems" (2023, arXiv:2310.08560). Virtual-memory abstraction over the context window.
Edge et al., "From Local to Global: A Graph RAG Approach to Query-Focused Summarization" (2024, arXiv:2404.16130). GraphRAG.
Anthropic, "Introducing Contextual Retrieval" (2024 blog, anthropic.com/news/contextual-retrieval) — chunk-level context prepending.

Current:

Anthropic, "Long context prompting tips" (2024)
Xu et al., "Retrieval meets Long Context Large Language Models," ICLR 2024 (arXiv:2310.03025)

Next Topics

The natural next steps from context engineering:

Speculative decoding and quantization: making inference fast enough to serve these context-heavy systems
Hallucination theory: understanding when context fails to prevent confabulation

Last reviewed: May 25, 2026

Canonical graph

Required before and derived from this topic

These links come from prerequisite edges in the curriculum graph. Editorial suggestions are shown here only when the target page also cites this page as a prerequisite.

Full prerequisite chain All derived topics

Required prerequisites

10

Semantic Search and Embeddingslayer 3 · tier 2
Attention Mechanism Theorylayer 4 · tier 2
Attention Sinks and Retrieval Decaylayer 4 · tier 2
Mamba and State-Space Modelslayer 4 · tier 2
Sparse Attention and Long Contextlayer 4 · tier 2

Derived topics

2

Memory Systems for LLMslayer 5 · tier 2
Multimodal RAGlayer 5 · tier 2

Graph-backed continuations

Memory Systems for LLMs Multimodal RAG