Beta. Content is under active construction and has not been peer-reviewed. Report errors on GitHub.

Context Engineering

The discipline of building, routing, compressing, retrieving, and persisting context for LLMs: beyond prompt design into systems engineering for what the model sees.

Advanced · Tier 2 · Frontier · ~55 min

Why This Matters

Every production LLM system is a context engineering system. The model itself is frozen at inference time. The only lever you have is what you put in the context window. System prompts, few-shot examples, retrieved documents, tool outputs, conversation history, structured metadata: the arrangement and selection of these components determines whether the model gives a useful answer or hallucinates confidently.

Prompt engineering is a subset of context engineering. Prompt engineering asks "how do I phrase my question?" Context engineering asks "what information should the model see, in what order, in what format, and how do I get it there efficiently?"

Mental Model

Think of the context window as a fixed-size workbench. The model can only reason about what is on the workbench. Context engineering is the discipline of deciding what goes on the workbench, how it is organized, and what gets removed when space runs out.

The workbench has a hard size limit (the context window length), but the effective capacity is much smaller than the nominal limit due to attention degradation over long sequences.

Core Concepts

Definition

Context Window

The context window is the maximum number of tokens a model can process in a single forward pass. For modern LLMs this ranges from 4K to 1M+ tokens. The context window includes all input: system prompt, conversation history, retrieved documents, and the current query.
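
Because the window counts every input token, a budget check is the first step of any context pipeline. A minimal sketch, assuming the rough ~4-characters-per-token heuristic for English text; a real system would count with the model's own tokenizer.

```python
# Rough token accounting for a context window.
# Assumes ~4 characters per token (a crude English-text heuristic);
# production code should use the model's actual tokenizer.

def estimate_tokens(text: str) -> int:
    """Crude token estimate: roughly 4 characters per token."""
    return max(1, len(text) // 4)

def fits_in_window(parts: list[str], window: int = 128_000) -> bool:
    """Check whether all context components fit in the window together."""
    return sum(estimate_tokens(p) for p in parts) <= window

system_prompt = "You are a helpful assistant." * 10
history = "user: hi\nassistant: hello\n" * 100
query = "Summarize our conversation."
print(fits_in_window([system_prompt, history, query]))  # True
```

The same check, run against a 400K-character document, would report that the content no longer fits a 128K-token window.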

Definition

Context Engineering

Context engineering is the practice of designing systems that construct, select, compress, route, and manage the content of the context window to maximize task performance. It encompasses retrieval-augmented generation (RAG), prompt design, memory systems, tool-use orchestration, and context compression.

Definition

Retrieval-Augmented Generation (RAG)

RAG retrieves relevant documents from an external knowledge base and injects them into the context window before generation. This decouples the model's knowledge from its parameters. The model reasons over retrieved evidence rather than relying on memorized training data.

The Components of Context

A production context window typically contains these layers, in order:

  1. System prompt: persistent instructions defining the model's role, constraints, and output format
  2. Retrieved documents: evidence fetched via semantic search, keyword search, or structured queries
  3. Tool outputs: results from function calls, API responses, database queries
  4. Conversation history: prior turns, potentially summarized or truncated
  5. Current query: the user's immediate request

The ordering matters. Attention is not uniform across the context.
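
The five layers above can be sketched as a single assembly function. The section labels and separators here are illustrative, not a standard format; the point is that assembly order is an explicit, controllable decision.

```python
# Assemble the five context layers in the order listed above.
# Labels and separators are illustrative conventions, not a standard.

def assemble_context(system: str, documents: list[str], tool_outputs: list[str],
                     history: list[str], query: str) -> str:
    """Concatenate context layers in a fixed order with labeled sections."""
    sections = [
        "## System\n" + system,
        "## Retrieved documents\n" + "\n---\n".join(documents),
        "## Tool outputs\n" + "\n".join(tool_outputs),
        "## Conversation\n" + "\n".join(history),
        "## Query\n" + query,
    ]
    return "\n\n".join(sections)

ctx = assemble_context(
    system="Answer using only the retrieved documents.",
    documents=["Doc A: ...", "Doc B: ..."],
    tool_outputs=["search() returned 2 hits"],
    history=["user: hello", "assistant: hi"],
    query="What does Doc A say?",
)
print(ctx.startswith("## System"))  # True
```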

Main Theorems

Proposition

Lost in the Middle

Statement

When relevant information is placed in the middle of a long context, LLM performance degrades significantly compared to when the same information appears at the beginning or end. Performance follows a U-shaped curve as a function of the position of the relevant document.

Intuition

Decoder-only transformers attend most strongly to the first tokens (due to the attention sink phenomenon) and the most recent tokens (due to recency in causal attention). Information in the middle competes with more tokens for attention weight and is effectively "lost". The model processes it but fails to use it for downstream reasoning.

Why It Matters

This result demolishes the naive strategy of "just dump everything into a long context window." Placement and ordering of information matter as much as its presence. Context engineering must account for positional attention bias when constructing the context.

Failure Mode

The effect varies across models and is mitigated by some architectures (e.g., models trained with position-aware objectives). Do not assume all models exhibit identical U-shaped degradation; test your specific model.
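
One common mitigation is to exploit the U-shaped curve directly: place the highest-ranked chunks at the edges of the context and the weakest in the middle. A heuristic sketch, not a guarantee for any particular model:

```python
# Reorder ranked chunks so the best land at the edges of the context,
# where attention is strongest, and the weakest land in the middle.
# This is a placement heuristic; verify the effect on your own model.

def edge_first_order(chunks_by_rank: list[str]) -> list[str]:
    """Interleave: rank 1 first, rank 2 last, rank 3 second, and so on."""
    front, back = [], []
    for i, chunk in enumerate(chunks_by_rank):
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]

print(edge_first_order(["r1", "r2", "r3", "r4", "r5"]))
# ['r1', 'r3', 'r5', 'r4', 'r2']  -- best at start and end, worst in the middle
```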

RAG vs. Long Context

Watch Out

Long context does not replace retrieval

With 1M-token context windows, a common reaction is "just put everything in the context; no need for RAG." This is wrong for multiple reasons:

  1. Lost in the middle: the model cannot attend equally to all positions
  2. Cost: attention is O(n^2), or O(n log n) with approximate-attention variants; longer context is super-linearly more expensive per query
  3. Noise dilution: irrelevant documents in context degrade performance even if the relevant ones are present
  4. Latency: time-to-first-token scales with context length

RAG with a shorter, focused context typically outperforms raw long-context stuffing on knowledge-intensive tasks.
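
The cost argument is easy to make concrete with back-of-envelope arithmetic. The price below is a hypothetical placeholder, not any provider's actual rate:

```python
# Back-of-envelope per-query cost: corpus stuffing vs. focused retrieval.
# PRICE_PER_1K_INPUT_TOKENS is a hypothetical figure for illustration only.

PRICE_PER_1K_INPUT_TOKENS = 0.003  # hypothetical $/1K input tokens

def query_cost(context_tokens: int) -> float:
    """Input-token cost of a single query at the hypothetical rate."""
    return context_tokens / 1000 * PRICE_PER_1K_INPUT_TOKENS

corpus_tokens = 800_000          # stuff the whole corpus into context
rag_tokens = 2_000 + 10 * 1_000  # system prompt + 10 retrieved 1K chunks

print(f"stuffing: ${query_cost(corpus_tokens):.2f} per query")
print(f"rag:      ${query_cost(rag_tokens):.3f} per query")
```

At these numbers, every stuffed query costs roughly 65x the focused-retrieval query, before accounting for latency or lost-in-the-middle effects.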

Chunking Strategies

How you split documents for retrieval affects what the model sees.

Fixed-size chunking: Split every k tokens. Simple, but breaks semantic boundaries. Overlapping windows (stride < k) partially mitigate this.

Semantic chunking: Split at paragraph or section boundaries. Preserves meaning but produces variable-length chunks.

Hierarchical chunking: Store documents at multiple granularities (sentence, paragraph, section). Retrieve at the granularity appropriate to the query.

Agentic chunking: Let a model decide chunk boundaries based on content structure. More expensive, more accurate.

The right strategy depends on the retrieval model, the document type, and the downstream task. There is no universal best chunking method.
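
As an illustration, here is the first strategy, fixed-size chunking with overlap, sketched with whitespace words standing in for tokens:

```python
# Minimal fixed-size chunker with overlap (stride < chunk size).
# Words approximate tokens here; swap in a real tokenizer in practice.

def chunk_fixed(text: str, size: int = 100, stride: int = 80) -> list[str]:
    """Split into overlapping windows of `size` words, advancing by `stride`."""
    assert stride <= size, "stride > size would drop words between chunks"
    words = text.split()
    chunks = []
    for start in range(0, len(words), stride):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break
    return chunks

doc = " ".join(f"w{i}" for i in range(250))
chunks = chunk_fixed(doc, size=100, stride=80)
print(len(chunks))  # 3 windows: words 0-99, 80-179, 160-249
```

The 20-word overlap means a sentence straddling a boundary still appears whole in at least one chunk, which is exactly the mitigation the fixed-size paragraph above describes.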

Context Compression

When the context window fills up, you need compression strategies:

  • Summarization: condense conversation history or retrieved documents
  • Selective retention: keep only the most relevant prior turns
  • KV-cache pruning: drop attention states for less important tokens
  • Embedding-based memory: store compressed representations of past context and retrieve them as needed

Each introduces information loss. The engineering question is: which information can you afford to lose?
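
Selective retention, the second strategy above, can be sketched as a budgeted greedy selection. The relevance scores here are assumed inputs; producing them (via embeddings, recency decay, or a scoring model) is the hard part in practice.

```python
# Selective retention under a token budget: keep the highest-relevance
# turns that fit, then restore chronological order. Relevance scores are
# assumed to be supplied by some upstream scorer (hypothetical here).

def retain_relevant(turns: list[tuple[float, str]], budget: int) -> list[str]:
    """turns: (relevance_score, text). Keep top-scored turns within budget."""
    ranked = sorted(enumerate(turns), key=lambda t: t[1][0], reverse=True)
    kept, used = set(), 0
    for idx, (score, text) in ranked:
        cost = len(text.split())  # crude token proxy
        if used + cost <= budget:
            kept.add(idx)
            used += cost
    # Emit survivors in their original (chronological) order.
    return [text for i, (score, text) in enumerate(turns) if i in kept]

turns = [(0.9, "user asked about refunds"),
         (0.1, "small talk about weather"),
         (0.8, "assistant quoted refund policy")]
print(retain_relevant(turns, budget=8))
# ['user asked about refunds', 'assistant quoted refund policy']
```

The dropped small talk is the information loss being accepted: the engineering bet is that it will never be referenced again.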

Build It This Way by Default

For most production systems, structured context with retrieval beats raw long-context stuffing. Start with: (1) a well-designed system prompt, (2) RAG with semantic search over chunked documents, (3) conversation history limited to the last k turns plus a running summary, (4) tool outputs injected at the point of use. Only move to full long-context stuffing when you have verified it outperforms this pipeline on your specific task.
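
Step (3) of this recipe, last-k turns plus a running summary, looks like the following. The `summarize` function is a stand-in; in production it would be an LLM call.

```python
# History policy from the default recipe: last k turns verbatim,
# everything older collapsed into a running summary.
# `summarize` is a placeholder for a real summarization model call.

def summarize(turns: list[str]) -> str:
    """Placeholder summarizer; in production, call a model here."""
    return f"[summary of {len(turns)} earlier turns]"

def build_history(turns: list[str], k: int = 4) -> list[str]:
    """Running summary of older turns + the last k turns verbatim."""
    if len(turns) <= k:
        return list(turns)
    return [summarize(turns[:-k])] + turns[-k:]

turns = [f"turn {i}" for i in range(10)]
print(build_history(turns, k=4))
# ['[summary of 6 earlier turns]', 'turn 6', 'turn 7', 'turn 8', 'turn 9']
```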

Common Confusions

Watch Out

Context engineering is not prompt engineering

Prompt engineering optimizes the phrasing of the user-facing query. Context engineering designs the entire pipeline that determines what the model sees: retrieval systems, memory management, context assembly, compression, tool orchestration. A great prompt in a badly engineered context still fails.

Watch Out

More context is not always better

Adding marginally relevant documents to the context often hurts performance. The model attends to noise, gets confused by contradictory passages, or simply loses track of the relevant information among the clutter. Precision in retrieval matters more than recall.

Summary

  • The context window is the only interface between your system and the frozen model; everything the model knows at inference time comes through context
  • Positional bias in attention means placement order matters
  • RAG with focused retrieval typically outperforms naive long-context stuffing
  • Chunking strategy, compression policy, and context assembly order are engineering decisions with large impact on performance
  • Context engineering is a systems discipline, not a copywriting exercise

Exercises

Exercise · Core

Problem

A model has a 128K-token context window. Your system prompt uses 2K tokens, conversation history uses 8K tokens, and the user query is 500 tokens. You retrieve 10 document chunks of 1K tokens each. What fraction of the context window are you using, and where should you place the most relevant chunk?

Exercise · Advanced

Problem

You have a RAG pipeline where each query retrieves the top-20 chunks from a corpus. You notice that increasing to top-50 decreases answer accuracy on your eval set. Explain why, and propose two mitigation strategies.

Exercise · Research

Problem

Design a context compression scheme for multi-turn conversations that preserves the ability to reference specific earlier statements while keeping total context under a fixed budget of B tokens. What information-theoretic tradeoff are you making?

References

Canonical:

  • Liu et al., "Lost in the Middle: How Language Models Use Long Contexts" (2023)
  • Lewis et al., "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks" (2020)

Current:

  • Anthropic, "Long context prompting tips" (2024)
  • Xu et al., "Retrieval meets Long Context Large Language Models" (2024)

Last reviewed: April 2026
