Information Retrieval Foundations
Search as a first-class capability. TF-IDF, BM25, inverted indexes, precision/recall, reranking, and why retrieval is not just vector DB plus embeddings.
Why This Matters
Every RAG system, every search engine, every recommendation pipeline with a retrieval stage depends on information retrieval. The retrieval component is not magic. It is a well-studied subfield with decades of theory, evaluation methodology, and practical engineering. If you build systems that find relevant documents given a query, you are doing IR whether you call it that or not.
Understanding IR means understanding why BM25 still beats many neural retrievers on keyword-heavy queries, why evaluation metrics matter more than model architecture choices, and why the full retrieval stack (indexing, candidate generation, reranking, evaluation) cannot be replaced by a single embedding model.
The Retrieval Problem
Given a query $q$ and a corpus of documents $D = \{d_1, \ldots, d_N\}$, return a ranked list of documents ordered by decreasing relevance to $q$. The corpus may contain millions or billions of documents. The system must respond in milliseconds.
This imposes two constraints: (1) you need a scoring function that approximates relevance, and (2) you need data structures that avoid scanning every document for every query.
Core Definitions
Bag-of-Words Representation
A document $d$ is represented as a vector $\mathbf{d} \in \mathbb{R}^{|V|}$, where $V$ is the vocabulary. Each component counts or weights the corresponding term. Word order is discarded. Despite this crude simplification, bag-of-words models remain competitive for many retrieval tasks because query-document relevance often depends more on which terms appear than on their arrangement.
Term Frequency
The term frequency $\mathrm{tf}(t, d)$ of term $t$ in document $d$ is the number of times $t$ appears in $d$. Raw term frequency is typically transformed (e.g., by taking $1 + \log \mathrm{tf}(t, d)$) to dampen the effect of very high counts. A term appearing 100 times is not 100 times more important than a term appearing once.
Inverse Document Frequency
The inverse document frequency of term $t$ measures how informative $t$ is across the corpus:

$$\mathrm{idf}(t) = \log \frac{N}{\mathrm{df}(t)}$$

where $N$ is the total number of documents and $\mathrm{df}(t)$ is the number of documents containing $t$. Rare terms (low $\mathrm{df}$) get high IDF; common terms (high $\mathrm{df}$) get low IDF. The word "the" has near-zero IDF; a rare technical term has high IDF.
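Plugging numbers into the definition makes the behavior concrete. A minimal sketch (the corpus size and document frequencies below are made up for illustration):

```python
import math

def idf(n_docs: int, doc_freq: int) -> float:
    """Inverse document frequency: log(N / df)."""
    return math.log(n_docs / doc_freq)

N = 1_000_000
print(idf(N, 100))      # rare term: log(10000) ≈ 9.21
print(idf(N, 950_000))  # near-ubiquitous term: log(1.05) ≈ 0.05
```

A term in nearly every document contributes almost nothing to a score; a rare term dominates.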
TF-IDF
The TF-IDF score combines term frequency and inverse document frequency:

$$\text{tf-idf}(t, d) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t)$$

A term scores high if it appears frequently in document $d$ (high TF) and rarely in the corpus overall (high IDF). The relevance score for query $q$ and document $d$ is the sum over query terms:

$$\mathrm{score}(q, d) = \sum_{t \in q} \text{tf-idf}(t, d)$$
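The scoring rule can be sketched in a few lines. The toy corpus and the log-dampened TF are illustrative choices, not a production implementation:

```python
import math
from collections import Counter

def tf_idf_score(query, doc, df, n_docs):
    """score(q, d) = sum over query terms of (1 + log tf) * idf."""
    tf = Counter(doc)
    score = 0.0
    for t in query:
        if tf[t] > 0 and df.get(t, 0) > 0:
            score += (1 + math.log(tf[t])) * math.log(n_docs / df[t])
    return score

# Toy corpus of 3 tokenized documents; df counts documents containing each term.
docs = [["the", "neural", "net"], ["the", "cat"], ["the", "net"]]
df = Counter(t for d in docs for t in set(d))
print(tf_idf_score(["neural", "net"], docs[0], df, len(docs)))  # ≈ 1.50
```

Note that "the" contributes zero to any score here: it appears in all 3 documents, so its IDF is $\log(3/3) = 0$.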
BM25 (Okapi BM25)
BM25 is the standard probabilistic retrieval function. For a query $q$ with terms $t_1, \ldots, t_m$ and a document $d$:

$$\mathrm{BM25}(q, d) = \sum_{t \in q} \mathrm{idf}(t) \cdot \frac{\mathrm{tf}(t, d) \cdot (k_1 + 1)}{\mathrm{tf}(t, d) + k_1 \left(1 - b + b \cdot \frac{|d|}{\mathrm{avgdl}}\right)}$$

where $k_1$ controls term frequency saturation (typically $k_1 \in [1.2, 2.0]$), $b$ controls document length normalization (typically $b = 0.75$), $|d|$ is the document length, and $\mathrm{avgdl}$ is the average document length in the corpus.
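A direct transcription of the scoring function. The smoothed IDF variant below is a common choice (Lucene uses a similar form) and is an assumption here; plain $\log(N/\mathrm{df})$ also works:

```python
import math
from collections import Counter

def bm25(query, doc, df, n_docs, avgdl, k1=1.5, b=0.75):
    """Okapi BM25: idf * tf*(k1+1) / (tf + k1*(1 - b + b*|d|/avgdl))."""
    tf = Counter(doc)
    dl = len(doc)
    score = 0.0
    for t in query:
        if tf[t] == 0 or df.get(t, 0) == 0:
            continue
        # Smoothed IDF: stays positive even for terms in most documents.
        idf = math.log(1 + (n_docs - df[t] + 0.5) / (df[t] + 0.5))
        norm = tf[t] + k1 * (1 - b + b * dl / avgdl)
        score += idf * tf[t] * (k1 + 1) / norm
    return score
```

Setting `b=0` removes the length penalty; setting `k1=0` collapses the TF factor to a constant for any present term, i.e. binary term presence.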
Inverted Index
An inverted index maps each term $t$ in the vocabulary to a posting list: the sorted list of document IDs containing $t$, along with term frequencies and positions. Given a query, the system looks up posting lists for each query term and merges them, avoiding a full corpus scan. Building the index takes $O(N \bar{L})$ time, where $N$ is the number of documents and $\bar{L}$ is the average document length. Query processing time is proportional to the combined length of the query terms' posting lists, which is typically much less than $N$.
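A minimal in-memory sketch, storing term frequencies in each posting (positions are omitted for brevity):

```python
from collections import defaultdict

def build_index(docs):
    """Map each term to a posting dict {doc_id: term_frequency}.
    Doc IDs stay in sorted order because we scan documents in order."""
    index = defaultdict(dict)
    for doc_id, text in enumerate(docs):
        for term in text.split():
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return index

def candidate_docs(index, query):
    """Union-merge posting lists: only documents containing >= 1 query term."""
    hits = set()
    for term in query.split():
        hits |= index.get(term, {}).keys()
    return hits

index = build_index(["the neural net", "the cat sat", "a neural ranker"])
print(candidate_docs(index, "neural ranker"))  # {0, 2}
```

Only documents 0 and 2 are ever scored for this query; document 1 is never touched, which is the whole point of the index.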
Why BM25 Improves on TF-IDF
BM25 addresses two problems with raw TF-IDF:
- Term frequency saturation. In TF-IDF, doubling the term count doubles the score. BM25 uses a saturating function: the score increases with TF but asymptotes. The first occurrence of a term matters most. The parameter $k_1$ controls how quickly saturation occurs. Setting $k_1 = 0$ ignores TF entirely (binary term presence). Setting $k_1 \to \infty$ recovers raw TF.
- Document length normalization. Longer documents naturally contain more term occurrences. BM25 penalizes documents that are longer than average via the $b$ parameter. Setting $b = 0$ disables length normalization. Setting $b = 1$ fully normalizes by document length.
BM25 was derived from the probabilistic relevance model (Robertson and Sparck Jones, 1976) under independence assumptions similar to naive Bayes.
The Probability Ranking Principle
Probability Ranking Principle
Statement
If documents are ranked by decreasing probability of relevance $P(R \mid d, q)$, then the expected loss (under a binary relevance model with uniform cost of errors) is minimized at every rank cutoff. That is, ranking by $P(R \mid d, q)$ is optimal for any cutoff $k$.
Intuition
If you must present documents one at a time, always present the one most likely to be relevant next. This greedy strategy is globally optimal because the relevance of each document is assumed independent of the others. There is no benefit to showing a less-relevant document first to "set up" a more-relevant one later.
Proof Sketch
Consider two documents $d_1$ and $d_2$ with $P(R \mid d_1, q) > P(R \mid d_2, q)$. If $d_2$ appears before $d_1$ in the ranking, swapping them increases the expected number of relevant documents in the top-$k$ for every $k$ that includes exactly one of them. By induction on the number of such swaps, the decreasing-probability ranking is optimal.
Why It Matters
The PRP provides the theoretical justification for probabilistic retrieval models. It says the ranking problem reduces to estimating $P(R \mid d, q)$ for each document. BM25, language models for IR, and neural rankers can all be viewed as different ways to estimate this probability. The PRP tells you what to estimate; the models tell you how.
Failure Mode
The independence assumption fails when documents interact. In a diverse information need (the user wants multiple subtopics covered), showing two documents about the same subtopic is worse than showing one from each, even if both have high $P(R \mid d, q)$. This motivates diversified retrieval methods like MMR (Maximal Marginal Relevance), which trades off relevance and diversity. The PRP also assumes binary relevance; graded relevance requires a different analysis.
Evaluation Metrics
Precision and Recall
For a retrieved set $\mathrm{Ret}$ and the set of relevant documents $\mathrm{Rel}$:

$$\mathrm{Precision} = \frac{|\mathrm{Ret} \cap \mathrm{Rel}|}{|\mathrm{Ret}|}, \qquad \mathrm{Recall} = \frac{|\mathrm{Ret} \cap \mathrm{Rel}|}{|\mathrm{Rel}|}$$
Precision measures the fraction of retrieved documents that are relevant. Recall measures the fraction of relevant documents that are retrieved. There is a fundamental tradeoff: retrieving more documents increases recall but typically decreases precision.
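In code, the two metrics differ only in the denominator:

```python
def precision_recall(retrieved, relevant):
    """Precision: fraction of retrieved that are relevant.
    Recall: fraction of relevant that are retrieved."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    return hits / len(retrieved), hits / len(relevant)

p, r = precision_recall(["d1", "d2", "d3", "d4"], ["d1", "d3", "d9"])
print(p, r)  # 0.5 and 2/3: 2 of 4 retrieved are relevant; 2 of 3 relevant found
```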
Mean Average Precision (MAP)
For a ranked list, average precision is the mean of precision values at each rank where a relevant document appears:

$$\mathrm{AP} = \frac{1}{|\mathrm{Rel}|} \sum_{k=1}^{n} P@k \cdot \mathrm{rel}(k)$$

where $\mathrm{rel}(k) = 1$ if the document at rank $k$ is relevant and 0 otherwise.
MAP is the mean of AP over all queries in the evaluation set. MAP rewards systems that rank relevant documents early and penalizes systems that scatter relevant documents across the ranking.
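Average precision as a single pass over the ranked list (a sketch; ties and unjudged documents are ignored):

```python
def average_precision(ranked, relevant):
    """Mean of precision@k over ranks k where a relevant document appears."""
    relevant = set(relevant)
    hits, total = 0, 0.0
    for k, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / k
    return total / len(relevant)

print(average_precision(["a", "b", "c", "d"], {"a", "c"}))  # (1/1 + 2/3)/2 ≈ 0.833
```

MAP is then just the mean of this value over all queries in the evaluation set.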
Normalized Discounted Cumulative Gain (NDCG)
NDCG handles graded relevance (not just binary). For relevance labels $\mathrm{rel}_1, \ldots, \mathrm{rel}_k$ at ranks $1, \ldots, k$:

$$\mathrm{DCG@}k = \sum_{i=1}^{k} \frac{2^{\mathrm{rel}_i} - 1}{\log_2(i + 1)}, \qquad \mathrm{NDCG@}k = \frac{\mathrm{DCG@}k}{\mathrm{IDCG@}k}$$

where $\mathrm{IDCG@}k$ is the DCG of the ideal ranking. NDCG ranges from 0 to 1. The logarithmic discount reflects the assumption that users are less likely to examine results at lower ranks.
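A sketch of NDCG@k under the exponential gain. Computing the ideal ranking by sorting the same label list assumes all judged documents appear in the list:

```python
import math

def dcg(rels):
    """DCG with gain 2^rel - 1 and discount log2(rank + 1)."""
    return sum((2 ** r - 1) / math.log2(i + 2) for i, r in enumerate(rels))

def ndcg_at_k(rels, k):
    """Normalize by the DCG of the same labels in ideal (descending) order."""
    ideal = dcg(sorted(rels, reverse=True)[:k])
    return dcg(rels[:k]) / ideal if ideal > 0 else 0.0

print(ndcg_at_k([3, 2, 0, 1], k=4))  # near 1: only the last two ranks are swapped
```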
Recall at K
$\mathrm{Recall@}K$ is the fraction of relevant documents in the top $K$ results. In a two-stage retrieval pipeline (retriever + reranker), the retriever's $\mathrm{Recall@}K$ is the ceiling for the full system: the reranker cannot surface documents that the retriever missed. If your retriever has $\mathrm{Recall@}K = 0.80$, your system cannot achieve recall above 0.80 regardless of the reranker.
Query Expansion and Pseudo-Relevance Feedback
Query expansion adds terms to the original query to bridge the vocabulary mismatch problem: the user says "automobile" but the relevant document says "car." Classical methods use pseudo-relevance feedback (PRF): run the original query, assume the top-$k$ results are relevant, extract discriminative terms from those results, and add them to the query for a second retrieval pass. PRF can improve recall significantly but can also cause topic drift when the initial results are poor.
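The PRF loop can be sketched as follows. "Discriminative" is simplified here to "most frequent non-query term in the feedback documents"; real systems weight candidates by TF-IDF or a relevance model:

```python
from collections import Counter

def expand_query(query_terms, feedback_docs, n_new=3):
    """Pseudo-relevance feedback: treat feedback_docs as relevant and
    append their most common terms (excluding the originals) to the query."""
    pool = Counter()
    for doc in feedback_docs:
        pool.update(t for t in doc.split() if t not in query_terms)
    return list(query_terms) + [t for t, _ in pool.most_common(n_new)]

print(expand_query(["automobile"], ["used car sale", "car repair shop"]))
# "car" is appended first (highest frequency in the feedback documents)
```

The second retrieval pass then runs with the expanded term list; if the feedback documents were off-topic, the added terms drag the query further off-topic, which is the topic-drift failure mode.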
Learning to Rank
Learning-to-rank (LTR) uses supervised learning to combine many ranking signals (BM25 score, PageRank, document freshness, query-document features) into a single relevance score. Three paradigms:
- Pointwise: predict relevance for each document independently (regression or classification on each query-document pair).
- Pairwise: predict which of two documents is more relevant (e.g., RankNet, LambdaRank). The loss operates on document pairs.
- Listwise: optimize a metric over the entire ranked list (e.g., LambdaMART optimizes NDCG directly via gradient approximation).
LambdaMART (gradient boosted trees with lambda gradients) remains a strong baseline. It won most of the Yahoo Learning to Rank Challenge (2010) and continues to be deployed in production search systems.
Neural Reranking and Dense Retrieval
Cross-encoder reranking. A cross-encoder takes the concatenated (query, document) pair as input and produces a relevance score. Because it attends to both query and document jointly, it captures fine-grained interactions. The cost: it must run inference for every candidate document, making it too slow for first-stage retrieval over millions of documents. It is used as a second-stage reranker on the top 100-1000 candidates from a fast first-stage retriever.
Dense retrieval (bi-encoder). Encode query and document separately into dense vectors $\mathbf{q}$ and $\mathbf{d}$. Relevance is the dot product $\mathbf{q} \cdot \mathbf{d}$ (or cosine similarity). Because document embeddings are precomputed, retrieval reduces to approximate nearest neighbor (ANN) search. This is fast but loses the fine-grained interaction that cross-encoders capture. ColBERT uses multi-vector representations (one vector per token) with late interaction to balance speed and quality.
Connections to RAG
Retrieval-augmented generation (RAG) uses a retriever to fetch relevant documents and feeds them to a language model as context. The retrieval component is an IR system. The quality of the generated answer is bounded by the quality of the retrieved context. Bad retrieval produces hallucinated or irrelevant answers regardless of the generator quality.
A typical RAG pipeline: query → BM25 or dense retriever (top 100) → cross-encoder reranker (top 5-10) → LLM with retrieved context. Each stage is a standard IR component.
Hybrid Search
Hybrid search combines lexical (BM25) and dense (embedding) retrieval. The standard fusion method is Reciprocal Rank Fusion (RRF):

$$\mathrm{RRF}(d) = \sum_{r \in R} \frac{1}{k + \mathrm{rank}_r(d)}$$

where $k$ is a constant (typically 60) and $\mathrm{rank}_r(d)$ is the rank of document $d$ in ranker $r$'s list. RRF is parameter-free (aside from $k$), does not require score calibration between rankers, and consistently outperforms either ranker alone across diverse benchmarks.
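RRF uses only rank positions, which is why no score calibration is needed. A sketch fusing two ranked lists (the doc IDs are arbitrary labels):

```python
from collections import defaultdict

def rrf_fuse(rankings, k=60):
    """score(d) = sum over rankers of 1 / (k + rank_r(d)); ranks start at 1."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_list = ["d1", "d2", "d3"]
dense_list = ["d3", "d1", "d4"]
print(rrf_fuse([bm25_list, dense_list]))  # d1 and d3 (in both lists) rank first
```

Documents that appear high in both lists accumulate two reciprocal-rank contributions and float to the top, even though the two rankers' raw scores were never compared.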
Common Confusions
Retrieval is not just vector DB plus embeddings
BM25 lexical search, inverted indexes, query expansion, reranking, and evaluation metrics are foundational IR components. Dense retrieval adds to this stack; it does not replace it. On keyword-heavy queries, exact lexical match matters. On entity names and rare terms, BM25 often outperforms dense retrieval. Production systems use hybrid search (BM25 + dense + reranker) because no single method dominates across all query types.
High recall at the retrieval stage is non-negotiable
A reranker can reorder candidates but cannot recover documents that the retriever missed. If your first-stage retriever returns 100 candidates with $\mathrm{Recall@}100 = 0.70$, your system can never achieve recall above 0.70. Investing in retriever quality (better BM25 tuning, query expansion, hybrid search) often matters more than a fancier reranker.
NDCG and MAP measure different things
MAP assumes binary relevance and rewards getting all relevant documents to the top. NDCG handles graded relevance (some documents are more relevant than others) and uses a logarithmic discount. Use NDCG when relevance is graded. Use MAP when relevance is binary. Do not compare MAP values across datasets with different relevance scales.
Key Takeaways
- TF-IDF weights terms by frequency in the document and rarity in the corpus
- BM25 improves on TF-IDF with term frequency saturation and length normalization
- Inverted indexes enable sub-linear query processing over large corpora
- The probability ranking principle says ranking by $P(R \mid d, q)$ is optimal under independence
- Evaluation requires precision, recall, MAP, and NDCG; each captures different aspects
- Hybrid search (BM25 + dense retrieval + reranking) outperforms any single method
- Retriever recall is the ceiling for the entire pipeline
Exercises
Problem
A corpus has $N = 10{,}000$ documents. Term "neural" appears in 50 documents. Term "the" appears in 9,900 documents. Compute the IDF of each term using $\mathrm{idf}(t) = \log(N / \mathrm{df}(t))$. Explain why the ratio of the two IDF values is intuitive.
Problem
A retrieval system returns 10 documents for a query. Documents at ranks 1, 3, 5, and 8 are relevant. There are 6 relevant documents in total. Compute precision@10, recall@10, and average precision (AP).
Problem
In the BM25 formula, analyze the limiting behavior. What happens to the scoring function as $k_1 \to 0$? As $k_1 \to \infty$? As $b \to 0$? As $b \to 1$? Explain the practical implications of each limit.
Problem
You have a retriever with $\mathrm{Recall@}100 = 0.70$ and a cross-encoder reranker. The reranker is perfect (it ranks all relevant documents in the top-100 above all irrelevant ones). What is the maximum possible recall of the full pipeline? What does this imply about where to invest effort?
References
Canonical:
- Manning, Raghavan, and Schütze, Introduction to Information Retrieval (2008), Chapters 1-2, 6-8
- Robertson and Zaragoza, "The Probabilistic Relevance Framework: BM25 and Beyond" (2009)
- Robertson, "The Probability Ranking Principle in IR" (1977)
Evaluation:
- Järvelin and Kekäläinen, "Cumulated Gain-Based Evaluation of IR Techniques" (2002), introduces NDCG
Modern:
- Karpukhin et al., "Dense Passage Retrieval for Open-Domain Question Answering" (DPR, 2020)
- Khattab and Zaharia, "ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction" (2020)
- Cormack et al., "Reciprocal Rank Fusion Outperforms Condorcet and Individual Rank Learning Methods" (2009)
Next Topics
- Word embeddings: dense vector representations that enable semantic similarity for dense retrieval
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Common Probability Distributions (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)