Information Retrieval Foundations
Search as a first-class capability. TF-IDF, BM25, inverted indexes, precision/recall, reranking, and why retrieval is not just vector DB plus embeddings.
Why This Matters
Every RAG system, every search engine, every recommendation pipeline with a retrieval stage depends on information retrieval. The retrieval component is not magic. It is a well-studied subfield with decades of theory, evaluation methodology, and practical engineering. If you build systems that find relevant documents given a query, you are doing IR whether you call it that or not.
Understanding IR means understanding why BM25 still beats many neural retrievers on keyword-heavy queries, why evaluation metrics matter more than model architecture choices, and why the full retrieval stack (indexing, candidate generation, reranking, evaluation) cannot be replaced by a single embedding model.
The Retrieval Problem
Given a query $q$ and a corpus of documents $D = \{d_1, \ldots, d_N\}$, return a ranked list of documents ordered by decreasing relevance to $q$. The corpus may contain millions or billions of documents. The system must respond in milliseconds.
This imposes two constraints: (1) you need a scoring function that approximates relevance, and (2) you need data structures that avoid scanning every document for every query.
Core Definitions
Bag-of-Words Representation
A document $d$ is represented as a vector $\mathbf{d} \in \mathbb{R}^{|V|}$, where $V$ is the vocabulary. Each component counts or weights the corresponding term. Word order is discarded. Despite this crude simplification, bag-of-words models remain competitive for many retrieval tasks because query-document relevance often depends more on which terms appear than on their arrangement.
Term Frequency
The term frequency $\mathrm{tf}(t, d)$ of term $t$ in document $d$ is the number of times $t$ appears in $d$. Raw term frequency is typically transformed (e.g., by taking $1 + \log \mathrm{tf}(t, d)$) to dampen the effect of very high counts. A term appearing 100 times is not 100 times more important than a term appearing once.
Inverse Document Frequency
The inverse document frequency of term $t$ measures how informative $t$ is across the corpus:

$$\mathrm{idf}(t) = \log \frac{N}{\mathrm{df}(t)}$$

where $N$ is the total number of documents and $\mathrm{df}(t)$ is the number of documents containing $t$. Rare terms (low $\mathrm{df}$) get high IDF; common terms (high $\mathrm{df}$) get low IDF. The word "the" has near-zero IDF; a rare technical term has high IDF.
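Plugging numbers into the definition makes the behavior concrete. A minimal sketch (the corpus size and document frequencies below are made up for illustration):

```python
import math

def idf(n_docs: int, doc_freq: int) -> float:
    """Inverse document frequency: log(N / df)."""
    return math.log(n_docs / doc_freq)

N = 1_000_000
print(idf(N, 100))      # rare term: log(10000) ≈ 9.21
print(idf(N, 950_000))  # near-ubiquitous term: log(1.05) ≈ 0.05
```

A term in nearly every document contributes almost nothing to a score; a rare term dominates.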
TF-IDF
The TF-IDF score combines term frequency and inverse document frequency:

$$\text{tf-idf}(t, d) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t)$$

A term scores high if it appears frequently in document $d$ (high TF) and rarely in the corpus overall (high IDF). The relevance score for query $q$ and document $d$ is the sum over query terms:

$$\mathrm{score}(q, d) = \sum_{t \in q} \text{tf-idf}(t, d)$$
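The scoring rule can be sketched in a few lines. The toy corpus and the log-dampened TF are illustrative choices, not a production implementation:

```python
import math
from collections import Counter

def tf_idf_score(query, doc, df, n_docs):
    """score(q, d) = sum over query terms of (1 + log tf) * idf."""
    tf = Counter(doc)
    score = 0.0
    for t in query:
        if tf[t] > 0 and df.get(t, 0) > 0:
            score += (1 + math.log(tf[t])) * math.log(n_docs / df[t])
    return score

# Toy corpus of 3 tokenized documents; df counts documents containing each term.
docs = [["the", "neural", "net"], ["the", "cat"], ["the", "net"]]
df = Counter(t for d in docs for t in set(d))
print(tf_idf_score(["neural", "net"], docs[0], df, len(docs)))  # ≈ 1.50
```

Note that "the" contributes zero to any score here: it appears in all 3 documents, so its IDF is $\log(3/3) = 0$.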
BM25 (Okapi BM25)
BM25 is the standard probabilistic retrieval function. For a query $q$ with terms $t_1, \ldots, t_m$ and a document $d$:

$$\mathrm{BM25}(q, d) = \sum_{t \in q} \mathrm{idf}(t) \cdot \frac{\mathrm{tf}(t, d) \cdot (k_1 + 1)}{\mathrm{tf}(t, d) + k_1 \left(1 - b + b \cdot \frac{|d|}{\mathrm{avgdl}}\right)}$$

where $k_1$ controls term frequency saturation (typically $k_1 \in [1.2, 2.0]$), $b$ controls document length normalization (typically $b = 0.75$), $|d|$ is the document length, and $\mathrm{avgdl}$ is the average document length in the corpus.
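A direct transcription of the scoring function. The smoothed IDF variant below is a common choice (Lucene uses a similar form) and is an assumption here; plain $\log(N/\mathrm{df})$ also works:

```python
import math
from collections import Counter

def bm25(query, doc, df, n_docs, avgdl, k1=1.5, b=0.75):
    """Okapi BM25: idf * tf*(k1+1) / (tf + k1*(1 - b + b*|d|/avgdl))."""
    tf = Counter(doc)
    dl = len(doc)
    score = 0.0
    for t in query:
        if tf[t] == 0 or df.get(t, 0) == 0:
            continue
        # Smoothed IDF: stays positive even for terms in most documents.
        idf = math.log(1 + (n_docs - df[t] + 0.5) / (df[t] + 0.5))
        norm = tf[t] + k1 * (1 - b + b * dl / avgdl)
        score += idf * tf[t] * (k1 + 1) / norm
    return score
```

Setting `b=0` removes the length penalty; setting `k1=0` collapses the TF factor to a constant for any present term, i.e. binary term presence.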
Inverted Index
An inverted index maps each term $t$ in the vocabulary to a posting list: the sorted list of document IDs containing $t$, along with term frequencies and positions. Given a query, the system looks up posting lists for each query term and merges them, avoiding a full corpus scan. Building the index takes $O(N \bar{L})$ time, where $N$ is the number of documents and $\bar{L}$ is the average document length. Query processing time is proportional to the combined length of the query terms' posting lists, which is typically much less than $N$.
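A minimal in-memory sketch, storing term frequencies in each posting (positions are omitted for brevity):

```python
from collections import defaultdict

def build_index(docs):
    """Map each term to a posting dict {doc_id: term_frequency}.
    Doc IDs stay in sorted order because we scan documents in order."""
    index = defaultdict(dict)
    for doc_id, text in enumerate(docs):
        for term in text.split():
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return index

def candidate_docs(index, query):
    """Union-merge posting lists: only documents containing >= 1 query term."""
    hits = set()
    for term in query.split():
        hits |= index.get(term, {}).keys()
    return hits

index = build_index(["the neural net", "the cat sat", "a neural ranker"])
print(candidate_docs(index, "neural ranker"))  # {0, 2}
```

Only documents 0 and 2 are ever scored for this query; document 1 is never touched, which is the whole point of the index.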
Why BM25 Improves on TF-IDF
BM25 addresses two problems with raw TF-IDF:
- Term frequency saturation. In TF-IDF, doubling the term count doubles the score. BM25 uses a saturating function: the score increases with TF but asymptotes. The first occurrence of a term matters most. The parameter $k_1$ controls how quickly saturation occurs. Setting $k_1 = 0$ ignores TF entirely (binary term presence). Setting $k_1 \to \infty$ recovers raw TF.
- Document length normalization. Longer documents naturally contain more term occurrences. BM25 penalizes documents that are longer than average via the $b$ parameter. Setting $b = 0$ disables length normalization. Setting $b = 1$ fully normalizes by document length.
BM25 was derived from the probabilistic relevance model (Robertson and Sparck Jones, 1976) under independence assumptions similar to naive Bayes.
The Probability Ranking Principle
Probability Ranking Principle
Statement
If documents are ranked by decreasing probability of relevance $P(R \mid d, q)$, then the expected loss (under a binary relevance model with uniform cost of errors) is minimized at every rank cutoff. That is, ranking by $P(R \mid d, q)$ is optimal for any cutoff $k$.
Intuition
If you must present documents one at a time, always present the one most likely to be relevant next. This greedy strategy is globally optimal because the relevance of each document is assumed independent of the others. There is no benefit to showing a less-relevant document first to "set up" a more-relevant one later.
Proof Sketch
Consider two documents $d_1$ and $d_2$ with $P(R \mid d_1, q) > P(R \mid d_2, q)$. If $d_2$ appears before $d_1$ in the ranking, swapping them increases the expected number of relevant documents in the top-$k$ for every $k$ that includes exactly one of them. By induction on the number of such swaps, the decreasing-probability ranking is optimal.
Why It Matters
The PRP provides the theoretical justification for probabilistic retrieval models. It says the ranking problem reduces to estimating $P(R \mid d, q)$ for each document. BM25, language models for IR, and neural rankers can all be viewed as different ways to estimate this probability. The PRP tells you what to estimate; the models tell you how.
Failure Mode
The independence assumption fails when documents interact. In a diverse information need (the user wants multiple subtopics covered), showing two documents about the same subtopic is worse than showing one from each, even if both have high $P(R \mid d, q)$. This motivates diversified retrieval methods like MMR (Maximal Marginal Relevance), which trades off relevance and diversity. The PRP also assumes binary relevance; graded relevance requires a different analysis.
Evaluation Metrics
Precision and Recall
For a retrieved set $\mathrm{Ret}$ and the set of relevant documents $\mathrm{Rel}$:

$$\mathrm{Precision} = \frac{|\mathrm{Ret} \cap \mathrm{Rel}|}{|\mathrm{Ret}|}, \qquad \mathrm{Recall} = \frac{|\mathrm{Ret} \cap \mathrm{Rel}|}{|\mathrm{Rel}|}$$
Precision measures the fraction of retrieved documents that are relevant. Recall measures the fraction of relevant documents that are retrieved. There is a fundamental tradeoff: retrieving more documents increases recall but typically decreases precision.
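In code, the two metrics differ only in the denominator:

```python
def precision_recall(retrieved, relevant):
    """Precision: fraction of retrieved that are relevant.
    Recall: fraction of relevant that are retrieved."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    return hits / len(retrieved), hits / len(relevant)

p, r = precision_recall(["d1", "d2", "d3", "d4"], ["d1", "d3", "d9"])
print(p, r)  # 0.5 and 2/3: 2 of 4 retrieved are relevant; 2 of 3 relevant found
```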
Mean Average Precision (MAP)
For a ranked list, average precision is the mean of precision values at each rank where a relevant document appears:

$$\mathrm{AP} = \frac{1}{|\mathrm{Rel}|} \sum_{k=1}^{n} P@k \cdot \mathrm{rel}(k)$$

where $\mathrm{rel}(k) = 1$ if the document at rank $k$ is relevant and 0 otherwise.
MAP is the mean of AP over all queries in the evaluation set. MAP rewards systems that rank relevant documents early and penalizes systems that scatter relevant documents across the ranking.
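Average precision as a single pass over the ranked list (a sketch; ties and unjudged documents are ignored):

```python
def average_precision(ranked, relevant):
    """Mean of precision@k over ranks k where a relevant document appears."""
    relevant = set(relevant)
    hits, total = 0, 0.0
    for k, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / k
    return total / len(relevant)

print(average_precision(["a", "b", "c", "d"], {"a", "c"}))  # (1/1 + 2/3)/2 ≈ 0.833
```

MAP is then just the mean of this value over all queries in the evaluation set.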
Normalized Discounted Cumulative Gain (NDCG)
NDCG handles graded relevance (not just binary). For relevance labels $\mathrm{rel}_1, \ldots, \mathrm{rel}_k$ at ranks $1, \ldots, k$:

$$\mathrm{DCG@}k = \sum_{i=1}^{k} \frac{2^{\mathrm{rel}_i} - 1}{\log_2(i + 1)}, \qquad \mathrm{NDCG@}k = \frac{\mathrm{DCG@}k}{\mathrm{IDCG@}k}$$

where $\mathrm{IDCG@}k$ is the DCG of the ideal ranking. NDCG ranges from 0 to 1. The logarithmic discount reflects the assumption that users are less likely to examine results at lower ranks.
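A sketch of NDCG@k under the exponential gain. Computing the ideal ranking by sorting the same label list assumes all judged documents appear in the list:

```python
import math

def dcg(rels):
    """DCG with gain 2^rel - 1 and discount log2(rank + 1)."""
    return sum((2 ** r - 1) / math.log2(i + 2) for i, r in enumerate(rels))

def ndcg_at_k(rels, k):
    """Normalize by the DCG of the same labels in ideal (descending) order."""
    ideal = dcg(sorted(rels, reverse=True)[:k])
    return dcg(rels[:k]) / ideal if ideal > 0 else 0.0

print(ndcg_at_k([3, 2, 0, 1], k=4))  # near 1: only the last two ranks are swapped
```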
Recall at K
$\mathrm{Recall@}K$ is the fraction of relevant documents in the top $K$ results. In a two-stage retrieval pipeline (retriever + reranker), the retriever's $\mathrm{Recall@}K$ is the ceiling for the full system: the reranker cannot surface documents that the retriever missed. If your retriever has $\mathrm{Recall@}K = 0.80$, your system cannot achieve recall above 0.80 regardless of the reranker.
Query Expansion and Pseudo-Relevance Feedback
Query expansion adds terms to the original query to bridge the vocabulary mismatch problem: the user says "automobile" but the relevant document says "car." Classical methods use pseudo-relevance feedback (PRF): run the original query, assume the top-$k$ results are relevant, extract discriminative terms from those results, and add them to the query for a second retrieval pass. PRF can improve recall significantly but can also cause topic drift when the initial results are poor.
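The PRF loop can be sketched as follows. "Discriminative" is simplified here to "most frequent non-query term in the feedback documents"; real systems weight candidates by TF-IDF or a relevance model:

```python
from collections import Counter

def expand_query(query_terms, feedback_docs, n_new=3):
    """Pseudo-relevance feedback: treat feedback_docs as relevant and
    append their most common terms (excluding the originals) to the query."""
    pool = Counter()
    for doc in feedback_docs:
        pool.update(t for t in doc.split() if t not in query_terms)
    return list(query_terms) + [t for t, _ in pool.most_common(n_new)]

print(expand_query(["automobile"], ["used car sale", "car repair shop"]))
# "car" is appended first (highest frequency in the feedback documents)
```

The second retrieval pass then runs with the expanded term list; if the feedback documents were off-topic, the added terms drag the query further off-topic, which is the topic-drift failure mode.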
Learning to Rank
Learning-to-rank (LTR) uses supervised learning to combine many ranking signals (BM25 score, PageRank, document freshness, query-document features) into a single relevance score. Three paradigms:
- Pointwise: predict relevance for each document independently (regression or classification on each query-document pair).
- Pairwise: predict which of two documents is more relevant (e.g., RankNet, LambdaRank). The loss operates on document pairs.
- Listwise: optimize a metric over the entire ranked list (e.g., LambdaMART optimizes NDCG directly via gradient approximation).
LambdaMART (gradient boosted trees with lambda gradients) remains a strong baseline. It won most of the Yahoo Learning to Rank Challenge (2010) and continues to be deployed in production search systems.
Neural Reranking and Dense Retrieval
Cross-encoder reranking. A cross-encoder takes the concatenated (query, document) pair as input and produces a relevance score. Because it attends to both query and document jointly, it captures fine-grained interactions. The cost: it must run inference for every candidate document, making it too slow for first-stage retrieval over millions of documents. It is used as a second-stage reranker on the top 100-1000 candidates from a fast first-stage retriever.
Dense retrieval (bi-encoder). Encode query and document separately into dense vectors $\mathbf{q}$ and $\mathbf{d}$. Relevance is the dot product $\mathbf{q} \cdot \mathbf{d}$ (or cosine similarity). Because document embeddings are precomputed, retrieval reduces to approximate nearest neighbor (ANN) search. This is fast but loses the fine-grained interaction that cross-encoders capture. ColBERT uses multi-vector representations (one vector per token) with late interaction to balance speed and quality.
Connections to RAG
Retrieval-augmented generation (RAG) uses a retriever to fetch relevant documents and feeds them to a language model as context. The retrieval component is an IR system. The quality of the generated answer is bounded by the quality of the retrieved context. Bad retrieval produces hallucinated or irrelevant answers regardless of the generator quality.
A typical RAG pipeline: query → BM25 or dense retriever (top 100) → cross-encoder reranker (top 5-10) → LLM with retrieved context. Each stage is a standard IR component.
Hybrid Search
Hybrid search combines lexical (BM25) and dense (embedding) retrieval. The standard fusion method is Reciprocal Rank Fusion (RRF):

$$\mathrm{RRF}(d) = \sum_{r \in R} \frac{1}{k + \mathrm{rank}_r(d)}$$

where $k$ is a constant (typically 60) and $\mathrm{rank}_r(d)$ is the rank of document $d$ in ranker $r$'s list. RRF is parameter-free (aside from $k$), does not require score calibration between rankers, and consistently outperforms either ranker alone across diverse benchmarks.
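RRF uses only rank positions, which is why no score calibration is needed. A sketch fusing two ranked lists (the doc IDs are arbitrary labels):

```python
from collections import defaultdict

def rrf_fuse(rankings, k=60):
    """score(d) = sum over rankers of 1 / (k + rank_r(d)); ranks start at 1."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_list = ["d1", "d2", "d3"]
dense_list = ["d3", "d1", "d4"]
print(rrf_fuse([bm25_list, dense_list]))  # d1 and d3 (in both lists) rank first
```

Documents that appear high in both lists accumulate two reciprocal-rank contributions and float to the top, even though the two rankers' raw scores were never compared.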
Common Confusions
Retrieval is not just vector DB plus embeddings
BM25 lexical search, inverted indexes, query expansion, reranking, and evaluation metrics are foundational IR components. Dense retrieval adds to this stack; it does not replace it. On keyword-heavy queries, exact lexical match matters. On entity names and rare terms, BM25 often outperforms dense retrieval. Production systems use hybrid search (BM25 + dense + reranker) because no single method dominates across all query types.
High recall at the retrieval stage is non-negotiable
A reranker can reorder candidates but cannot recover documents that the retriever missed. If your first-stage retriever returns 100 candidates with $\mathrm{Recall@}100 = 0.70$, your system can never achieve recall above 0.70. Investing in retriever quality (better BM25 tuning, query expansion, hybrid search) often matters more than a fancier reranker.
NDCG and MAP measure different things
MAP assumes binary relevance and rewards getting all relevant documents to the top. NDCG handles graded relevance (some documents are more relevant than others) and uses a logarithmic discount. Use NDCG when relevance is graded. Use MAP when relevance is binary. Do not compare MAP values across datasets with different relevance scales.
Key Takeaways
- TF-IDF weights terms by frequency in the document and rarity in the corpus
- BM25 improves on TF-IDF with term frequency saturation and length normalization
- Inverted indexes enable sub-linear query processing over large corpora
- The probability ranking principle says ranking by $P(R \mid d, q)$ is optimal under independence
- Evaluation requires precision, recall, MAP, and NDCG; each captures different aspects
- Hybrid search (BM25 + dense retrieval + reranking) outperforms any single method
- Retriever recall is the ceiling for the entire pipeline
Exercises
Problem
A corpus has $N = 10{,}000$ documents. Term "neural" appears in 50 documents. Term "the" appears in 9,900 documents. Compute the IDF of each term using $\mathrm{idf}(t) = \log(N / \mathrm{df}(t))$. Explain why the ratio of the two IDF values is intuitive.
Problem
A retrieval system returns 10 documents for a query. Documents at ranks 1, 3, 5, and 8 are relevant. There are 6 relevant documents in total. Compute precision@10, recall@10, and average precision (AP).
Problem
In the BM25 formula, analyze the limiting behavior. What happens to the scoring function as $k_1 \to 0$? As $k_1 \to \infty$? As $b \to 0$? As $b \to 1$? Explain the practical implications of each limit.
Problem
You have a retriever with $\mathrm{Recall@}100 = 0.70$ and a cross-encoder reranker. The reranker is perfect (it ranks all relevant documents in the top-100 above all irrelevant ones). What is the maximum possible recall of the full pipeline? What does this imply about where to invest effort?
References
Canonical:
- Manning, Raghavan, and Schütze, Introduction to Information Retrieval (2008), Chapters 1-2, 6-8
- Robertson and Zaragoza, "The Probabilistic Relevance Framework: BM25 and Beyond" (2009)
- Robertson, "The Probability Ranking Principle in IR" (1977)
Evaluation:
- Järvelin and Kekäläinen, "Cumulated Gain-Based Evaluation of IR Techniques" (2002), introduces NDCG
Modern:
- Karpukhin et al., "Dense Passage Retrieval for Open-Domain Question Answering" (DPR, 2020)
- Khattab and Zaharia, "ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction" (2020)
- Cormack et al., "Reciprocal Rank Fusion Outperforms Condorcet and Individual Rank Learning Methods" (2009)
Next Topics
- Word embeddings: dense vector representations that enable semantic similarity for dense retrieval
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Common Probability Distributions (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)