

Information Retrieval Foundations

Search as a first-class capability. TF-IDF, BM25, inverted indexes, precision/recall, reranking, and why retrieval is not just vector DB plus embeddings.


Why This Matters

Every RAG system, every search engine, every recommendation pipeline with a retrieval stage depends on information retrieval. The retrieval component is not magic. It is a well-studied subfield with decades of theory, evaluation methodology, and practical engineering. If you build systems that find relevant documents given a query, you are doing IR whether you call it that or not.

Understanding IR means understanding why BM25 still beats many neural retrievers on keyword-heavy queries, why evaluation metrics matter more than model architecture choices, and why the full retrieval stack (indexing, candidate generation, reranking, evaluation) cannot be replaced by a single embedding model.

The Retrieval Problem

Given a query q and a corpus \mathcal{C} = \{d_1, d_2, \ldots, d_N\} of N documents, return a ranked list of documents ordered by decreasing relevance to q. The corpus may contain millions or billions of documents. The system must respond in milliseconds.

This imposes two constraints: (1) you need a scoring function that approximates relevance, and (2) you need data structures that avoid scanning every document for every query.

Core Definitions

Definition

Bag-of-Words Representation

A document d is represented as a vector \mathbf{d} \in \mathbb{R}^{|V|}, where V is the vocabulary. Each component counts or weights the corresponding term. Word order is discarded. Despite this crude simplification, bag-of-words models remain competitive for many retrieval tasks because query-document relevance often depends more on which terms appear than on their arrangement.

Definition

Term Frequency

The term frequency of term t in document d is the number of times t appears in d. Raw term frequency is typically transformed (e.g., by taking \log(1 + \text{tf})) to dampen the effect of very high counts. A term appearing 100 times is not 100 times more important than a term appearing once.

Definition

Inverse Document Frequency

The inverse document frequency of term t measures how informative t is across the corpus:

\text{idf}(t) = \log \frac{N}{\text{df}(t)}

where N is the total number of documents and \text{df}(t) is the number of documents containing t. Rare terms (low \text{df}) get high IDF; common terms (high \text{df}) get low IDF. The word "the" has near-zero IDF; a rare technical term has high IDF.
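As a quick numerical check of the definition, here is a minimal sketch with made-up document frequencies (the corpus size and counts are illustrative, not from the text):

```python
import math

def idf(N, df):
    """Inverse document frequency: log(N / df)."""
    return math.log(N / df)

# Hypothetical corpus of 1,000 documents.
rare = idf(1000, 2)       # term appearing in only 2 documents
common = idf(1000, 990)   # term appearing in nearly every document
# The rare term's IDF is large; the common term's IDF is close to zero.
```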

Definition

TF-IDF

The TF-IDF score combines term frequency and inverse document frequency:

\text{tfidf}(t, d) = \text{tf}(t, d) \cdot \text{idf}(t)

A term scores high if it appears frequently in document d (high TF) and rarely in the corpus overall (high IDF). The relevance score for query q and document d is the sum over query terms:

\text{score}(q, d) = \sum_{t \in q} \text{tfidf}(t, d)
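The summed score can be sketched directly, assuming whitespace-tokenized documents and a precomputed document-frequency table; the helper name and toy data below are illustrative, and TF is dampened with \log(1 + \text{tf}) as described above:

```python
import math
from collections import Counter

def tfidf_score(query_terms, doc_terms, df, N):
    """Sum of dampened-TF * IDF over query terms present in the document.
    df maps term -> document frequency across the corpus."""
    tf = Counter(doc_terms)
    score = 0.0
    for t in query_terms:
        if tf[t] > 0 and df.get(t, 0) > 0:
            score += math.log(1 + tf[t]) * math.log(N / df[t])
    return score

# Toy corpus statistics (hypothetical): "cat" is rare, "the" is common.
df = {"cat": 3, "the": 98}
doc = "the cat sat on the mat".split()
score = tfidf_score(["cat", "the"], doc, df, N=100)
```

Note how "the" contributes almost nothing despite appearing twice in the document: its IDF is near zero.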

Definition

BM25 (Okapi BM25)

BM25 is the standard probabilistic retrieval function. For a query q with terms t_1, \ldots, t_m and a document d:

\text{BM25}(q, d) = \sum_{i=1}^{m} \text{idf}(t_i) \cdot \frac{\text{tf}(t_i, d) \cdot (k_1 + 1)}{\text{tf}(t_i, d) + k_1 \cdot \left(1 - b + b \cdot \frac{|d|}{\text{avgdl}}\right)}

where k_1 controls term frequency saturation (typically k_1 \in [1.2, 2.0]), b controls document length normalization (typically b = 0.75), |d| is the document length, and \text{avgdl} is the average document length in the corpus.
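A direct transcription of the formula as a sketch, assuming tokenized documents and a precomputed document-frequency table (names and toy values are illustrative):

```python
import math
from collections import Counter

def bm25_score(query_terms, doc_terms, df, N, avgdl, k1=1.5, b=0.75):
    """BM25 score of one document for a query.
    df maps term -> document frequency across the corpus."""
    tf = Counter(doc_terms)
    dl = len(doc_terms)
    score = 0.0
    for t in query_terms:
        if tf[t] == 0 or df.get(t, 0) == 0:
            continue
        idf = math.log(N / df[t])
        norm = k1 * (1 - b + b * dl / avgdl)  # length-normalized denominator term
        score += idf * tf[t] * (k1 + 1) / (tf[t] + norm)
    return score

# Same term frequency, different document lengths: the shorter
# document scores higher because of length normalization.
s_short = bm25_score(["cat"], ["cat", "sat"], {"cat": 10}, N=1000, avgdl=5)
s_long = bm25_score(["cat"], ["cat"] + ["filler"] * 8, {"cat": 10}, N=1000, avgdl=5)
```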

Definition

Inverted Index

An inverted index maps each term t in the vocabulary to a posting list: the sorted list of document IDs containing t, along with term frequencies and positions. Given a query, the system looks up posting lists for each query term and merges them, avoiding a full corpus scan. Building the index takes O(N \cdot L) time, where L is the average document length. Query processing takes O(\sum_{t \in q} \text{df}(t)) time, which is typically much less than N.
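A minimal in-memory sketch (positions omitted; whitespace tokenization assumed) showing how candidate generation touches only the posting lists of the query terms:

```python
from collections import defaultdict

def build_index(docs):
    """Map each term to a sorted posting list of (doc_id, term_frequency)."""
    postings = defaultdict(dict)
    for doc_id, text in enumerate(docs):
        for term in text.lower().split():
            postings[term][doc_id] = postings[term].get(doc_id, 0) + 1
    return {t: sorted(p.items()) for t, p in postings.items()}

docs = ["the cat sat", "the dog sat on the cat", "birds fly"]
index = build_index(docs)

# Candidate generation: union of posting lists for the query terms.
# Document 2 ("birds fly") is never examined.
candidates = set()
for term in ["cat", "sat"]:
    candidates.update(doc_id for doc_id, _ in index.get(term, []))
```

Production indexes add compression, skip pointers, and positional data, but the lookup-and-merge structure is the same.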

Why BM25 Improves on TF-IDF

BM25 addresses two problems with raw TF-IDF:

  1. Term frequency saturation. In TF-IDF, doubling the term count doubles the score. BM25 uses a saturating function: the score increases with TF but asymptotes. The first occurrence of a term matters most. The parameter k_1 controls how quickly saturation occurs. Setting k_1 = 0 ignores TF entirely (binary term presence). Letting k_1 \to \infty recovers raw TF.

  2. Document length normalization. Longer documents naturally contain more term occurrences. BM25 penalizes documents that are longer than average via the b parameter. Setting b = 0 disables length normalization. Setting b = 1 fully normalizes by document length.
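The saturation behavior is easy to see numerically. With the length term factored out, BM25's TF factor is \text{tf} \cdot (k_1 + 1) / (\text{tf} + k_1), which is bounded above by k_1 + 1 (a sketch, not the full scoring function):

```python
def tf_factor(tf, k1=1.5):
    """BM25's term-frequency factor with the length term set to 1."""
    return tf * (k1 + 1) / (tf + k1)

# Diminishing returns: each extra occurrence adds less,
# and the factor approaches k1 + 1 = 2.5 but never reaches it.
factors = [tf_factor(tf) for tf in (1, 2, 10, 100)]
```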

BM25 was derived from the probabilistic relevance model (Robertson and Sparck Jones, 1976) under independence assumptions similar to naive Bayes.

The Probability Ranking Principle

Theorem

Probability Ranking Principle

Statement

If documents are ranked by decreasing probability of relevance P(\text{rel} \mid q, d), then the expected loss (under a binary relevance model with uniform cost of errors) is minimized at every rank cutoff. That is, ranking by P(\text{rel} \mid q, d) is optimal for any cutoff k.

Intuition

If you must present documents one at a time, always present the one most likely to be relevant next. This greedy strategy is globally optimal because the relevance of each document is assumed independent of the others. There is no benefit to showing a less-relevant document first to "set up" a more-relevant one later.

Proof Sketch

Consider two documents d_i and d_j with P(\text{rel} \mid q, d_i) > P(\text{rel} \mid q, d_j). If d_j appears before d_i in the ranking, swapping them increases the expected number of relevant documents in the top-k for every k that includes exactly one of them. By induction on the number of such swaps, the decreasing-probability ranking is optimal.

Why It Matters

The PRP provides the theoretical justification for probabilistic retrieval models. It says the ranking problem reduces to estimating P(\text{rel} \mid q, d) for each document. BM25, language models for IR, and neural rankers can all be viewed as different ways to estimate this probability. The PRP tells you what to estimate; the models tell you how.

Failure Mode

The independence assumption fails when documents interact. In a diverse information need (the user wants multiple subtopics covered), showing two documents about the same subtopic is worse than showing one from each, even if both have high P(\text{rel}). This motivates diversified retrieval methods like MMR (Maximal Marginal Relevance), which trades off relevance and diversity. The PRP also assumes binary relevance; graded relevance requires a different analysis.

Evaluation Metrics

Definition

Precision and Recall

For a retrieved set R and the set of relevant documents \text{Rel}:

\text{Precision} = \frac{|R \cap \text{Rel}|}{|R|}, \quad \text{Recall} = \frac{|R \cap \text{Rel}|}{|\text{Rel}|}

Precision measures the fraction of retrieved documents that are relevant. Recall measures the fraction of relevant documents that are retrieved. There is a fundamental tradeoff: retrieving more documents increases recall but typically decreases precision.
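Both definitions translate to a few lines of set arithmetic (document IDs below are made up):

```python
def precision_recall(retrieved, relevant):
    """Set-based precision and recall for one query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    return hits / len(retrieved), hits / len(relevant)

# 4 documents retrieved, 2 of them relevant, out of 3 relevant overall.
p, r = precision_recall(retrieved=[1, 2, 3, 4], relevant=[2, 4, 7])
# p = 2/4, r = 2/3
```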

Definition

Mean Average Precision (MAP)

For a ranked list, average precision is the mean of precision values at each rank where a relevant document appears:

\text{AP} = \frac{1}{|\text{Rel}|} \sum_{k=1}^{N} \text{Precision}(k) \cdot \mathbb{1}[d_k \in \text{Rel}]

MAP is the mean of AP over all queries in the evaluation set. MAP rewards systems that rank relevant documents early and penalizes systems that scatter relevant documents across the ranking.
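A direct implementation of AP for one query (MAP is just the mean of this value over queries; the toy ranking below is illustrative):

```python
def average_precision(ranking, relevant):
    """AP: mean of precision@k over the ranks k that hold a relevant document."""
    relevant = set(relevant)
    hits, total = 0, 0.0
    for k, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / k  # precision at this rank
    return total / len(relevant)

# Relevant documents at ranks 1 and 3: AP = (1/1 + 2/3) / 2.
ap = average_precision(["a", "b", "c", "d"], relevant=["a", "c"])
```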

Definition

Normalized Discounted Cumulative Gain (NDCG)

NDCG handles graded relevance (not just binary). For relevance labels \text{rel}_1, \text{rel}_2, \ldots at ranks 1, 2, \ldots:

\text{DCG}@k = \sum_{i=1}^{k} \frac{2^{\text{rel}_i} - 1}{\log_2(i + 1)}

\text{NDCG}@k = \frac{\text{DCG}@k}{\text{IDCG}@k}

where \text{IDCG}@k is the DCG of the ideal ranking. NDCG ranges from 0 to 1. The logarithmic discount reflects the assumption that users are less likely to examine results at lower ranks.
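The two formulas above can be implemented directly; the ideal ranking is obtained by sorting the relevance labels in decreasing order (a sketch with made-up graded labels):

```python
import math

def dcg_at_k(rels, k):
    """DCG with gain 2^rel - 1 and discount log2(rank + 1), ranks starting at 1."""
    return sum((2 ** rel - 1) / math.log2(i + 1)
               for i, rel in enumerate(rels[:k], start=1))

def ndcg_at_k(rels, k):
    """NDCG: DCG normalized by the DCG of the ideal (sorted) ranking."""
    ideal = dcg_at_k(sorted(rels, reverse=True), k)
    return dcg_at_k(rels, k) / ideal if ideal > 0 else 0.0

perfect = ndcg_at_k([3, 2, 1], 3)   # already ideally ordered -> 1.0
inverted = ndcg_at_k([1, 2, 3], 3)  # worst ordering of the same labels -> < 1
```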

Definition

Recall at K

\text{Recall}@K is the fraction of relevant documents that appear in the top K results. In a two-stage retrieval pipeline (retriever + reranker), the retriever's \text{Recall}@K is the ceiling for the full system: the reranker cannot surface documents that the retriever missed. If your retriever has \text{Recall}@100 = 0.80, your system cannot achieve recall above 0.80 regardless of the reranker.

Query Expansion and Pseudo-Relevance Feedback

Query expansion adds terms to the original query to bridge the vocabulary mismatch problem: the user says "automobile" but the relevant document says "car." Classical methods use pseudo-relevance feedback (PRF): run the original query, assume the top-kk results are relevant, extract discriminative terms from those results, and add them to the query for a second retrieval pass. PRF can improve recall significantly but can also cause topic drift when the initial results are poor.
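The PRF loop described above can be sketched in a few lines. This toy version ranks expansion candidates by raw frequency in the assumed-relevant documents; a real system would weight candidates by TF-IDF or a relevance model, as the helper comment notes (all names and data here are illustrative):

```python
from collections import Counter

def expand_query(query_terms, top_docs, num_expansion_terms=2):
    """Pseudo-relevance feedback sketch: treat the top-ranked documents as
    relevant and append their most frequent unseen terms to the query.
    (A real system would score candidates by TF-IDF, not raw counts.)"""
    counts = Counter(t for doc in top_docs for t in doc.lower().split())
    for t in query_terms:
        counts.pop(t, None)  # never re-add original query terms
    expansion = [t for t, _ in counts.most_common(num_expansion_terms)]
    return list(query_terms) + expansion

# "automobile" is frequent in the top results, bridging the vocabulary gap.
expanded = expand_query(["car"], ["used car automobile listings",
                                  "automobile dealer prices"])
```

Note that low-value terms can sneak into the expansion when the initial results are noisy; this is exactly the topic-drift risk mentioned above.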

Learning to Rank

Learning-to-rank (LTR) uses supervised learning to combine many ranking signals (BM25 score, PageRank, document freshness, query-document features) into a single relevance score. Three paradigms:

  • Pointwise: predict relevance for each document independently (regression or classification on each query-document pair).
  • Pairwise: predict which of two documents is more relevant (e.g., RankNet, LambdaRank). The loss operates on document pairs.
  • Listwise: optimize a metric over the entire ranked list (e.g., LambdaMART optimizes NDCG directly via gradient approximation).

LambdaMART (gradient boosted trees with lambda gradients) remains a strong baseline. LambdaMART-based ensembles won the Yahoo! Learning to Rank Challenge (2010), and the method continues to be deployed in production search systems.
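The pairwise paradigm can be made concrete with a RankNet-style loss: model the probability that document i outranks document j as a sigmoid of the score difference, and minimize cross-entropy over labeled pairs. The function below is a sketch; the scores would come from any feature-based model:

```python
import math

def pairwise_loss(s_preferred, s_other):
    """RankNet-style cross-entropy for one ordered pair:
    P(preferred beats other) = sigmoid(s_preferred - s_other),
    loss = -log P = log(1 + exp(-(s_preferred - s_other)))."""
    return math.log(1 + math.exp(-(s_preferred - s_other)))

# The loss shrinks as the model separates the pair by a larger margin.
losses = [pairwise_loss(margin, 0.0) for margin in (0.0, 1.0, 2.0)]
```

Gradients of this loss with respect to the scores are the "lambdas" that LambdaRank and LambdaMART reweight by the metric change (e.g., NDCG delta) from swapping the pair.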

Neural Reranking and Dense Retrieval

Cross-encoder reranking. A cross-encoder takes the concatenated (query, document) pair as input and produces a relevance score. Because it attends to both query and document jointly, it captures fine-grained interactions. The cost: it must run inference for every candidate document, making it too slow for first-stage retrieval over millions of documents. It is used as a second-stage reranker on the top 100-1000 candidates from a fast first-stage retriever.

Dense retrieval (bi-encoder). Encode query and document separately into dense vectors \mathbf{q}, \mathbf{d} \in \mathbb{R}^d. Relevance is \text{sim}(\mathbf{q}, \mathbf{d}) = \mathbf{q} \cdot \mathbf{d}. Because document embeddings are precomputed, retrieval reduces to approximate nearest neighbor (ANN) search. This is fast but loses the fine-grained interaction that cross-encoders capture. ColBERT uses multi-vector representations (one vector per token) with late interaction to balance speed and quality.
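Once embeddings exist, first-stage dense retrieval is just maximum dot product. The sketch below uses an exhaustive scan with toy 2-d vectors; a production system would precompute document embeddings with an encoder and replace the scan with an ANN index:

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def dense_retrieve(query_vec, doc_vecs, k):
    """Exhaustive nearest-neighbor search by dot product.
    Real systems swap this O(N) scan for an ANN index."""
    ranked = sorted(range(len(doc_vecs)),
                    key=lambda i: dot(query_vec, doc_vecs[i]),
                    reverse=True)
    return ranked[:k]

# Toy embeddings: the query points mostly along the first axis.
doc_vecs = [[0.0, 1.0], [1.0, 0.0], [0.7, 0.7]]
top = dense_retrieve([1.0, 0.1], doc_vecs, k=2)
```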

Connections to RAG

Retrieval-augmented generation (RAG) uses a retriever to fetch relevant documents and feeds them to a language model as context. The retrieval component is an IR system. The quality of the generated answer is bounded by the quality of the retrieved context. Bad retrieval produces hallucinated or irrelevant answers regardless of the generator quality.

A typical RAG pipeline: query → BM25 or dense retriever (top 100) → cross-encoder reranker (top 5-10) → LLM with retrieved context. Each stage is a standard IR component.

Hybrid Search

Hybrid search combines lexical (BM25) and dense (embedding) retrieval. The standard fusion method is Reciprocal Rank Fusion (RRF):

\text{RRF}(d) = \sum_{r \in \text{rankers}} \frac{1}{k + \text{rank}_r(d)}

where k is a constant (typically 60) and \text{rank}_r(d) is the rank of document d in ranker r's list. RRF is parameter-free (aside from k), does not require score calibration between rankers, and consistently outperforms either ranker alone across diverse benchmarks.
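RRF needs only the rank positions, which is why no score calibration is required. A minimal implementation (the two toy ranked lists are illustrative):

```python
def rrf_fuse(rankings, k=60):
    """Reciprocal Rank Fusion over several ranked lists of document IDs.
    Ranks start at 1; documents absent from a list contribute nothing."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_ranking = ["d1", "d2", "d3"]
dense_ranking = ["d2", "d4", "d1"]
# "d2" wins: it ranks high in both lists.
fused = rrf_fuse([bm25_ranking, dense_ranking])
```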

Common Confusions

Watch Out

Retrieval is not just vector DB plus embeddings

BM25 lexical search, inverted indexes, query expansion, reranking, and evaluation metrics are foundational IR components. Dense retrieval adds to this stack; it does not replace it. On keyword-heavy queries, exact lexical match matters. On entity names and rare terms, BM25 often outperforms dense retrieval. Production systems use hybrid search (BM25 + dense + reranker) because no single method dominates across all query types.

Watch Out

High recall at the retrieval stage is non-negotiable

A reranker can reorder candidates but cannot recover documents that the retriever missed. If your first-stage retriever returns 100 candidates with \text{Recall}@100 = 0.70, your system can never achieve recall above 0.70. Investing in retriever quality (better BM25 tuning, query expansion, hybrid search) often matters more than a fancier reranker.

Watch Out

NDCG and MAP measure different things

MAP assumes binary relevance and rewards getting all relevant documents to the top. NDCG handles graded relevance (some documents are more relevant than others) and uses a logarithmic discount. Use NDCG when relevance is graded. Use MAP when relevance is binary. Do not compare MAP values across datasets with different relevance scales.

Key Takeaways

  • TF-IDF weights terms by frequency in the document and rarity in the corpus
  • BM25 improves on TF-IDF with term frequency saturation and length normalization
  • Inverted indexes enable sub-linear query processing over large corpora
  • The probability ranking principle says ranking by P(\text{rel} \mid q, d) is optimal under independence
  • Evaluation requires precision, recall, MAP, and NDCG; each captures different aspects
  • Hybrid search (BM25 + dense retrieval + reranking) outperforms any single method
  • Retriever recall is the ceiling for the entire pipeline

Exercises

Exercise (Core)

Problem

A corpus has N = 10{,}000 documents. Term "neural" appears in 50 documents. Term "the" appears in 9,900 documents. Compute the IDF of each term using \text{idf}(t) = \log_{10}(N / \text{df}(t)). Explain why the ratio is intuitive.

Exercise (Core)

Problem

A retrieval system returns 10 documents for a query. Documents at ranks 1, 3, 5, and 8 are relevant. There are 6 relevant documents in total. Compute precision@10, recall@10, and average precision (AP).

Exercise (Advanced)

Problem

In the BM25 formula, analyze the limiting behavior. What happens to the scoring function as k_1 \to 0? As k_1 \to \infty? As b \to 0? As b \to 1? Explain the practical implications of each limit.

Exercise (Advanced)

Problem

You have a retriever with \text{Recall}@100 = 0.85 and a cross-encoder reranker. The reranker is perfect (it ranks all relevant documents in the top-100 above all irrelevant ones). What is the maximum possible \text{Recall}@10 of the full pipeline? What does this imply about where to invest effort?

References

Canonical:

  • Manning, Raghavan, Schutze, Introduction to Information Retrieval (2008), Chapters 1-2, 6-8
  • Robertson and Zaragoza, "The Probabilistic Relevance Framework: BM25 and Beyond" (2009)
  • Robertson, "The Probability Ranking Principle in IR" (1977)

Evaluation:

  • Jarvelin and Kekalainen, "Cumulated Gain-Based Evaluation of IR Techniques" (2002), introduces NDCG

Modern:

  • Karpukhin et al., "Dense Passage Retrieval for Open-Domain Question Answering" (DPR, 2020)
  • Khattab and Zaharia, "ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction" (2020)
  • Cormack et al., "Reciprocal Rank Fusion Outperforms Condorcet and Individual Rank Learning Methods" (2009)

Next Topics

  • Word embeddings: dense vector representations that enable semantic similarity for dense retrieval

Last reviewed: April 2026
