ML Methods
Semantic Search and Embeddings
Dense vector representations for semantic similarity: bi-encoders, cross-encoders, approximate nearest neighbor search, cosine similarity geometry, and the RAG retrieval pipeline.
Why This Matters
Keyword search matches tokens. Semantic search matches meaning. A query about "canine nutrition" should retrieve documents about "dog food" even though they share no words. Dense embedding models make this possible by mapping text to vectors where geometric proximity corresponds to semantic similarity. This is the retrieval backbone of RAG systems, recommendation engines, and multimodal search.
Mental Model
Encode both queries and documents into vectors in $\mathbb{R}^d$. At query time, find the nearest document vectors. The embedding model must be trained so that semantically similar pairs land close together and dissimilar pairs land far apart. The search infrastructure must handle millions or billions of vectors efficiently.
Formal Setup
Dense embedding model
A function mapping variable-length text to a fixed-dimensional vector. Typically implemented as a transformer encoder followed by pooling (CLS token or mean pooling). The dimension is commonly 384, 768, or 1024.
Cosine similarity
For vectors $u, v \in \mathbb{R}^d$:

$$\cos(u, v) = \frac{u \cdot v}{\|u\|_2 \, \|v\|_2}$$

When vectors are $\ell_2$-normalized (as most embedding models produce), cosine similarity equals the dot product $u \cdot v$, and maximizing cosine similarity is equivalent to minimizing Euclidean distance: $\|u - v\|_2^2 = 2 - 2\, u \cdot v$.
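These identities are easy to verify numerically. A quick sketch with NumPy, using random vectors at a typical embedding dimension:

```python
import numpy as np

rng = np.random.default_rng(0)
u, v = rng.normal(size=768), rng.normal(size=768)

# Cosine similarity in its general form.
cos = u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

# After l2-normalization, cosine similarity is just the dot product...
u_n = u / np.linalg.norm(u)
v_n = v / np.linalg.norm(v)
assert np.isclose(cos, u_n @ v_n)

# ...and squared Euclidean distance is 2 - 2 * (dot product), so
# maximizing cosine similarity is minimizing l2 distance.
assert np.isclose(np.sum((u_n - v_n) ** 2), 2 - 2 * (u_n @ v_n))
```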
Bi-Encoders vs Cross-Encoders
Bi-encoders encode query and document independently. The query embedding is computed once and compared against precomputed document embeddings via dot product. This allows precomputation of all document embeddings and sub-linear search at query time. Sentence-BERT (Reimers & Gurevych, 2019) is the canonical bi-encoder for text.
Cross-encoders take the concatenated pair as input and produce a relevance score directly. They are more accurate because they allow token-level interaction between query and document, but they cannot precompute document representations. A cross-encoder must run inference for every (query, document) pair.
The standard pipeline: bi-encoder retrieves top-$k$ candidates (fast, approximate), then cross-encoder reranks them (slow, accurate).
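The shape of this two-stage pipeline can be sketched with NumPy. Here the document embeddings and query are random stand-ins, and `cross_encoder_score` is a hypothetical placeholder for a real cross-encoder forward pass; only the control flow is the point:

```python
import numpy as np

rng = np.random.default_rng(0)
doc_embs = rng.normal(size=(10_000, 384)).astype(np.float32)
doc_embs /= np.linalg.norm(doc_embs, axis=1, keepdims=True)  # precomputed once, offline

q = rng.normal(size=384).astype(np.float32)
q /= np.linalg.norm(q)

def cross_encoder_score(query, doc_id):
    # Stand-in for a real cross-encoder forward pass over the concatenated
    # (query, document) text pair; any pairwise scorer fits this slot.
    return float(doc_embs[doc_id] @ query)

# Stage 1: bi-encoder retrieval -- one matrix-vector product over all docs,
# then top-k by dot product (equals cosine, since everything is normalized).
k = 100
scores = doc_embs @ q
candidates = np.argpartition(-scores, k)[:k]

# Stage 2: cross-encoder reranking -- one scoring call per (query, doc) pair,
# but only over the k candidates rather than all 10,000 documents.
reranked = sorted(candidates, key=lambda d: cross_encoder_score(q, d), reverse=True)
top10 = reranked[:10]
```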
Training Embedding Models
Embedding models are trained with contrastive losses. Given a query $q$, a positive document $d^+$, and negative documents $d_1^-, \dots, d_n^-$:

$$\mathcal{L} = -\log \frac{\exp(\mathrm{sim}(q, d^+)/\tau)}{\exp(\mathrm{sim}(q, d^+)/\tau) + \sum_{i=1}^{n} \exp(\mathrm{sim}(q, d_i^-)/\tau)}$$

where $\tau$ is a temperature parameter. This is the InfoNCE loss. Hard negatives (documents that are similar but not relevant) are critical for learning fine-grained distinctions.
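A minimal NumPy implementation of this loss for a single query, assuming $\ell_2$-normalized vectors so that similarity is a dot product (the positive here is synthetically constructed to be near the query):

```python
import numpy as np

def info_nce(q, d_pos, d_negs, tau=0.05):
    """InfoNCE loss for one query: negative log-softmax of the positive's
    similarity against positive + negatives, at temperature tau.
    Assumes all vectors are l2-normalized, so sim is a dot product."""
    sims = np.concatenate(([q @ d_pos], d_negs @ q)) / tau
    sims -= sims.max()  # subtract max for numerical stability
    return -np.log(np.exp(sims[0]) / np.exp(sims).sum())

rng = np.random.default_rng(0)
norm = lambda x: x / np.linalg.norm(x, axis=-1, keepdims=True)
q = norm(rng.normal(size=128))
d_pos = norm(q + 0.1 * rng.normal(size=128))   # positive: close to the query
d_negs = norm(rng.normal(size=(16, 128)))      # 16 random negatives
loss = info_nce(q, d_pos, d_negs)              # small, since the positive wins easily
```

With random negatives the loss is near zero, which illustrates the point about hard negatives: easy negatives contribute almost no gradient signal.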
Approximate Nearest Neighbor Search
Exact nearest neighbor search in $\mathbb{R}^d$ over $N$ vectors costs $O(Nd)$ per query. For $N$ in the millions, this is too slow. Approximate nearest neighbor (ANN) algorithms trade exactness for speed.
HNSW (Hierarchical Navigable Small World): builds a multi-layer graph where each node connects to nearby vectors. Search starts at the top layer (coarse) and descends to the bottom layer (fine). Query time is $O(\log N)$ in practice.
IVF (Inverted File Index): partitions the vector space into Voronoi cells using k-means. At query time, search only the $n_{\text{probe}}$ nearest cells. Reduces the search space by a factor of roughly $n_{\text{list}} / n_{\text{probe}}$, where $n_{\text{list}}$ is the number of cells.
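The IVF idea fits in a few lines of NumPy. This is a toy sketch, not a production index: k-means is run with a handful of Lloyd iterations, and all sizes are small illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5_000, 64)).astype(np.float32)

# "Train": k-means with a few Lloyd iterations to get n_list coarse centroids.
n_list = 32
centroids = X[rng.choice(len(X), n_list, replace=False)]
for _ in range(10):
    assign = np.argmin(((X[:, None] - centroids[None]) ** 2).sum(-1), axis=1)
    for c in range(n_list):
        if (assign == c).any():
            centroids[c] = X[assign == c].mean(0)
assign = np.argmin(((X[:, None] - centroids[None]) ** 2).sum(-1), axis=1)

# Build inverted lists: cell id -> ids of the vectors in that cell.
inv_lists = {c: np.where(assign == c)[0] for c in range(n_list)}

# Query: scan only the n_probe nearest cells instead of all N vectors.
q = rng.normal(size=64).astype(np.float32)
n_probe = 4
near_cells = np.argsort(((centroids - q) ** 2).sum(-1))[:n_probe]
cand = np.concatenate([inv_lists[c] for c in near_cells])
best = int(cand[np.argmin(((X[cand] - q) ** 2).sum(-1))])
```

The approximation error comes from the true nearest neighbor possibly living in an unprobed cell; raising `n_probe` trades speed back for recall.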
Product Quantization (PQ): compresses each vector by splitting it into $m$ subvectors and quantizing each independently. Reduces memory and allows fast distance computation via lookup tables.
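A toy PQ encoder and lookup-table (asymmetric) distance computation in NumPy, under illustrative sizes (64-dim vectors, 8 subquantizers of 256 centroids each, so each vector compresses from 256 bytes of float32 to 8 bytes of codes):

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, ks = 64, 8, 256                     # dim, subquantizers, centroids each
X = rng.normal(size=(5_000, d)).astype(np.float32)
sub = X.reshape(len(X), m, d // m)        # split each vector into m subvectors

# Train one small codebook per subspace (k-means, a few Lloyd steps).
codebooks = np.empty((m, ks, d // m), dtype=np.float32)
for j in range(m):
    cb = sub[rng.choice(len(X), ks, replace=False), j]
    for _ in range(5):
        a = np.argmin(((sub[:, j, None] - cb[None]) ** 2).sum(-1), axis=1)
        for c in range(ks):
            if (a == c).any():
                cb[c] = sub[a == c, j].mean(0)
    codebooks[j] = cb

# Encode: each vector becomes m one-byte codes (256 bytes -> 8 bytes here).
codes = np.empty((len(X), m), dtype=np.uint8)
for j in range(m):
    codes[:, j] = np.argmin(((sub[:, j, None] - codebooks[j][None]) ** 2).sum(-1), axis=1)

# Query: precompute per-subspace partial distances once, then each vector's
# approximate distance is just m table lookups -- no float decompression.
q = rng.normal(size=d).astype(np.float32).reshape(m, d // m)
tables = ((codebooks - q[:, None]) ** 2).sum(-1)          # shape (m, ks)
approx_dist = tables[np.arange(m), codes].sum(axis=1)     # shape (N,)
nearest = int(np.argmin(approx_dist))
```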
FAISS (Facebook AI Similarity Search) combines these: IVF for coarse partitioning, PQ for compression, and optional HNSW for the coarse quantizer.
Johnson-Lindenstrauss Lemma for Embeddings
Statement
For any set of $n$ points in $\mathbb{R}^d$ and any $\epsilon \in (0, 1)$, there exists a linear map $f: \mathbb{R}^d \to \mathbb{R}^k$ with $k = O(\epsilon^{-2} \log n)$ such that for all pairs $u, v$:

$$(1 - \epsilon)\, \|u - v\|^2 \le \|f(u) - f(v)\|^2 \le (1 + \epsilon)\, \|u - v\|^2$$
Intuition
Pairwise distances are approximately preserved when projecting to $k = O(\epsilon^{-2} \log n)$ dimensions. This is why ANN search works in moderate dimensions: the intrinsic dimensionality of the point set is often much smaller than the ambient dimension $d$.
Proof Sketch
Use a random projection $f(x) = \frac{1}{\sqrt{k}} A x$ where $A \in \mathbb{R}^{k \times d}$ has i.i.d. sub-Gaussian entries. For a fixed pair $(u, v)$, $\|f(u) - f(v)\|^2$ concentrates around $\|u - v\|^2$ by sub-Gaussian tail bounds. A union bound over all $\binom{n}{2}$ pairs gives the required $k = O(\epsilon^{-2} \log n)$.
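The concentration is easy to observe empirically. A sketch with NumPy, projecting a small random point set from a high dimension down to a moderate one and checking how much every pairwise squared distance moves:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 200, 4096, 1024          # n points in d dims, projected to k dims
X = rng.normal(size=(n, d))

# Random Gaussian projection, scaled by 1/sqrt(k) so squared distances
# are preserved in expectation.
A = rng.normal(size=(k, d)) / np.sqrt(k)
Y = X @ A.T

def pdist2(Z):
    # All pairwise squared Euclidean distances via the expansion
    # ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b
    sq = (Z ** 2).sum(1)
    return sq[:, None] + sq[None] - 2 * Z @ Z.T

D, Dp = pdist2(X), pdist2(Y)
mask = ~np.eye(n, dtype=bool)      # ignore the zero diagonal
ratios = Dp[mask] / D[mask]        # concentrated around 1 with modest spread
```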
Why It Matters
JL justifies dimensionality reduction for search. If your embeddings are in $\mathbb{R}^d$ but you have $n$ documents, you can project to $O(\epsilon^{-2} \log n)$ dimensions with small distortion, drastically reducing memory and search cost. PQ implicitly exploits this by quantizing in subspaces.
Failure Mode
JL preserves Euclidean distances, not cosine similarities directly. For normalized vectors, the two are monotonically related, so it works. For unnormalized vectors, cosine similarity involves division by norms, and small norm distortions can cause large cosine distortions. Always normalize before applying random projection for cosine search.
The RAG Retrieval Pipeline
Retrieval-Augmented Generation (RAG) follows this flow:
- Index: chunk documents, embed each chunk with a bi-encoder, store vectors in an ANN index
- Query: embed the user query with the same bi-encoder
- Search: find the top-$k$ nearest chunks via ANN search
- Rerank (optional): score the top-$k$ candidates with a cross-encoder
- Generate: feed retrieved chunks as context to a language model
The quality bottleneck is usually retrieval, not generation. If the right documents are not in the top-$k$, no amount of generation quality will compensate.
Common Confusions
Cosine similarity is not a distance metric
Cosine similarity ranges from $-1$ to $1$ and does not satisfy the triangle inequality. Cosine distance ($1 - \cos$) is not a metric either, though the angular distance $\arccos(\cos(u, v))$ is. ANN libraries typically work with inner product or $\ell_2$ distance, not cosine similarity directly. Normalizing vectors converts cosine search to inner product search.
More dimensions does not always mean better embeddings
Embedding dimension is a design choice with diminishing returns. Going from 128 to 768 dimensions usually helps. Going from 768 to 4096 often does not justify the increased memory and search cost. The effective intrinsic dimension of the embedding manifold is typically much smaller than the ambient dimension.
Bi-encoder similarity is not the same as relevance
Two documents can be highly similar (both about dogs) without either being relevant to a specific query (about dog allergies). Similarity is symmetric; relevance is asymmetric. Cross-encoders model relevance better because they condition on the specific query-document pair.
Summary
- Bi-encoders enable precomputation and fast search; cross-encoders are more accurate but require per-pair inference
- Cosine similarity on normalized vectors equals dot product; maximizing it equals minimizing $\ell_2$ distance
- ANN algorithms (HNSW, IVF, PQ) make billion-scale search practical
- The JL lemma justifies dimensionality reduction for search
- RAG quality is bottlenecked by retrieval quality, not generation quality
- Hard negatives are critical for training good embedding models
Exercises
Problem
If embedding vectors are $\ell_2$-normalized, show that maximizing cosine similarity between $u$ and $v$ is equivalent to minimizing $\|u - v\|_2$.
Problem
You have $N$ document embeddings in $\mathbb{R}^d$. Each vector stored in float32 takes $4d$ bytes. What is the total memory? If you use PQ with 96 subquantizers and 256 centroids per subquantizer (1 byte per subvector code), what is the compressed memory per vector, and what is the total?
References
Canonical:
- Johnson & Lindenstrauss, "Extensions of Lipschitz mappings into a Hilbert space," AMS (1984)
- Reimers & Gurevych, "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks," EMNLP 2019
Current:
- Johnson, Douze, Jégou, "Billion-scale similarity search with GPUs," IEEE Transactions on Big Data, 2021
- Muennighoff et al., "MTEB: Massive Text Embedding Benchmark," EACL 2023
- Bishop, Pattern Recognition and Machine Learning (2006), Chapters 1-14
Next Topics
From semantic search, the natural continuations are:
- Multimodal RAG: extending retrieval to images, tables, and code
- Context engineering: how to select and arrange retrieved context for LLMs
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Word Embeddings (Layer 2)