
ML Methods

Semantic Search and Embeddings

Dense vector representations for semantic similarity: bi-encoders, cross-encoders, approximate nearest neighbor search, cosine similarity geometry, and the RAG retrieval pipeline.



Why This Matters

Keyword search matches tokens. Semantic search matches meaning. A query about "canine nutrition" should retrieve documents about "dog food" even though they share no words. Dense embedding models make this possible by mapping text to vectors where geometric proximity corresponds to semantic similarity. This is the retrieval backbone of RAG systems, recommendation engines, and multimodal search.

Mental Model

Encode both queries and documents into vectors in $\mathbb{R}^d$. At query time, find the nearest document vectors. The embedding model must be trained so that semantically similar pairs land close together and dissimilar pairs land far apart. The search infrastructure must handle millions or billions of vectors efficiently.

Formal Setup

Definition

Dense embedding model

A function $f_\theta: \text{Text} \to \mathbb{R}^d$ mapping variable-length text to a fixed-dimensional vector. Typically implemented as a transformer encoder followed by pooling (CLS token or mean pooling). The dimension $d$ is commonly 384, 768, or 1024.
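As a concrete illustration, here is a minimal NumPy sketch of mean pooling over token embeddings. The values are random stand-ins; in a real model they would be the transformer's last hidden states:

```python
import numpy as np

# Hypothetical token embeddings for one sentence: 5 tokens, d = 8.
# A real model would supply these from the transformer's last layer.
rng = np.random.default_rng(0)
token_embeddings = rng.normal(size=(5, 8))
attention_mask = np.array([1, 1, 1, 1, 0])  # last position is padding

def mean_pool(token_embs, mask):
    """Average token vectors, ignoring padded positions."""
    mask = mask[:, None].astype(token_embs.dtype)
    return (token_embs * mask).sum(axis=0) / mask.sum()

sentence_embedding = mean_pool(token_embeddings, attention_mask)
assert sentence_embedding.shape == (8,)
```

Masking before averaging matters: including padding positions would drag the sentence vector toward whatever the model emits for pad tokens.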

Definition

Cosine similarity

For vectors $u, v \in \mathbb{R}^d$:

$$\text{sim}(u, v) = \frac{u \cdot v}{\|u\| \|v\|}$$

When vectors are $L_2$-normalized (as most embedding models produce), cosine similarity equals the dot product $u \cdot v$, and maximizing cosine similarity is equivalent to minimizing Euclidean distance: $\|u - v\|^2 = 2(1 - u \cdot v)$.
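Both identities are easy to check numerically; this NumPy snippet verifies them on random normalized vectors:

```python
import numpy as np

rng = np.random.default_rng(42)
u = rng.normal(size=768)
v = rng.normal(size=768)

# L2-normalize, as most embedding models do.
u /= np.linalg.norm(u)
v /= np.linalg.norm(v)

cosine = (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))
assert np.isclose(cosine, u @ v)                          # cosine == dot product
assert np.isclose(np.sum((u - v) ** 2), 2 * (1 - u @ v))  # ||u-v||^2 = 2(1 - u.v)
```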

Bi-Encoders vs Cross-Encoders

Bi-encoders encode query and document independently. The query embedding $f_\theta(q)$ is computed once and compared against precomputed document embeddings via dot product. This allows precomputation of all document embeddings and sub-linear search at query time. Sentence-BERT (Reimers & Gurevych, 2019) is the canonical bi-encoder for text.

Cross-encoders take the concatenated pair $(q, d)$ as input and produce a relevance score directly. They are more accurate because they allow token-level interaction between query and document, but they cannot precompute document representations. A cross-encoder must run inference for every (query, document) pair.

The standard pipeline: a bi-encoder retrieves the top-$k$ candidates (fast, approximate), then a cross-encoder reranks them (slow, accurate).
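A sketch of this retrieve-then-rerank pipeline, using random NumPy vectors as stand-ins for the two models; the `cross_score` stub is a placeholder for what would be a full cross-encoder forward pass over the text pair:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_docs = 64, 1000

# Precomputed "document embeddings" (stand-ins for bi-encoder outputs),
# L2-normalized so the dot product equals cosine similarity.
doc_embs = rng.normal(size=(n_docs, d))
doc_embs /= np.linalg.norm(doc_embs, axis=1, keepdims=True)

def bi_retrieve(query_emb, k):
    """Fast stage: one dot product against every precomputed embedding."""
    return np.argsort(-(doc_embs @ query_emb))[:k]

def cross_score(query_emb, doc_id):
    """Slow-stage stub: a real cross-encoder runs a transformer pass
    over the concatenated (query, document) text for each pair."""
    return float(doc_embs[doc_id] @ query_emb)

query = rng.normal(size=d)
query /= np.linalg.norm(query)

candidates = bi_retrieve(query, k=50)                          # cheap, approximate
top_10 = sorted(candidates, key=lambda i: -cross_score(query, i))[:10]
```

The key asymmetry: `bi_retrieve` scores all documents with one matrix-vector product, while `cross_score` is invoked once per candidate, which is why it only runs on the short list.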

Training Embedding Models

Embedding models are trained with contrastive losses. Given a query $q$, a positive document $d^+$, and negative documents $d^-_1, \ldots, d^-_N$:

$$\mathcal{L} = -\log \frac{\exp(\text{sim}(q, d^+) / \tau)}{\exp(\text{sim}(q, d^+) / \tau) + \sum_{i} \exp(\text{sim}(q, d^-_i) / \tau)}$$

where $\tau$ is a temperature parameter. This is the InfoNCE loss. Hard negatives (documents that are similar but not relevant) are critical for learning fine-grained distinctions.
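A direct NumPy transcription of the loss on synthetic normalized vectors. The temperature 0.05 and the construction of the positive are illustrative choices, not values from the text:

```python
import numpy as np

def info_nce(q, d_pos, d_negs, tau=0.05):
    """InfoNCE loss for one (query, positive, negatives) group.
    All vectors are assumed L2-normalized, so sim is the dot product."""
    sims = np.concatenate([[q @ d_pos], d_negs @ q]) / tau
    sims -= sims.max()                            # numerical stability
    return -np.log(np.exp(sims[0]) / np.exp(sims).sum())

rng = np.random.default_rng(1)
unit = lambda x: x / np.linalg.norm(x)
q = unit(rng.normal(size=128))
d_pos = unit(q + 0.1 * rng.normal(size=128))      # correlated with the query
d_negs = np.stack([unit(rng.normal(size=128)) for _ in range(7)])

loss = info_nce(q, d_pos, d_negs)
# The positive far outranks the random negatives, so the loss is near zero.
```

Random negatives like these are easy, which is exactly why the loss collapses toward zero here; hard negatives keep the gradient informative.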

Approximate Nearest Neighbor Search

Exact nearest neighbor search in $\mathbb{R}^d$ over $N$ vectors costs $O(Nd)$ per query. For $N$ in the millions, this is too slow. Approximate nearest neighbor (ANN) algorithms trade exactness for speed.

HNSW (Hierarchical Navigable Small World): builds a multi-layer graph where each node connects to nearby vectors. Search starts at the top layer (coarse) and descends to the bottom layer (fine). Query time is $O(\log N)$ in practice.

IVF (Inverted File Index): partitions the vector space into $C$ Voronoi cells using k-means. At query time, search only the $p$ nearest cells. This reduces the search space by a factor of roughly $C/p$.
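A toy NumPy version of IVF, with a few Lloyd iterations as the coarse quantizer; all parameter values here are arbitrary illustration, not recommendations:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, C, n_probe = 32, 2000, 20, 5
vectors = rng.normal(size=(n, d))

# Coarse quantizer: k-means via a few Lloyd iterations.
centroids = vectors[rng.choice(n, C, replace=False)].copy()
for _ in range(10):
    assign = np.argmin(((vectors[:, None] - centroids[None]) ** 2).sum(-1), axis=1)
    for c in range(C):
        if (assign == c).any():
            centroids[c] = vectors[assign == c].mean(axis=0)

# Final assignment and inverted lists: vector ids grouped by cell.
assign = np.argmin(((vectors[:, None] - centroids[None]) ** 2).sum(-1), axis=1)
inverted_lists = {c: np.where(assign == c)[0] for c in range(C)}

def ivf_search(query):
    """Scan only the n_probe cells whose centroids are nearest the query."""
    cells = np.argsort(((centroids - query) ** 2).sum(-1))[:n_probe]
    cand = np.concatenate([inverted_lists[c] for c in cells])
    return cand[np.argmin(((vectors[cand] - query) ** 2).sum(-1))]
```

Here only about $n \cdot p / C$ of the database is scanned per query; the failure mode is a true neighbor sitting in a cell that was not probed, which is why `n_probe` trades recall for speed.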

Product Quantization (PQ): compresses each vector by splitting it into subvectors and quantizing each independently. Reduces memory and allows fast distance computation via lookup tables.
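A compact PQ sketch in NumPy. For brevity it samples codebook entries from the data rather than running k-means per subspace, which real PQ implementations do:

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, ks = 64, 8, 16          # 8 subvectors of 8 dims, 16 centroids each
sub = d // m
data = rng.normal(size=(2000, d))

# One small codebook per subspace (sampled rows stand in for k-means centroids).
codebooks = np.stack([data[rng.choice(2000, ks, replace=False), i*sub:(i+1)*sub]
                      for i in range(m)])

def encode(x):
    """Each vector becomes m one-byte codes: nearest centroid per subspace."""
    return np.array([np.argmin(((codebooks[i] - x[i*sub:(i+1)*sub]) ** 2).sum(-1))
                     for i in range(m)], dtype=np.uint8)

codes = np.stack([encode(x) for x in data])

def adc_distances(query):
    """Asymmetric distance: precompute an (m, ks) lookup table for the query,
    then each database distance is just m table lookups summed."""
    table = ((codebooks - query.reshape(m, 1, sub)) ** 2).sum(-1)
    return table[np.arange(m), codes].sum(-1)
```

In this toy setup each vector shrinks from $64 \times 4 = 256$ float32 bytes to 8 code bytes, and query-time distances never touch the original vectors.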

FAISS (Facebook AI Similarity Search) combines these: IVF for coarse partitioning, PQ for compression, and optional HNSW for the coarse quantizer.

Lemma

Johnson-Lindenstrauss Lemma for Embeddings

Statement

For any set of $n$ points in $\mathbb{R}^d$ and any $\epsilon \in (0,1)$, there exists a linear map $f: \mathbb{R}^d \to \mathbb{R}^k$ with $k = O(\epsilon^{-2} \log n)$ such that for all pairs $u, v$:

$$(1-\epsilon)\|u - v\|^2 \leq \|f(u) - f(v)\|^2 \leq (1+\epsilon)\|u - v\|^2$$

Intuition

Pairwise distances are approximately preserved when projecting to $O(\log n / \epsilon^2)$ dimensions. This is why ANN search works in moderate dimensions: the intrinsic dimensionality of the point set is often much smaller than the ambient dimension $d$.

Proof Sketch

Use a random projection matrix $f = (1/\sqrt{k})R$ where $R$ has i.i.d. sub-Gaussian entries. For a fixed pair, $\|f(u-v)\|^2$ concentrates around $\|u-v\|^2$ by sub-Gaussian tail bounds. A union bound over all $\binom{n}{2}$ pairs gives the required $k$.
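The concentration is easy to observe empirically with a Gaussian random projection; the sizes below are arbitrary illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 200, 768, 128
X = rng.normal(size=(n, d))

# Random Gaussian projection scaled by 1/sqrt(k).
R = rng.normal(size=(d, k)) / np.sqrt(k)
Y = X @ R

def pairwise_sq_dists(Z):
    sq = (Z ** 2).sum(-1)
    return sq[:, None] + sq[None, :] - 2 * (Z @ Z.T)

mask = ~np.eye(n, dtype=bool)                 # ignore zero self-distances
ratios = pairwise_sq_dists(Y)[mask] / pairwise_sq_dists(X)[mask]
# Ratios of projected to original squared distances concentrate around 1;
# the spread shrinks as k grows, matching the k = O(log n / eps^2) bound.
```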

Why It Matters

JL justifies dimensionality reduction for search. If your embeddings live in $\mathbb{R}^{768}$ but you have $10^6$ documents, you can project to $\mathbb{R}^{128}$ with small distortion, drastically reducing memory and search cost. PQ implicitly exploits this by quantizing in subspaces.

Failure Mode

JL preserves Euclidean distances, not cosine similarities directly. For normalized vectors, the two are monotonically related, so it works. For unnormalized vectors, cosine similarity involves division by norms, and small norm distortions can cause large cosine distortions. Always normalize before applying random projection for cosine search.

The RAG Retrieval Pipeline

Retrieval-Augmented Generation (RAG) follows this flow:

  1. Index: chunk documents, embed each chunk with a bi-encoder, store vectors in an ANN index
  2. Query: embed the user query with the same bi-encoder
  3. Search: find the top-$k$ nearest chunks via ANN search
  4. Rerank (optional): score the top-$k$ with a cross-encoder
  5. Generate: feed retrieved chunks as context to a language model

The quality bottleneck is usually retrieval, not generation. If the right documents are not in the top-$k$, no amount of generation quality will compensate.
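The five steps can be sketched end to end. The character-trigram `embed` below is a toy stand-in for a trained bi-encoder such as Sentence-BERT, used only so the example is self-contained:

```python
import numpy as np
import zlib

# Toy stand-in for a learned bi-encoder: hash character trigrams into a
# fixed-size bag-of-trigrams vector and L2-normalize. A real pipeline
# uses the same trained model for both indexing and querying.
def embed(text, d=4096):
    v = np.zeros(d)
    t = f"  {text.lower()}  "
    for i in range(len(t) - 2):
        v[zlib.crc32(t[i:i + 3].encode()) % d] += 1.0
    return v / np.linalg.norm(v)

# 1. Index: embed each chunk (a real system stores these in an ANN index).
chunks = ["dogs need balanced nutrition",
          "cats sleep most of the day",
          "puppy food contains extra protein"]
index = np.stack([embed(c) for c in chunks])

# 2-3. Embed the query with the same encoder; search by dot product.
query_vec = embed("dog food")
top_k = np.argsort(-(index @ query_vec))[:2]
retrieved = [chunks[i] for i in top_k]
# 4-5. The retrieved chunks would be reranked and passed to the LLM as context.
```

Even this crude encoder surfaces the two dog-related chunks, but note it only works through surface overlap ("food", "dog"); a learned embedding is what makes "canine nutrition" retrieve "dog food".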

Common Confusions

Watch Out

Cosine similarity is not a distance metric

Cosine similarity ranges from $-1$ to $1$ and does not satisfy the triangle inequality. Cosine distance $(1 - \text{sim})$ is not a proper metric either; the angular distance $\arccos(\text{sim})$ is. ANN libraries typically work with inner product or $L_2$ distance, not cosine similarity directly. Normalizing vectors converts cosine search to inner product search.

Watch Out

More dimensions does not always mean better embeddings

Embedding dimension is a design choice with diminishing returns. Going from 128 to 768 dimensions usually helps. Going from 768 to 4096 often does not justify the increased memory and search cost. The effective intrinsic dimension of the embedding manifold is typically much smaller than the ambient dimension.

Watch Out

Bi-encoder similarity is not the same as relevance

Two documents can be highly similar (both about dogs) without either being relevant to a specific query (about dog allergies). Similarity is symmetric; relevance is asymmetric. Cross-encoders model relevance better because they condition on the specific query-document pair.

Summary

  • Bi-encoders enable precomputation and fast search; cross-encoders are more accurate but require per-pair inference
  • Cosine similarity on normalized vectors equals the dot product; maximizing it equals minimizing $L_2$ distance
  • ANN algorithms (HNSW, IVF, PQ) make billion-scale search practical
  • The JL lemma justifies dimensionality reduction for search
  • RAG quality is bottlenecked by retrieval quality, not generation quality
  • Hard negatives are critical for training good embedding models

Exercises

ExerciseCore

Problem

If embedding vectors are $L_2$-normalized, show that maximizing the cosine similarity between $u$ and $v$ is equivalent to minimizing $\|u - v\|_2^2$.

ExerciseAdvanced

Problem

You have $10^9$ document embeddings in $\mathbb{R}^{768}$. Each vector stored in float32 takes $768 \times 4 = 3072$ bytes. What is the total memory? If you use PQ with 96 subquantizers and 256 centroids per subquantizer (1 byte per subvector), what is the compressed memory?

References

Canonical:

  • Johnson & Lindenstrauss, "Extensions of Lipschitz mappings into a Hilbert space," AMS (1984)
  • Reimers & Gurevych, "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks," EMNLP 2019

Current:

  • Johnson, Douze, Jégou, "Billion-scale similarity search with GPUs," IEEE Transactions on Big Data, 2021

  • Muennighoff et al., "MTEB: Massive Text Embedding Benchmark," EACL 2023

  • Bishop, Pattern Recognition and Machine Learning (2006), Chapters 1-14

Last reviewed: April 2026
