ML Methods
Semantic Search and Embeddings
Dense vector representations for semantic similarity: bi-encoders, cross-encoders, approximate nearest neighbor search, cosine similarity geometry, and the RAG retrieval pipeline.
Why This Matters
Keyword search matches tokens. Semantic search matches meaning. A query about "canine nutrition" should retrieve documents about "dog food" even though they share no words. Dense embedding models make this possible by mapping text to vectors where geometric proximity corresponds to semantic similarity. This is the retrieval backbone of RAG systems, recommendation engines, and multimodal search.
Mental Model
Encode both queries and documents into vectors in $\mathbb{R}^d$. At query time, find the nearest document vectors. The embedding model must be trained so that semantically similar pairs land close together and dissimilar pairs land far apart. The search infrastructure must handle millions or billions of vectors efficiently.
Formal Setup
Dense embedding model
A function mapping variable-length text to a fixed-dimensional vector. Typically implemented as a transformer encoder followed by pooling (CLS token or mean pooling). The dimension is commonly 384, 768, or 1024.
Cosine similarity
For vectors $u, v \in \mathbb{R}^d$:

$$\cos(u, v) = \frac{u \cdot v}{\|u\|_2 \, \|v\|_2}$$

When vectors are $\ell_2$-normalized (as most embedding models produce), cosine similarity equals the dot product $u \cdot v$, and maximizing cosine similarity is equivalent to minimizing Euclidean distance: $\|u - v\|_2^2 = 2 - 2\, u \cdot v$.
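These identities are easy to verify numerically. A quick sketch with NumPy, using random vectors at a typical embedding dimension:

```python
import numpy as np

rng = np.random.default_rng(0)
u, v = rng.normal(size=768), rng.normal(size=768)

# Cosine similarity in its general form.
cos = u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

# After l2-normalization, cosine similarity is just the dot product...
u_n = u / np.linalg.norm(u)
v_n = v / np.linalg.norm(v)
assert np.isclose(cos, u_n @ v_n)

# ...and squared Euclidean distance is 2 - 2 * (dot product), so
# maximizing cosine similarity is minimizing l2 distance.
assert np.isclose(np.sum((u_n - v_n) ** 2), 2 - 2 * (u_n @ v_n))
```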
Bi-Encoders vs Cross-Encoders
Bi-encoders encode query and document independently. The query embedding is computed once and compared against precomputed document embeddings via dot product. This allows precomputation of all document embeddings and sub-linear search at query time. Sentence-BERT (Reimers & Gurevych, 2019) is the canonical bi-encoder for text.
Cross-encoders take the concatenated pair as input and produce a relevance score directly. They are more accurate because they allow token-level interaction between query and document, but they cannot precompute document representations. A cross-encoder must run inference for every (query, document) pair.
The standard pipeline: bi-encoder retrieves top-$k$ candidates (fast, approximate), then cross-encoder reranks them (slow, accurate).
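The shape of this two-stage pipeline can be sketched with NumPy. Here the document embeddings and query are random stand-ins, and `cross_encoder_score` is a hypothetical placeholder for a real cross-encoder forward pass; only the control flow is the point:

```python
import numpy as np

rng = np.random.default_rng(0)
doc_embs = rng.normal(size=(10_000, 384)).astype(np.float32)
doc_embs /= np.linalg.norm(doc_embs, axis=1, keepdims=True)  # precomputed once, offline

q = rng.normal(size=384).astype(np.float32)
q /= np.linalg.norm(q)

def cross_encoder_score(query, doc_id):
    # Stand-in for a real cross-encoder forward pass over the concatenated
    # (query, document) text pair; any pairwise scorer fits this slot.
    return float(doc_embs[doc_id] @ query)

# Stage 1: bi-encoder retrieval -- one matrix-vector product over all docs,
# then top-k by dot product (equals cosine, since everything is normalized).
k = 100
scores = doc_embs @ q
candidates = np.argpartition(-scores, k)[:k]

# Stage 2: cross-encoder reranking -- one scoring call per (query, doc) pair,
# but only over the k candidates rather than all 10,000 documents.
reranked = sorted(candidates, key=lambda d: cross_encoder_score(q, d), reverse=True)
top10 = reranked[:10]
```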
Training Embedding Models
Embedding models are trained with contrastive losses. Given a query $q$, a positive document $d^+$, and negative documents $d_1^-, \dots, d_n^-$:

$$\mathcal{L} = -\log \frac{\exp(\mathrm{sim}(q, d^+)/\tau)}{\exp(\mathrm{sim}(q, d^+)/\tau) + \sum_{i=1}^{n} \exp(\mathrm{sim}(q, d_i^-)/\tau)}$$

where $\tau$ is a temperature parameter. This is the InfoNCE loss. Hard negatives (documents that are similar but not relevant) are critical for learning fine-grained distinctions.
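A minimal NumPy implementation of this loss for a single query, assuming $\ell_2$-normalized vectors so that similarity is a dot product (the positive here is synthetically constructed to be near the query):

```python
import numpy as np

def info_nce(q, d_pos, d_negs, tau=0.05):
    """InfoNCE loss for one query: negative log-softmax of the positive's
    similarity against positive + negatives, at temperature tau.
    Assumes all vectors are l2-normalized, so sim is a dot product."""
    sims = np.concatenate(([q @ d_pos], d_negs @ q)) / tau
    sims -= sims.max()  # subtract max for numerical stability
    return -np.log(np.exp(sims[0]) / np.exp(sims).sum())

rng = np.random.default_rng(0)
norm = lambda x: x / np.linalg.norm(x, axis=-1, keepdims=True)
q = norm(rng.normal(size=128))
d_pos = norm(q + 0.1 * rng.normal(size=128))   # positive: close to the query
d_negs = norm(rng.normal(size=(16, 128)))      # 16 random negatives
loss = info_nce(q, d_pos, d_negs)              # small, since the positive wins easily
```

With random negatives the loss is near zero, which illustrates the point about hard negatives: easy negatives contribute almost no gradient signal.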
Approximate Nearest Neighbor Search
Exact nearest neighbor search in $\mathbb{R}^d$ over $N$ vectors costs $O(Nd)$ per query. For $N$ in the millions, this is too slow. Approximate nearest neighbor (ANN) algorithms trade exactness for speed.
HNSW (Hierarchical Navigable Small World): builds a multi-layer graph where each node connects to nearby vectors. Search starts at the top layer (coarse) and descends to the bottom layer (fine). Query time is $O(\log N)$ in practice.
IVF (Inverted File Index): partitions the vector space into Voronoi cells using k-means. At query time, search only the $n_{\text{probe}}$ nearest cells. Reduces the search space by a factor of roughly $n_{\text{list}} / n_{\text{probe}}$, where $n_{\text{list}}$ is the number of cells.
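The IVF idea fits in a few lines of NumPy. This is a toy sketch, not a production index: k-means is run with a handful of Lloyd iterations, and all sizes are small illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5_000, 64)).astype(np.float32)

# "Train": k-means with a few Lloyd iterations to get n_list coarse centroids.
n_list = 32
centroids = X[rng.choice(len(X), n_list, replace=False)]
for _ in range(10):
    assign = np.argmin(((X[:, None] - centroids[None]) ** 2).sum(-1), axis=1)
    for c in range(n_list):
        if (assign == c).any():
            centroids[c] = X[assign == c].mean(0)
assign = np.argmin(((X[:, None] - centroids[None]) ** 2).sum(-1), axis=1)

# Build inverted lists: cell id -> ids of the vectors in that cell.
inv_lists = {c: np.where(assign == c)[0] for c in range(n_list)}

# Query: scan only the n_probe nearest cells instead of all N vectors.
q = rng.normal(size=64).astype(np.float32)
n_probe = 4
near_cells = np.argsort(((centroids - q) ** 2).sum(-1))[:n_probe]
cand = np.concatenate([inv_lists[c] for c in near_cells])
best = int(cand[np.argmin(((X[cand] - q) ** 2).sum(-1))])
```

The approximation error comes from the true nearest neighbor possibly living in an unprobed cell; raising `n_probe` trades speed back for recall.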
Product Quantization (PQ): compresses each vector by splitting it into $m$ subvectors and quantizing each independently. Reduces memory and allows fast distance computation via lookup tables.
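A toy PQ encoder and lookup-table (asymmetric) distance computation in NumPy, under illustrative sizes (64-dim vectors, 8 subquantizers of 256 centroids each, so each vector compresses from 256 bytes of float32 to 8 bytes of codes):

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, ks = 64, 8, 256                     # dim, subquantizers, centroids each
X = rng.normal(size=(5_000, d)).astype(np.float32)
sub = X.reshape(len(X), m, d // m)        # split each vector into m subvectors

# Train one small codebook per subspace (k-means, a few Lloyd steps).
codebooks = np.empty((m, ks, d // m), dtype=np.float32)
for j in range(m):
    cb = sub[rng.choice(len(X), ks, replace=False), j]
    for _ in range(5):
        a = np.argmin(((sub[:, j, None] - cb[None]) ** 2).sum(-1), axis=1)
        for c in range(ks):
            if (a == c).any():
                cb[c] = sub[a == c, j].mean(0)
    codebooks[j] = cb

# Encode: each vector becomes m one-byte codes (256 bytes -> 8 bytes here).
codes = np.empty((len(X), m), dtype=np.uint8)
for j in range(m):
    codes[:, j] = np.argmin(((sub[:, j, None] - codebooks[j][None]) ** 2).sum(-1), axis=1)

# Query: precompute per-subspace partial distances once, then each vector's
# approximate distance is just m table lookups -- no float decompression.
q = rng.normal(size=d).astype(np.float32).reshape(m, d // m)
tables = ((codebooks - q[:, None]) ** 2).sum(-1)          # shape (m, ks)
approx_dist = tables[np.arange(m), codes].sum(axis=1)     # shape (N,)
nearest = int(np.argmin(approx_dist))
```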
FAISS (Facebook AI Similarity Search) combines these: IVF for coarse partitioning, PQ for compression, and optional HNSW for the coarse quantizer.
Johnson-Lindenstrauss Lemma for Embeddings
Statement
For any set of $n$ points in $\mathbb{R}^d$ and any $\epsilon \in (0, 1)$, there exists a linear map $f: \mathbb{R}^d \to \mathbb{R}^k$ with $k = O(\epsilon^{-2} \log n)$ such that for all pairs $u, v$:

$$(1 - \epsilon)\, \|u - v\|^2 \le \|f(u) - f(v)\|^2 \le (1 + \epsilon)\, \|u - v\|^2$$
Intuition
Pairwise distances are approximately preserved when projecting to $k = O(\epsilon^{-2} \log n)$ dimensions. This is why ANN search works in moderate dimensions: the intrinsic dimensionality of the point set is often much smaller than the ambient dimension $d$.
Proof Sketch
Use a random projection $f(x) = \frac{1}{\sqrt{k}} A x$ where $A \in \mathbb{R}^{k \times d}$ has i.i.d. sub-Gaussian entries. For a fixed pair $(u, v)$, $\|f(u) - f(v)\|^2$ concentrates around $\|u - v\|^2$ by sub-Gaussian tail bounds. A union bound over all $\binom{n}{2}$ pairs gives the required $k = O(\epsilon^{-2} \log n)$.
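The concentration is easy to observe empirically. A sketch with NumPy, projecting a small random point set from a high dimension down to a moderate one and checking how much every pairwise squared distance moves:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 200, 4096, 1024          # n points in d dims, projected to k dims
X = rng.normal(size=(n, d))

# Random Gaussian projection, scaled by 1/sqrt(k) so squared distances
# are preserved in expectation.
A = rng.normal(size=(k, d)) / np.sqrt(k)
Y = X @ A.T

def pdist2(Z):
    # All pairwise squared Euclidean distances via the expansion
    # ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b
    sq = (Z ** 2).sum(1)
    return sq[:, None] + sq[None] - 2 * Z @ Z.T

D, Dp = pdist2(X), pdist2(Y)
mask = ~np.eye(n, dtype=bool)      # ignore the zero diagonal
ratios = Dp[mask] / D[mask]        # concentrated around 1 with modest spread
```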
Why It Matters
JL justifies dimensionality reduction for search. If your embeddings are in $\mathbb{R}^d$ but you have $n$ documents, you can project to $O(\epsilon^{-2} \log n)$ dimensions with small distortion, drastically reducing memory and search cost. PQ implicitly exploits this by quantizing in subspaces.
Failure Mode
JL preserves Euclidean distances, not cosine similarities directly. For normalized vectors, the two are monotonically related, so it works. For unnormalized vectors, cosine similarity involves division by norms, and small norm distortions can cause large cosine distortions. Always normalize before applying random projection for cosine search.
The RAG Retrieval Pipeline
Retrieval-Augmented Generation (RAG) follows this flow:
- Index: chunk documents, embed each chunk with a bi-encoder, store vectors in an ANN index
- Query: embed the user query with the same bi-encoder
- Search: find the top-$k$ nearest chunks via ANN search
- Rerank (optional): score the top-$k$ candidates with a cross-encoder
- Generate: feed retrieved chunks as context to a language model
The quality bottleneck is usually retrieval, not generation. If the right documents are not in the top-$k$, no amount of generation quality will compensate.
Common Confusions
Cosine similarity is not a distance metric
Cosine similarity ranges from $-1$ to $1$ and does not satisfy the triangle inequality. Cosine distance ($1 - \cos$) is not a metric either, though the angular distance $\arccos(\cos(u, v))$ is. ANN libraries typically work with inner product or $\ell_2$ distance, not cosine similarity directly. Normalizing vectors converts cosine search to inner product search.
More dimensions does not always mean better embeddings
Embedding dimension is a design choice with diminishing returns. Going from 128 to 768 dimensions usually helps. Going from 768 to 4096 often does not justify the increased memory and search cost. The effective intrinsic dimension of the embedding manifold is typically much smaller than the ambient dimension.
Bi-encoder similarity is not the same as relevance
Two documents can be highly similar (both about dogs) without either being relevant to a specific query (about dog allergies). Similarity is symmetric; relevance is asymmetric. Cross-encoders model relevance better because they condition on the specific query-document pair.
Summary
- Bi-encoders enable precomputation and fast search; cross-encoders are more accurate but require per-pair inference
- Cosine similarity on normalized vectors equals dot product; maximizing it equals minimizing $\ell_2$ distance
- ANN algorithms (HNSW, IVF, PQ) make billion-scale search practical
- The JL lemma justifies dimensionality reduction for search
- RAG quality is bottlenecked by retrieval quality, not generation quality
- Hard negatives are critical for training good embedding models
Exercises
Problem
If embedding vectors are $\ell_2$-normalized, show that maximizing cosine similarity between $u$ and $v$ is equivalent to minimizing $\|u - v\|_2$.
Problem
You have $N$ document embeddings in $\mathbb{R}^d$. Each vector stored in float32 takes $4d$ bytes. What is the total memory? If you use PQ with 96 subquantizers and 256 centroids per subquantizer (1 byte per subvector code), what is the compressed memory per vector, and what is the total?
References
Canonical:
- Johnson & Lindenstrauss, "Extensions of Lipschitz mappings into a Hilbert space," AMS (1984)
- Reimers & Gurevych, "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks," EMNLP 2019
Current:
- Johnson, Douze, Jégou, "Billion-scale similarity search with GPUs," IEEE Transactions on Big Data, 2021
- Muennighoff et al., "MTEB: Massive Text Embedding Benchmark," EACL 2023
- Bishop, Pattern Recognition and Machine Learning (2006), Chapters 1-14
Next Topics
From semantic search, the natural continuations are:
- Multimodal RAG: extending retrieval to images, tables, and code
- Context engineering: how to select and arrange retrieved context for LLMs
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Word Embeddings (Layer 2)