NLP Foundations
Distributional Semantics
You shall know a word by the company it keeps. The distributional hypothesis, co-occurrence matrices, PMI, SVD-based embeddings, and the mathematical bridge from linguistics to word vectors and transformers.
Why This Matters
Most of modern NLP rests on a single empirical observation: words that appear in similar contexts tend to have similar meanings. This is the distributional hypothesis. Every word embedding method, from LSA to word2vec to GloVe to the contextual representations in transformers, operationalizes this hypothesis in a specific mathematical way. Understanding the distributional semantics pipeline (co-occurrence matrix, PMI weighting, low-rank factorization) clarifies what these models actually learn and what they cannot capture.
The key result connecting "old-school" count-based methods to "new-school" prediction-based methods is the Levy and Goldberg (2014) theorem: word2vec's skip-gram with negative sampling is implicitly factorizing a shifted PMI matrix. This unifies the two paradigms and shows that the mathematical structure, not the training algorithm, determines what embeddings encode.
The Distributional Hypothesis
The idea originates with Harris (1954) and Firth (1957). Firth's formulation: "You shall know a word by the company it keeps." The formal version:
Distributional Hypothesis: The meaning of a word is determined by the distribution of words that co-occur with it. Two words are semantically similar if and only if they appear in similar linguistic contexts.
This is an empirical claim, not a theorem. It works surprisingly well for certain types of semantic similarity (topical relatedness, synonymy) and poorly for others (antonymy, pragmatic meaning, logical entailment). The words "hot" and "cold" appear in nearly identical contexts but have opposite meanings.
Co-occurrence Matrices
Word-Context Co-occurrence Matrix
Given a vocabulary of $|V|$ words and a set of $|C|$ contexts (typically also words, within a window of size $w$), the co-occurrence matrix $M \in \mathbb{R}^{|V| \times |C|}$ has entries:

$$M_{ij} = \#(w_i, c_j)$$

where $\#(w_i, c_j)$ counts the number of times word $w_i$ appears within a window of size $w$ of context word $c_j$ in the corpus. The matrix is typically sparse and very large ($|V|$ can be $10^5$ to $10^6$).
The choice of context window matters. Small windows (e.g., $w \le 2$) capture syntactic similarity (words that are interchangeable in a sentence). Large windows (e.g., $w \ge 5$) capture topical similarity (words that appear in the same documents). This is a design choice, not a mathematical property.
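The counting step above can be sketched in a few lines. The toy corpus and window size here are illustrative assumptions, not examples from the text:

```python
from collections import Counter

def cooccurrence_counts(tokens, w=2):
    """Count (word, context) pairs within a symmetric window of size w."""
    counts = Counter()
    for i, word in enumerate(tokens):
        lo, hi = max(0, i - w), min(len(tokens), i + w + 1)
        for j in range(lo, hi):
            if j != i:                       # a word is not its own context
                counts[(word, tokens[j])] += 1
    return counts

corpus = "the cat sat on the mat the dog sat on the rug".split()
counts = cooccurrence_counts(corpus, w=2)
print(counts[("sat", "on")])   # "sat" and "on" co-occur twice
```

Shrinking `w` to 1 here keeps only immediate neighbors, the syntactic end of the design spectrum described above.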
Pointwise Mutual Information
Raw co-occurrence counts are dominated by frequent words ("the", "of", "and"). PMI corrects for this by measuring whether two words co-occur more than expected under independence.
Pointwise Mutual Information
The pointwise mutual information between word $w$ and context $c$ is:

$$\mathrm{PMI}(w, c) = \log \frac{P(w, c)}{P(w)\,P(c)} = \log \frac{\#(w, c) \cdot N}{\#(w) \cdot \#(c)}$$

where $N$ is the total number of word-context pairs. PMI is positive when $w$ and $c$ co-occur more than expected, zero when independent, and negative when they co-occur less than expected.
Positive PMI (PPMI): In practice, negative PMI values are unreliable (they require observing that a rare event is even rarer than expected, which needs enormous corpora). The standard fix is:

$$\mathrm{PPMI}(w, c) = \max\big(\mathrm{PMI}(w, c),\; 0\big)$$
PMI is not symmetric in general practice
Mathematically, $\mathrm{PMI}(w, c) = \mathrm{PMI}(c, w)$ since the joint probability is symmetric. However, in implementations where the context set differs from the vocabulary (e.g., using dependency relations as contexts), the resulting matrix need not be square or symmetric. The symmetry holds only when word and context vocabularies coincide and the co-occurrence relation is symmetric.
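A minimal dense-matrix PPMI computation, assuming a square count matrix with identical word and context vocabularies (a sketch, not optimized for the sparse matrices used in practice):

```python
import numpy as np

def ppmi(M):
    """PPMI of a dense co-occurrence count matrix (rows = words, cols = contexts)."""
    M = np.asarray(M, dtype=float)
    N = M.sum()
    row = M.sum(axis=1, keepdims=True)     # word marginals  #(w)
    col = M.sum(axis=0, keepdims=True)     # context marginals  #(c)
    with np.errstate(divide="ignore"):     # log(0) -> -inf, clipped next line
        pmi = np.log(M * N) - np.log(row * col)
    return np.maximum(pmi, 0.0)            # PPMI: clip negative PMI to zero

M = np.array([[2.0, 1.0],                  # toy counts, illustrative only
              [1.0, 4.0]])
P = ppmi(M)
```

Note how the clipping in the last line implements the "standard fix": below-independence cells become exactly zero, which also keeps the matrix sparse.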
SVD-Based Embeddings (Latent Semantic Analysis)
The PPMI matrix is high-dimensional ($|V| \times |C|$) and sparse. Singular value decomposition provides the optimal low-rank approximation (by the Eckart-Young theorem), giving dense, low-dimensional word vectors.
LSA Embeddings via Truncated SVD
Given the PPMI matrix $M$, compute the truncated SVD:

$$M \approx U_d \Sigma_d V_d^\top$$

where $d$ is the embedding dimension (typically 100-300). The word embeddings are:

$$W = U_d \Sigma_d^{1/2}$$

Each row of $W$ is the $d$-dimensional vector for a word. The context embeddings are $C = V_d \Sigma_d^{1/2}$. The power $1/2$ on $\Sigma_d$ distributes the singular values equally between word and context representations. Some implementations use $\Sigma_d^{p}$ for $p \in [0, 1]$, with $p = 0$ and $p = 1$ as common alternatives.
This is exactly the Latent Semantic Analysis (LSA) method of Deerwester et al. (1990), though they applied it to the term-document matrix rather than the word-context matrix. The mathematical machinery is identical.
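A sketch of the truncated-SVD construction on a made-up $3 \times 3$ PPMI matrix. By Eckart-Young, the Frobenius reconstruction error of the rank-$d$ product equals the first discarded singular value:

```python
import numpy as np

P = np.array([[0.0, 1.2, 0.1],
              [1.2, 0.0, 0.9],
              [0.1, 0.9, 0.0]])     # toy PPMI matrix (illustrative values)

d = 2                               # embedding dimension
U, S, Vt = np.linalg.svd(P)
W = U[:, :d] * np.sqrt(S[:d])       # word vectors (rows = words), power p = 1/2
C = Vt[:d].T * np.sqrt(S[:d])       # context vectors

approx = W @ C.T                    # best rank-d approximation of P
err = np.linalg.norm(P - approx)    # Frobenius error = first dropped singular value
```

With only one singular value dropped, `err` equals `S[2]` up to floating point, which is the Eckart-Young optimality guarantee in miniature.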
The Skip-Gram and PMI Connection
Word2vec's skip-gram model with negative sampling (SGNS) trains word and context vectors $\vec{w}$ and $\vec{c}$ by maximizing:

$$\ell = \sum_{(w, c)} \#(w, c) \left[ \log \sigma(\vec{w} \cdot \vec{c}) + k \cdot \mathbb{E}_{c_N \sim P_n}\!\left[ \log \sigma(-\vec{w} \cdot \vec{c}_N) \right] \right]$$

where $\sigma(x) = 1/(1 + e^{-x})$ is the sigmoid function, $k$ is the number of negative samples, and $P_n$ is the noise distribution (typically the unigram distribution raised to the 3/4 power).
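To make the objective concrete, here is the (negated, so it is minimized) contribution of one positive pair plus $k$ sampled negatives. The vocabulary, counts, and vectors are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

d, k = 50, 5                                       # embedding dim, negative samples
unigram = np.array([5.0, 3.0, 1.0, 1.0])           # toy unigram counts
P_n = unigram**0.75 / (unigram**0.75).sum()        # noise distribution, 3/4 power

w_vec = rng.normal(scale=0.1, size=d)              # vector for the target word
C = rng.normal(scale=0.1, size=(len(unigram), d))  # context embedding table
c_pos = C[0]                                       # the observed context word

neg_ids = rng.choice(len(unigram), size=k, p=P_n)  # draw k noise contexts
loss = -(np.log(sigmoid(w_vec @ c_pos))            # score real pair high ...
         + np.log(sigmoid(-C[neg_ids] @ w_vec)).sum())  # ... noise pairs low
```

Each term is a binary logistic loss: real pair vs. noise pair, which is exactly the classification view used in the proof sketch below.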
Skip-Gram with Negative Sampling Implicitly Factorizes a Shifted PMI Matrix
Statement
Let $\vec{w}$ and $\vec{c}$ be the word and context vectors learned by word2vec SGNS with $k$ negative samples. At the global optimum of the SGNS objective:

$$\vec{w} \cdot \vec{c} = \mathrm{PMI}(w, c) - \log k$$

The skip-gram model implicitly factorizes the matrix $M^{\mathrm{SGNS}}$ with entries $M^{\mathrm{SGNS}}_{wc} = \mathrm{PMI}(w, c) - \log k$, which is the PMI matrix shifted by $-\log k$.
Intuition
Each word-context pair contributes a logistic regression problem: distinguish the real pair from noise pairs. The optimal solution to this binary classification sets the log-odds equal to the log ratio of the data distribution to the noise distribution, which is exactly PMI (up to the shift from the negative samples). The dot product of the learned vectors converges to this PMI value.
Proof Sketch
For a fixed word-context pair $(w, c)$, the expected SGNS objective for that pair is:

$$\ell(w, c) = \#(w, c) \log \sigma(x) + k \cdot \#(w) \cdot \frac{\#(c)}{N} \log \sigma(-x)$$

where $x = \vec{w} \cdot \vec{c}$. Taking the derivative and setting it to zero:

$$\frac{\partial \ell}{\partial x} = \#(w, c)\,\sigma(-x) - k \cdot \#(w) \cdot \frac{\#(c)}{N}\,\sigma(x) = 0$$

Solving for $x$ using $\sigma(x)/\sigma(-x) = e^{x}$:

$$x = \vec{w} \cdot \vec{c} = \log\!\left( \frac{\#(w, c) \cdot N}{\#(w) \cdot \#(c)} \right) - \log k = \mathrm{PMI}(w, c) - \log k$$

When $k = 1$:

$$\vec{w} \cdot \vec{c} = \mathrm{PMI}(w, c)$$
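The stationarity condition can be checked numerically. The counts below are made up for illustration:

```python
import numpy as np

sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

# made-up corpus statistics: pair count, word count, context count, total, k
n_wc, n_w, n_c, N, k = 40.0, 100.0, 200.0, 10_000.0, 5.0

pmi = np.log(n_wc * N / (n_w * n_c))
x_star = pmi - np.log(k)               # claimed optimum: PMI - log k

# derivative of the per-pair objective at x_star:
#   #(w,c) * sigma(-x) - k * #(w) * #(c)/N * sigma(x)
grad = n_wc * sigmoid(-x_star) - k * n_w * (n_c / N) * sigmoid(x_star)
print(abs(grad) < 1e-9)                # True: x_star is the stationary point
```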
Why It Matters
This theorem unifies count-based and prediction-based word embedding methods. The two paradigms that the NLP community treated as distinct (SVD of PMI matrix vs. neural network training) are optimizing closely related objectives. The differences in practice come from implementation details: subsampling of frequent words, the 3/4 power on the noise distribution, stochastic optimization vs. exact SVD, and the dimensionality bottleneck forcing implicit regularization.
Failure Mode
The equivalence holds at the global optimum with infinite data and full-rank embeddings. In practice: (1) embeddings have dimension $d \ll |V|$, so the factorization is approximate and the low-rank constraint acts as implicit regularization; (2) stochastic gradient descent may not reach the global optimum; (3) the 3/4 power on the noise distribution (used in practice but not in the theorem) changes the implicit matrix being factorized.
GloVe: Explicit Co-occurrence Factorization
GloVe (Pennington et al., 2014) takes the connection between co-occurrence and embeddings and makes it explicit. Instead of implicit factorization through a prediction task, GloVe directly optimizes:
$$J = \sum_{i, j} f(X_{ij}) \left( \vec{w}_i \cdot \vec{c}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2$$

where $f$ is a weighting function that caps the influence of very frequent pairs (typically $f(x) = \min\!\big((x / x_{\max})^{3/4},\, 1\big)$ with $x_{\max} = 100$) and $b_i, \tilde{b}_j$ are bias terms.
The target relation can be rewritten as $\vec{w}_i \cdot \vec{c}_j + b_i + \tilde{b}_j \approx \log X_{ij}$, so GloVe is fitting the log co-occurrence (up to bias terms) with a weighted least squares objective. This is a direct matrix factorization, closer in spirit to SVD than to word2vec.
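A sketch of a single term of the objective, using the capped weighting function from the GloVe paper ($\alpha = 3/4$, $x_{\max} = 100$). The vectors, biases, and co-occurrence count are illustrative:

```python
import numpy as np

def f(x, x_max=100.0, alpha=0.75):
    """GloVe weighting: grows with the count, capped at 1 for frequent pairs."""
    return min((x / x_max) ** alpha, 1.0)

rng = np.random.default_rng(1)
d = 50
w_i = rng.normal(scale=0.1, size=d)    # word vector
c_j = rng.normal(scale=0.1, size=d)    # context vector
b_i, b_j = 0.0, 0.0                    # bias terms
X_ij = 25.0                            # co-occurrence count for this pair

residual = w_i @ c_j + b_i + b_j - np.log(X_ij)
loss_ij = f(X_ij) * residual**2        # one term of the weighted least squares
```

Summing `loss_ij` over all nonzero cells of the co-occurrence matrix gives the full objective; cells with $X_{ij} = 0$ are simply skipped, which is how GloVe avoids the $\log 0$ problem.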
What Distributional Embeddings Capture and Miss
Distributional similarity is not semantic identity
Words with similar distributional profiles may be synonyms ("big" and "large"), but they may also be antonyms ("hot" and "cold"), co-hyponyms ("cat" and "dog"), or words related by other systematic patterns. Distributional methods capture relatedness (words that appear in similar contexts) but do not reliably distinguish the type of relation. The pair (king, queen) and the pair (king, kingdom) are both distributionally similar for different reasons.
Static embeddings assign one vector per word
In distributional models like word2vec and GloVe, each word type gets a single vector regardless of context. The word "bank" has one embedding whether it means a river bank or a financial institution. This is a structural limitation that contextual models (ELMo, BERT, GPT) address by computing word representations conditioned on the surrounding sentence.
What embeddings encode well:
- Semantic similarity and relatedness (cosine similarity correlates with human judgments)
- Analogy structure (the famous $\vec{v}_{\mathrm{king}} - \vec{v}_{\mathrm{man}} + \vec{v}_{\mathrm{woman}} \approx \vec{v}_{\mathrm{queen}}$ pattern)
- Topical clustering (words from the same domain cluster together)
What embeddings miss:
- Negation and logical operators ("not good" vs. "good")
- Compositionality beyond simple addition
- Rare words and proper nouns (insufficient co-occurrence data)
- Grounding (no connection to perceptual or physical meaning)
From Static to Contextual: The Transformer Extension
Transformers extend distributional semantics by making word representations context-dependent. Instead of a fixed vector for each word type, a transformer computes a representation for each word token conditioned on the entire input sequence. The self-attention mechanism computes, for each position, a weighted sum over all other positions, where the weights are determined by the content of the tokens. This is distributional semantics applied at the token level within a single sequence, rather than aggregated across a corpus.
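A minimal single-head self-attention computation in numpy makes the "content-dependent weighted sum" concrete. Shapes and random inputs are illustrative; real transformers add learned layer stacks, multiple heads, masking, and positional information:

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 4, 8                           # sequence length, model dimension
X = rng.normal(size=(T, d))           # one embedding per token position

Wq, Wk, Wv = (rng.normal(scale=d**-0.5, size=(d, d)) for _ in range(3))
Q, K, V = X @ Wq, X @ Wk, X @ Wv      # queries, keys, values

scores = Q @ K.T / np.sqrt(d)                         # content-based affinities
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)        # softmax over positions
out = weights @ V                     # contextual representation per token
```

Each row of `out` is a representation of one token conditioned on the whole sequence: the token-level analogue of the corpus-level co-occurrence aggregation described above.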
The progression: co-occurrence matrix (global statistics) to word2vec (local prediction with global aggregation via SGD) to transformers (local computation, contextual representations). Each step adds expressiveness while preserving the core distributional insight that meaning is determined by context.
Key Takeaways
- The distributional hypothesis states that word meaning is determined by co-occurrence patterns
- Raw co-occurrence counts are dominated by frequent words; PMI corrects for expected co-occurrence
- SVD of the PPMI matrix (LSA) gives optimal low-rank word embeddings by Eckart-Young
- Word2vec SGNS implicitly factorizes a shifted PMI matrix (Levy and Goldberg, 2014)
- GloVe explicitly factorizes log co-occurrence with weighted least squares
- Static embeddings capture relatedness but not relation type, polysemy, or compositionality
- Contextual models (transformers) extend distributional semantics to token-level representations
Exercises
Problem
Consider a tiny corpus with vocabulary $\{a, b, c\}$ and the following bigram counts (window size 1):

|   | a  | b  | c |
|---|----|----|---|
| a | 0  | 10 | 2 |
| b | 10 | 0  | 8 |
| c | 2  | 8  | 0 |

The total count is $N = 40$, with marginals $\#(a) = 12$, $\#(b) = 18$, $\#(c) = 10$ (summing the full symmetric matrix, so each co-occurrence is counted once per direction). Compute $\mathrm{PPMI}(a, b)$ and $\mathrm{PPMI}(a, c)$. Which pair is more associated?
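After working the numbers by hand, the arithmetic can be checked with a short script:

```python
import numpy as np

M = np.array([[0.0, 10.0, 2.0],      # the count table from the problem
              [10.0, 0.0, 8.0],
              [2.0, 8.0, 0.0]])
N = M.sum()                          # 40
marg = M.sum(axis=1)                 # [12, 18, 10]

def ppmi(i, j):
    """PPMI from the count table; indices 0 = a, 1 = b, 2 = c."""
    return max(np.log(M[i, j] * N / (marg[i] * marg[j])), 0.0)

ppmi_ab = ppmi(0, 1)                 # positive: a, b co-occur more than expected
ppmi_ac = ppmi(0, 2)                 # clipped to 0: below-independence count
```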
Problem
Explain why the Levy-Goldberg result implies that increasing the number of negative samples in word2vec is equivalent to applying a stronger threshold on the PMI matrix. What does this mean for the behavior of the resulting embeddings?
Problem
Let $M$ be the symmetric PPMI matrix. Suppose you compute the rank-$d$ truncated SVD $M \approx U_d \Sigma_d V_d^\top$ and form word vectors $W = U_d \Sigma_d^{1/2}$. Show that the cosine similarity between word vectors $\vec{w}_i$ and $\vec{w}_j$ in this embedding is related to the inner product of rows $i$ and $j$ of the best rank-$d$ approximation of $M$, and explain why this connection justifies using cosine similarity to measure semantic relatedness.
References
Canonical:
- Harris, "Distributional Structure" (1954), the original distributional hypothesis
- Firth, Studies in Linguistic Analysis (1957), Chapter 1, "a word is characterized by the company it keeps"
- Deerwester et al., "Indexing by Latent Semantic Analysis" (1990), the LSA paper: SVD applied to term-document matrices
- Church & Hanks, "Word Association Norms, Mutual Information, and Lexicography" (1990), PMI for lexical association
Unifying Results:
- Levy & Goldberg, "Neural Word Embedding as Implicit Matrix Factorization" (NeurIPS 2014), Sections 2-4, the SGNS-PMI equivalence theorem
- Pennington, Socher, Manning, "GloVe: Global Vectors for Word Representation" (EMNLP 2014), Sections 2-3
Textbooks and Surveys:
- Jurafsky & Martin, Speech and Language Processing (3rd ed., 2024), Chapters 6-7
- Turney & Pantel, "From Frequency to Meaning: Vector Space Models of Semantics" (JAIR, 2010), Sections 1-5
Next Topics
- Word embeddings: word2vec, GloVe, and fastText in detail, training procedures, and evaluation
- Transformer architecture: contextual representations that extend distributional semantics to the token level
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Singular Value Decomposition (Layer 0A)
- Eigenvalues and Eigenvectors (Layer 0A)
- Matrix Operations and Properties (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
- Maximum Likelihood Estimation (Layer 0B)
- Common Probability Distributions (Layer 0A)
- Differentiation in $\mathbb{R}^n$ (Layer 0A)