

Distributional Semantics

You shall know a word by the company it keeps. The distributional hypothesis, co-occurrence matrices, PMI, SVD-based embeddings, and the mathematical bridge from linguistics to word vectors and transformers.


Why This Matters

Most of modern NLP rests on a single empirical observation: words that appear in similar contexts tend to have similar meanings. This is the distributional hypothesis. Every word embedding method, from LSA to word2vec to GloVe to the contextual representations in transformers, operationalizes this hypothesis in a specific mathematical way. Understanding the distributional semantics pipeline (co-occurrence matrix, PMI weighting, low-rank factorization) clarifies what these models actually learn and what they cannot capture.

The key result connecting "old-school" count-based methods to "new-school" prediction-based methods is the Levy and Goldberg (2014) theorem: word2vec's skip-gram with negative sampling is implicitly factorizing a shifted PMI matrix. This unifies the two paradigms and shows that the mathematical structure, not the training algorithm, determines what embeddings encode.

The Distributional Hypothesis

The idea originates with Harris (1954) and Firth (1957). Firth's formulation: "You shall know a word by the company it keeps." The formal version:

Distributional Hypothesis: The meaning of a word is determined by the distribution of words that co-occur with it. Two words are semantically similar if and only if they appear in similar linguistic contexts.

This is an empirical claim, not a theorem. It works surprisingly well for certain types of semantic similarity (topical relatedness, synonymy) and poorly for others (antonymy, pragmatic meaning, logical entailment). The words "hot" and "cold" appear in nearly identical contexts but have opposite meanings.

Co-occurrence Matrices

Definition

Word-Context Co-occurrence Matrix

Given a vocabulary $V$ of words and a set $C$ of contexts (typically also words within a window of size $L$), the co-occurrence matrix $M$ has entries:

$$M_{wc} = \#(w, c)$$

where $\#(w, c)$ counts the number of times word $w$ appears within a window of size $L$ of context word $c$ in the corpus. The matrix $M$ is typically sparse and very large ($|V|$ can be $10^5$ to $10^6$).

The choice of context window $L$ matters. Small windows ($L = 2$) capture syntactic similarity (words that are interchangeable in a sentence). Large windows ($L = 10$) capture topical similarity (words that appear in the same documents). This is a design choice, not a mathematical property.
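A minimal sketch of the counting step, assuming a symmetric window of size $L$ over a toy corpus; the names `corpus`, `vocab`, and `idx` are illustrative, not from any library:

```python
from collections import Counter

import numpy as np

corpus = "the cat sat on the mat the dog sat on the rug".split()
vocab = sorted(set(corpus))
idx = {w: i for i, w in enumerate(vocab)}
L = 2  # symmetric window size

# Count each (word, context) pair once per occurrence within the window.
counts = Counter()
for i, w in enumerate(corpus):
    for j in range(max(0, i - L), min(len(corpus), i + L + 1)):
        if j != i:
            counts[(w, corpus[j])] += 1

M = np.zeros((len(vocab), len(vocab)))
for (w, c), n in counts.items():
    M[idx[w], idx[c]] = n

print(M[idx["cat"], idx["sat"]])  # 1.0: "sat" occurs once within 2 tokens of "cat"
```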

Pointwise Mutual Information

Raw co-occurrence counts are dominated by frequent words ("the", "of", "and"). PMI corrects for this by measuring whether two words co-occur more than expected under independence.

Definition

Pointwise Mutual Information

The pointwise mutual information between word $w$ and context $c$ is:

$$\text{PMI}(w, c) = \log \frac{P(w, c)}{P(w) \cdot P(c)} = \log \frac{\#(w, c) \cdot |D|}{\#(w) \cdot \#(c)}$$

where $|D| = \sum_{w,c} \#(w,c)$ is the total number of word-context pairs. PMI is positive when $w$ and $c$ co-occur more than expected, zero when they are independent, and negative when they co-occur less than expected.

Positive PMI (PPMI): In practice, negative PMI values are unreliable (they require observing that a rare event is even rarer than expected, which needs enormous corpora). The standard fix is:

$$\text{PPMI}(w, c) = \max(0, \text{PMI}(w, c))$$
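As a sketch, the count-to-PPMI transformation takes a few lines of numpy. This assumes `M` is a small dense co-occurrence matrix like the one built above; a production pipeline would use sparse matrices throughout:

```python
import numpy as np

def ppmi(M, eps=1e-12):
    """PPMI of a dense co-occurrence matrix; eps guards the log at zero counts."""
    total = M.sum()                              # |D|: total word-context pairs
    p_w = M.sum(axis=1, keepdims=True) / total   # P(w), shape (|V|, 1)
    p_c = M.sum(axis=0, keepdims=True) / total   # P(c), shape (1, |C|)
    pmi = np.log((M / total + eps) / (p_w * p_c + eps))
    return np.maximum(pmi, 0.0)                  # clip negative PMI to zero
```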

Watch Out

PMI is not always symmetric in practice

Mathematically, $\text{PMI}(w, c) = \text{PMI}(c, w)$ since the joint probability is symmetric. However, in implementations where the context set $C$ differs from the vocabulary $V$ (e.g., using dependency relations as contexts), the resulting matrix need not be square or symmetric. The symmetry holds only when word and context vocabularies coincide and the co-occurrence relation is symmetric.

SVD-Based Embeddings (Latent Semantic Analysis)

The PPMI matrix is high-dimensional ($|V| \times |C|$) and sparse. Singular value decomposition provides the optimal low-rank approximation (by the Eckart-Young theorem), giving dense, low-dimensional word vectors.

Definition

LSA Embeddings via Truncated SVD

Given the PPMI matrix $M^+ \in \mathbb{R}^{|V| \times |C|}$, compute the truncated SVD:

$$M^+ \approx U_d \Sigma_d V_d^\top$$

where $d \ll \min(|V|, |C|)$ is the embedding dimension (typically 100-300). The word embeddings are:

$$W = U_d \Sigma_d^{1/2} \in \mathbb{R}^{|V| \times d}$$

Each row of $W$ is the $d$-dimensional vector for a word. The context embeddings are $C = V_d \Sigma_d^{1/2}$. The power $1/2$ on $\Sigma_d$ distributes the singular values equally between word and context representations. Some implementations use $W = U_d \Sigma_d^\alpha$ for $\alpha \in [0, 1]$, with $\alpha = 1$ and $\alpha = 0$ as common alternatives.

This is exactly the Latent Semantic Analysis (LSA) method of Deerwester et al. (1990), though they applied it to the term-document matrix rather than the word-context matrix. The mathematical machinery is identical.
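A sketch of the factorization step, assuming `M_ppmi` is the dense PPMI matrix from above and that $d$ does not exceed its rank:

```python
import numpy as np

def svd_embeddings(M_ppmi, d=100, alpha=0.5):
    """Rank-d embeddings from truncated SVD; alpha splits the singular values."""
    U, S, Vt = np.linalg.svd(M_ppmi, full_matrices=False)
    W = U[:, :d] * S[:d] ** alpha       # word vectors,    shape (|V|, d)
    C = Vt[:d, :].T * S[:d] ** alpha    # context vectors, shape (|C|, d)
    return W, C
```

For realistic vocabulary sizes the dense SVD above is infeasible; a sparse solver such as `scipy.sparse.linalg.svds` is the usual substitute.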

The Skip-Gram and PMI Connection

Word2vec's skip-gram model with negative sampling (SGNS) trains word and context vectors $\vec{w}$ and $\vec{c}$ by maximizing:

$$J = \sum_{(w,c) \in D} \left[ \log \sigma(\vec{w} \cdot \vec{c}) + k \cdot \mathbb{E}_{c' \sim P_n} \left[ \log \sigma(-\vec{w} \cdot \vec{c'}) \right] \right]$$

where $\sigma$ is the sigmoid function, $k$ is the number of negative samples, and $P_n$ is the noise distribution (typically the unigram distribution raised to the 3/4 power).
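A minimal sketch of this objective for a single positive pair with $k$ pre-sampled negative context vectors; real word2vec streams over the corpus and samples negatives from $P_n$ on the fly:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_pair_loss(w_vec, c_vec, neg_vecs):
    """Negated SGNS objective for one (w, c) pair; neg_vecs has shape (k, d)."""
    pos = np.log(sigmoid(w_vec @ c_vec))              # push sigma(w . c) toward 1
    neg = np.log(sigmoid(-(neg_vecs @ w_vec))).sum()  # push sigma(w . c') toward 0
    return -(pos + neg)                               # minimize the negation
```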

Theorem

Skip-Gram with Negative Sampling Implicitly Factorizes a Shifted PMI Matrix

Statement

Let $\vec{w}$ and $\vec{c}$ be the word and context vectors learned by word2vec SGNS with $k$ negative samples. At the global optimum of the SGNS objective:

$$\vec{w} \cdot \vec{c} = \text{PMI}(w, c) - \log k$$

The skip-gram model implicitly factorizes the matrix $M^*$ with entries $M^*_{wc} = \text{PMI}(w, c) - \log k$, which is the PMI matrix shifted by $\log k$.

Intuition

Each word-context pair $(w, c)$ contributes a logistic regression problem: distinguish the real pair from $k$ noise pairs. The optimal solution to this binary classification sets the log-odds equal to the log ratio of the data distribution to the noise distribution, which is exactly PMI (up to the shift from the $k$ negative samples). The dot product of the learned vectors converges to this PMI value.

Proof Sketch

For a fixed word-context pair $(w, c)$, the SGNS objective for that pair is:

$$\ell_{wc} = \#(w,c) \log \sigma(x) + k \cdot \#(w) \cdot P_n(c) \cdot \log \sigma(-x)$$

where $x = \vec{w} \cdot \vec{c}$. Taking the derivative and setting it to zero:

$$\#(w,c) \cdot \sigma(-x) = k \cdot \#(w) \cdot P_n(c) \cdot \sigma(x)$$

Solving for $x$ using $\sigma(-x)/\sigma(x) = e^{-x}$:

$$e^x = \frac{\#(w,c)}{k \cdot \#(w) \cdot P_n(c)}$$

When $P_n(c) = \#(c)/|D|$:

$$x = \log \frac{\#(w,c) \cdot |D|}{\#(w) \cdot \#(c)} - \log k = \text{PMI}(w,c) - \log k$$
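The fixed point can be checked numerically: minimize the per-pair loss in the scalar $x = \vec{w} \cdot \vec{c}$ and compare against $\text{PMI}(w,c) - \log k$. The counts below are made up for illustration:

```python
import numpy as np
from scipy.optimize import minimize_scalar

n_wc, n_w, n_c, D, k = 50.0, 400.0, 300.0, 10_000.0, 5  # illustrative counts

def neg_loss(x):
    p_n = n_c / D  # unigram noise distribution, as in the theorem
    pos = n_wc * np.log(1.0 / (1.0 + np.exp(-x)))
    neg = k * n_w * p_n * np.log(1.0 / (1.0 + np.exp(x)))
    return -(pos + neg)

x_star = minimize_scalar(neg_loss, bounds=(-10, 10), method="bounded").x
shifted_pmi = np.log(n_wc * D / (n_w * n_c)) - np.log(k)
print(x_star, shifted_pmi)  # the two values agree to several decimal places
```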

Why It Matters

This theorem unifies count-based and prediction-based word embedding methods. The two paradigms that the NLP community treated as distinct (SVD of PMI matrix vs. neural network training) are optimizing closely related objectives. The differences in practice come from implementation details: subsampling of frequent words, the 3/4 power on the noise distribution, stochastic optimization vs. exact SVD, and the dimensionality bottleneck forcing implicit regularization.

Failure Mode

The equivalence holds at the global optimum with infinite data and full-rank embeddings. In practice: (1) embeddings have dimension $d \ll |V|$, so the factorization is approximate and the low-rank constraint acts as implicit regularization; (2) stochastic gradient descent may not reach the global optimum; (3) the 3/4 power on the noise distribution (used in practice but not in the theorem) changes the implicit matrix being factorized.

GloVe: Explicit Co-occurrence Factorization

GloVe (Pennington et al., 2014) takes the connection between co-occurrence and embeddings and makes it explicit. Instead of implicit factorization through a prediction task, GloVe directly optimizes:

$$J = \sum_{w,c} f(\#(w,c)) \left( \vec{w} \cdot \vec{c} + b_w + b_c - \log \#(w,c) \right)^2$$

where $f$ is a weighting function that caps the influence of very frequent pairs (typically $f(x) = \min((x/x_{\max})^{3/4}, 1)$) and $b_w, b_c$ are bias terms.

The target $\log \#(w,c)$ can be rewritten as $\log P(w,c) + \text{const}$, so GloVe is fitting the log co-occurrence (up to bias terms) with a weighted least squares objective. This is a direct matrix factorization, closer in spirit to SVD than to word2vec.
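A sketch of the per-pair GloVe loss with the paper's weighting function ($x_{\max} = 100$, exponent $3/4$); the vectors, biases, and count passed in are illustrative inputs:

```python
import numpy as np

def glove_weight(x, x_max=100.0, alpha=0.75):
    """Caps the influence of very frequent pairs."""
    return min((x / x_max) ** alpha, 1.0)

def glove_pair_loss(w_vec, c_vec, b_w, b_c, count):
    """Weighted squared error against the log co-occurrence count."""
    err = w_vec @ c_vec + b_w + b_c - np.log(count)
    return glove_weight(count) * err ** 2
```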

What Distributional Embeddings Capture and Miss

Watch Out

Distributional similarity is not semantic identity

Words with similar distributional profiles may be synonyms ("big" and "large"), but they may also be antonyms ("hot" and "cold"), co-hyponyms ("cat" and "dog"), or words related by other systematic patterns. Distributional methods capture relatedness (words that appear in similar contexts) but do not reliably distinguish the type of relation. The pair (king, queen) and the pair (king, kingdom) are both distributionally similar for different reasons.

Watch Out

Static embeddings assign one vector per word

In distributional models like word2vec and GloVe, each word type gets a single vector regardless of context. The word "bank" has one embedding whether it means a river bank or a financial institution. This is a structural limitation that contextual models (ELMo, BERT, GPT) address by computing word representations conditioned on the surrounding sentence.

What embeddings encode well:

  • Semantic similarity and relatedness (cosine similarity correlates with human judgments)
  • Analogy structure (the famous $\vec{\text{king}} - \vec{\text{man}} + \vec{\text{woman}} \approx \vec{\text{queen}}$ pattern; see the sketch after this list)
  • Topical clustering (words from the same domain cluster together)
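A sketch of how the analogy pattern is queried: form the offset vector and rank the vocabulary by cosine similarity. Here `W` is assumed to be a $|V| \times d$ embedding matrix and `idx` a word-to-row mapping; both names are illustrative:

```python
import numpy as np

def analogy(W, idx, a, b, c, topn=1):
    """Words closest (by cosine) to vec(b) - vec(a) + vec(c), e.g. king - man + woman."""
    inv = {i: w for w, i in idx.items()}
    target = W[idx[b]] - W[idx[a]] + W[idx[c]]
    sims = W @ target / (np.linalg.norm(W, axis=1) * np.linalg.norm(target) + 1e-12)
    for w in (a, b, c):
        sims[idx[w]] = -np.inf  # exclude the query words themselves
    return [inv[i] for i in np.argsort(-sims)[:topn]]
```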

What embeddings miss:

  • Negation and logical operators ("not good" vs. "good")
  • Compositionality beyond simple addition
  • Rare words and proper nouns (insufficient co-occurrence data)
  • Grounding (no connection to perceptual or physical meaning)

From Static to Contextual: The Transformer Extension

Transformers extend distributional semantics by making word representations context-dependent. Instead of a fixed vector for each word type, a transformer computes a representation for each word token conditioned on the entire input sequence. The self-attention mechanism computes, for each position, a weighted sum over all other positions, where the weights are determined by the content of the tokens. This is distributional semantics applied at the token level within a single sequence, rather than aggregated across a corpus.
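To make "weighted sum determined by token content" concrete, here is a minimal single-head self-attention sketch; the random inputs and dimensions are illustrative, and real transformers add multiple heads, masking, positional information, and learned projections trained end to end:

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """One attention pass: each output row is a content-weighted sum of values."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv                # queries, keys, values
    scores = Q @ K.T / np.sqrt(K.shape[1])          # scaled dot-product scores
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)   # softmax over positions
    return weights @ V

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))                         # 5 tokens, 8-dim embeddings
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)                 # (5, 8): one vector per token
```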

The progression: co-occurrence matrix (global statistics) to word2vec (local prediction with global aggregation via SGD) to transformers (local computation, contextual representations). Each step adds expressiveness while preserving the core distributional insight that meaning is determined by context.

Key Takeaways

  • The distributional hypothesis states that word meaning is determined by co-occurrence patterns
  • Raw co-occurrence counts are dominated by frequent words; PMI corrects for expected co-occurrence
  • SVD of the PPMI matrix (LSA) gives optimal low-rank word embeddings by Eckart-Young
  • Word2vec SGNS implicitly factorizes a shifted PMI matrix (Levy and Goldberg, 2014)
  • GloVe explicitly factorizes log co-occurrence with weighted least squares
  • Static embeddings capture relatedness but not relation type, polysemy, or compositionality
  • Contextual models (transformers) extend distributional semantics to token-level representations

Exercises

Exercise (Core)

Problem

Consider a tiny corpus with vocabulary $V = \{a, b, c\}$ and the following bigram counts (window size 1):

      a    b    c
a     0   10    2
b    10    0    8
c     2    8    0

The total count is $|D| = 40$, with marginals $\#(a) = 12$, $\#(b) = 18$, $\#(c) = 10$ (summing the full symmetric matrix, so each co-occurrence is counted once per direction). Compute $\text{PMI}(a, b)$ and $\text{PMI}(a, c)$. Which pair is more associated?

Exercise (Core)

Problem

Explain why the Levy-Goldberg result implies that increasing the number of negative samples kk in word2vec is equivalent to applying a stronger threshold on the PMI matrix. What does this mean for the behavior of the resulting embeddings?

Exercise (Advanced)

Problem

Let $M^+ \in \mathbb{R}^{|V| \times |V|}$ be the symmetric PPMI matrix. Suppose you compute the rank-$d$ truncated SVD $M^+ \approx U_d \Sigma_d V_d^\top$ and form word vectors $W = U_d \Sigma_d^{1/2}$. Show that the cosine similarity between word vectors $w_i$ and $w_j$ in this embedding is related to the inner product of rows $i$ and $j$ of the best rank-$d$ approximation of $M^+$, and explain why this connection justifies using cosine similarity to measure semantic relatedness.

References

Canonical:

  • Harris, "Distributional Structure" (1954), the original distributional hypothesis
  • Firth, Studies in Linguistic Analysis (1957), Chapter 1, "a word is characterized by the company it keeps"
  • Deerwester et al., "Indexing by Latent Semantic Analysis" (1990), the LSA paper: SVD applied to term-document matrices
  • Church & Hanks, "Word Association Norms, Mutual Information, and Lexicography" (1990), PMI for lexical association

Unifying Results:

  • Levy & Goldberg, "Neural Word Embedding as Implicit Matrix Factorization" (NeurIPS 2014), Sections 2-4, the SGNS-PMI equivalence theorem
  • Pennington, Socher, Manning, "GloVe: Global Vectors for Word Representation" (EMNLP 2014), Sections 2-3

Textbooks and Surveys:

  • Jurafsky & Martin, Speech and Language Processing (3rd ed., 2024), Chapters 6-7
  • Turney & Pantel, "From Frequency to Meaning: Vector Space Models of Semantics" (JAIR, 2010), Sections 1-5

Next Topics

  • Word embeddings: word2vec, GloVe, and fastText in detail, training procedures, and evaluation
  • Transformer architecture: contextual representations that extend distributional semantics to the token level

