

Distributional Semantics

You shall know a word by the company it keeps. The distributional hypothesis, co-occurrence matrices, PMI, SVD-based embeddings, and the mathematical bridge from linguistics to word vectors and transformers.


Why This Matters

Most of modern NLP rests on a single empirical observation: words that appear in similar contexts tend to have similar meanings. This is the distributional hypothesis. Every word embedding method, from LSA to word2vec to GloVe to the contextual representations in transformers, operationalizes this hypothesis in a specific mathematical way. Understanding the distributional semantics pipeline (co-occurrence matrix, PMI weighting, low-rank factorization) clarifies what these models actually learn and what they cannot capture.

The key result connecting "old-school" count-based methods to "new-school" prediction-based methods is the Levy and Goldberg (2014) theorem: word2vec's skip-gram with negative sampling is implicitly factorizing a shifted PMI matrix. This unifies the two paradigms and shows that the mathematical structure, not the training algorithm, determines what embeddings encode.

The Distributional Hypothesis

The idea originates with Harris (1954) and Firth (1957). Firth's formulation: "You shall know a word by the company it keeps." The formal version:

Distributional Hypothesis: The meaning of a word is determined by the distribution of words that co-occur with it. Two words are semantically similar if and only if they appear in similar linguistic contexts.

This is an empirical claim, not a theorem. It works surprisingly well for certain types of semantic similarity (topical relatedness, synonymy) and poorly for others (antonymy, pragmatic meaning, logical entailment). The words "hot" and "cold" appear in nearly identical contexts but have opposite meanings.

Co-occurrence Matrices

Definition

Word-Context Co-occurrence Matrix

Given a vocabulary $V$ of words and a set $C$ of contexts (typically also words within a window of size $L$), the co-occurrence matrix $M$ has entries:

$$M_{wc} = \#(w, c)$$

where $\#(w, c)$ counts the number of times word $w$ appears within a window of size $L$ of context word $c$ in the corpus. The matrix $M$ is typically sparse and very large ($|V|$ can be $10^5$ to $10^6$).

The choice of context window $L$ matters. Small windows ($L = 2$) capture syntactic similarity (words that are interchangeable in a sentence). Large windows ($L = 10$) capture topical similarity (words that appear in the same documents). This is a design choice, not a mathematical property.
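A minimal sketch of the counting step, assuming a symmetric window of size $L$ over a toy corpus; the names `corpus`, `vocab`, and `idx` are illustrative, not from any library:

```python
from collections import Counter

import numpy as np

corpus = "the cat sat on the mat the dog sat on the rug".split()
vocab = sorted(set(corpus))
idx = {w: i for i, w in enumerate(vocab)}
L = 2  # symmetric window size

# Count each (word, context) pair once per occurrence within the window.
counts = Counter()
for i, w in enumerate(corpus):
    for j in range(max(0, i - L), min(len(corpus), i + L + 1)):
        if j != i:
            counts[(w, corpus[j])] += 1

M = np.zeros((len(vocab), len(vocab)))
for (w, c), n in counts.items():
    M[idx[w], idx[c]] = n

print(M[idx["cat"], idx["sat"]])  # 1.0: "sat" occurs once within 2 tokens of "cat"
```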

Pointwise Mutual Information

Raw co-occurrence counts are dominated by frequent words ("the", "of", "and"). PMI corrects for this by measuring whether two words co-occur more than expected under independence.

Definition

Pointwise Mutual Information

The pointwise mutual information between word $w$ and context $c$ is:

$$\text{PMI}(w, c) = \log \frac{P(w, c)}{P(w) \cdot P(c)} = \log \frac{\#(w, c) \cdot |D|}{\#(w) \cdot \#(c)}$$

where $|D| = \sum_{w,c} \#(w,c)$ is the total number of word-context pairs. PMI is positive when $w$ and $c$ co-occur more than expected, zero when they are independent, and negative when they co-occur less than expected.

Positive PMI (PPMI): In practice, negative PMI values are unreliable (they require observing that a rare event is even rarer than expected, which needs enormous corpora). The standard fix is:

$$\text{PPMI}(w, c) = \max(0, \text{PMI}(w, c))$$
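As a sketch, the count-to-PPMI transformation takes a few lines of numpy. This assumes `M` is a small dense co-occurrence matrix like the one built above; a production pipeline would use sparse matrices throughout:

```python
import numpy as np

def ppmi(M, eps=1e-12):
    """PPMI of a dense co-occurrence matrix; eps guards the log at zero counts."""
    total = M.sum()                              # |D|: total word-context pairs
    p_w = M.sum(axis=1, keepdims=True) / total   # P(w), shape (|V|, 1)
    p_c = M.sum(axis=0, keepdims=True) / total   # P(c), shape (1, |C|)
    pmi = np.log((M / total + eps) / (p_w * p_c + eps))
    return np.maximum(pmi, 0.0)                  # clip negative PMI to zero
```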

Watch Out

PMI is not always symmetric in practice

Mathematically, $\text{PMI}(w, c) = \text{PMI}(c, w)$ since the joint probability is symmetric. However, in implementations where the context set $C$ differs from the vocabulary $V$ (e.g., using dependency relations as contexts), the resulting matrix need not be square or symmetric. The symmetry holds only when word and context vocabularies coincide and the co-occurrence relation is symmetric.

SVD-Based Embeddings (Latent Semantic Analysis)

The PPMI matrix is high-dimensional ($|V| \times |C|$) and sparse. Singular value decomposition provides the optimal low-rank approximation (by the Eckart-Young theorem), giving dense, low-dimensional word vectors.

Definition

LSA Embeddings via Truncated SVD

Given the PPMI matrix $M^+ \in \mathbb{R}^{|V| \times |C|}$, compute the truncated SVD:

$$M^+ \approx U_d \Sigma_d V_d^\top$$

where $d \ll \min(|V|, |C|)$ is the embedding dimension (typically 100-300). The word embeddings are:

$$W = U_d \Sigma_d^{1/2} \in \mathbb{R}^{|V| \times d}$$

Each row of $W$ is the $d$-dimensional vector for a word. The context embeddings are $C = V_d \Sigma_d^{1/2}$. The power $1/2$ on $\Sigma_d$ distributes the singular values equally between word and context representations. Some implementations use $W = U_d \Sigma_d^\alpha$ for $\alpha \in [0, 1]$, with $\alpha = 1$ and $\alpha = 0$ as common alternatives.

This is exactly the Latent Semantic Analysis (LSA) method of Deerwester et al. (1990), though they applied it to the term-document matrix rather than the word-context matrix. The mathematical machinery is identical.
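A sketch of the factorization step, assuming `M_ppmi` is the dense PPMI matrix from above and that $d$ does not exceed its rank:

```python
import numpy as np

def svd_embeddings(M_ppmi, d=100, alpha=0.5):
    """Rank-d embeddings from truncated SVD; alpha splits the singular values."""
    U, S, Vt = np.linalg.svd(M_ppmi, full_matrices=False)
    W = U[:, :d] * S[:d] ** alpha       # word vectors,    shape (|V|, d)
    C = Vt[:d, :].T * S[:d] ** alpha    # context vectors, shape (|C|, d)
    return W, C
```

For realistic vocabulary sizes the dense SVD above is infeasible; a sparse solver such as `scipy.sparse.linalg.svds` is the usual substitute.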

The Skip-Gram and PMI Connection

Word2vec's skip-gram model with negative sampling (SGNS) trains word and context vectors $\vec{w}$ and $\vec{c}$ by maximizing:

$$J = \sum_{(w,c) \in D} \left[ \log \sigma(\vec{w} \cdot \vec{c}) + k \cdot \mathbb{E}_{c' \sim P_n} \left[ \log \sigma(-\vec{w} \cdot \vec{c'}) \right] \right]$$

where $\sigma$ is the sigmoid function, $k$ is the number of negative samples, and $P_n$ is the noise distribution (typically the unigram distribution raised to the 3/4 power).
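A minimal sketch of this objective for a single positive pair with $k$ pre-sampled negative context vectors; real word2vec streams over the corpus and samples negatives from $P_n$ on the fly:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_pair_loss(w_vec, c_vec, neg_vecs):
    """Negated SGNS objective for one (w, c) pair; neg_vecs has shape (k, d)."""
    pos = np.log(sigmoid(w_vec @ c_vec))              # push sigma(w . c) toward 1
    neg = np.log(sigmoid(-(neg_vecs @ w_vec))).sum()  # push sigma(w . c') toward 0
    return -(pos + neg)                               # minimize the negation
```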

Theorem

Skip-Gram with Negative Sampling Implicitly Factorizes a Shifted PMI Matrix

Statement

Let $\vec{w}$ and $\vec{c}$ be the word and context vectors learned by word2vec SGNS with $k$ negative samples. At the global optimum of the SGNS objective:

$$\vec{w} \cdot \vec{c} = \text{PMI}(w, c) - \log k$$

The skip-gram model implicitly factorizes the matrix $M^*$ with entries $M^*_{wc} = \text{PMI}(w, c) - \log k$, which is the PMI matrix shifted by $\log k$.

Intuition

Each word-context pair $(w, c)$ contributes a logistic regression problem: distinguish the real pair from $k$ noise pairs. The optimal solution to this binary classification sets the log-odds equal to the log ratio of the data distribution to the noise distribution, which is exactly PMI (up to the shift from the $k$ negative samples). The dot product of the learned vectors converges to this PMI value.

Proof Sketch

For a fixed word-context pair $(w, c)$, the SGNS objective for that pair is:

$$\ell_{wc} = \#(w,c) \log \sigma(x) + k \cdot \#(w) \cdot P_n(c) \cdot \log \sigma(-x)$$

where $x = \vec{w} \cdot \vec{c}$. Taking the derivative and setting it to zero:

$$\#(w,c) \cdot \sigma(-x) = k \cdot \#(w) \cdot P_n(c) \cdot \sigma(x)$$

Solving for $x$ using $\sigma(-x)/\sigma(x) = e^{-x}$:

$$e^x = \frac{\#(w,c)}{k \cdot \#(w) \cdot P_n(c)}$$

When $P_n(c) = \#(c)/|D|$:

$$x = \log \frac{\#(w,c) \cdot |D|}{\#(w) \cdot \#(c)} - \log k = \text{PMI}(w,c) - \log k$$
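The fixed point can be checked numerically: minimize the per-pair loss in the scalar $x = \vec{w} \cdot \vec{c}$ and compare against $\text{PMI}(w,c) - \log k$. The counts below are made up for illustration:

```python
import numpy as np
from scipy.optimize import minimize_scalar

n_wc, n_w, n_c, D, k = 50.0, 400.0, 300.0, 10_000.0, 5  # illustrative counts

def neg_loss(x):
    p_n = n_c / D  # unigram noise distribution, as in the theorem
    pos = n_wc * np.log(1.0 / (1.0 + np.exp(-x)))
    neg = k * n_w * p_n * np.log(1.0 / (1.0 + np.exp(x)))
    return -(pos + neg)

x_star = minimize_scalar(neg_loss, bounds=(-10, 10), method="bounded").x
shifted_pmi = np.log(n_wc * D / (n_w * n_c)) - np.log(k)
print(x_star, shifted_pmi)  # the two values agree to several decimal places
```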

Why It Matters

This theorem unifies count-based and prediction-based word embedding methods. The two paradigms that the NLP community treated as distinct (SVD of PMI matrix vs. neural network training) are optimizing closely related objectives. The differences in practice come from implementation details: subsampling of frequent words, the 3/4 power on the noise distribution, stochastic optimization vs. exact SVD, and the dimensionality bottleneck forcing implicit regularization.

Failure Mode

The equivalence holds at the global optimum with infinite data and full-rank embeddings. In practice: (1) embeddings have dimension $d \ll |V|$, so the factorization is approximate and the low-rank constraint acts as implicit regularization; (2) stochastic gradient descent may not reach the global optimum; (3) the 3/4 power on the noise distribution (used in practice but not in the theorem) changes the implicit matrix being factorized.

GloVe: Explicit Co-occurrence Factorization

GloVe (Pennington et al., 2014) takes the connection between co-occurrence and embeddings and makes it explicit. Instead of implicit factorization through a prediction task, GloVe directly optimizes:

$$J = \sum_{w,c} f(\#(w,c)) \left( \vec{w} \cdot \vec{c} + b_w + b_c - \log \#(w,c) \right)^2$$

where $f$ is a weighting function that caps the influence of very frequent pairs (typically $f(x) = \min((x/x_{\max})^{3/4}, 1)$) and $b_w, b_c$ are bias terms.

The target $\log \#(w,c)$ can be rewritten as $\log P(w,c) + \text{const}$, so GloVe is fitting the log co-occurrence (up to bias terms) with a weighted least squares objective. This is a direct matrix factorization, closer in spirit to SVD than to word2vec.
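A sketch of the per-pair GloVe loss with the paper's weighting function ($x_{\max} = 100$, exponent $3/4$); the vectors, biases, and count passed in are illustrative inputs:

```python
import numpy as np

def glove_weight(x, x_max=100.0, alpha=0.75):
    """Caps the influence of very frequent pairs."""
    return min((x / x_max) ** alpha, 1.0)

def glove_pair_loss(w_vec, c_vec, b_w, b_c, count):
    """Weighted squared error against the log co-occurrence count."""
    err = w_vec @ c_vec + b_w + b_c - np.log(count)
    return glove_weight(count) * err ** 2
```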

What Distributional Embeddings Capture and Miss

Watch Out

Distributional similarity is not semantic identity

Words with similar distributional profiles may be synonyms ("big" and "large"), but they may also be antonyms ("hot" and "cold"), co-hyponyms ("cat" and "dog"), or words related by other systematic patterns. Distributional methods capture relatedness (words that appear in similar contexts) but do not reliably distinguish the type of relation. The pair (king, queen) and the pair (king, kingdom) are both distributionally similar for different reasons.

Watch Out

Static embeddings assign one vector per word

In distributional models like word2vec and GloVe, each word type gets a single vector regardless of context. The word "bank" has one embedding whether it means a river bank or a financial institution. This is a structural limitation that contextual models (ELMo, BERT, GPT) address by computing word representations conditioned on the surrounding sentence.

What embeddings encode well:

  • Semantic similarity and relatedness (cosine similarity correlates with human judgments)
  • Analogy structure (the famous $\vec{\text{king}} - \vec{\text{man}} + \vec{\text{woman}} \approx \vec{\text{queen}}$ pattern; see the sketch after this list)
  • Topical clustering (words from the same domain cluster together)
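A sketch of how the analogy pattern is queried: form the offset vector and rank the vocabulary by cosine similarity. Here `W` is assumed to be a $|V| \times d$ embedding matrix and `idx` a word-to-row mapping; both names are illustrative:

```python
import numpy as np

def analogy(W, idx, a, b, c, topn=1):
    """Words closest (by cosine) to vec(b) - vec(a) + vec(c), e.g. king - man + woman."""
    inv = {i: w for w, i in idx.items()}
    target = W[idx[b]] - W[idx[a]] + W[idx[c]]
    sims = W @ target / (np.linalg.norm(W, axis=1) * np.linalg.norm(target) + 1e-12)
    for w in (a, b, c):
        sims[idx[w]] = -np.inf  # exclude the query words themselves
    return [inv[i] for i in np.argsort(-sims)[:topn]]
```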

What embeddings miss:

  • Negation and logical operators ("not good" vs. "good")
  • Compositionality beyond simple addition
  • Rare words and proper nouns (insufficient co-occurrence data)
  • Grounding (no connection to perceptual or physical meaning)

From Static to Contextual: The Transformer Extension

Transformers extend distributional semantics by making word representations context-dependent. Instead of a fixed vector for each word type, a transformer computes a representation for each word token conditioned on the entire input sequence. The self-attention mechanism computes, for each position, a weighted sum over all other positions, where the weights are determined by the content of the tokens. This is distributional semantics applied at the token level within a single sequence, rather than aggregated across a corpus.
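To make "weighted sum determined by token content" concrete, here is a minimal single-head self-attention sketch; the random inputs and dimensions are illustrative, and real transformers add multiple heads, masking, positional information, and learned projections trained end to end:

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """One attention pass: each output row is a content-weighted sum of values."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv                # queries, keys, values
    scores = Q @ K.T / np.sqrt(K.shape[1])          # scaled dot-product scores
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)   # softmax over positions
    return weights @ V

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))                         # 5 tokens, 8-dim embeddings
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)                 # (5, 8): one vector per token
```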

The progression: co-occurrence matrix (global statistics) to word2vec (local prediction with global aggregation via SGD) to transformers (local computation, contextual representations). Each step adds expressiveness while preserving the core distributional insight that meaning is determined by context.

Key Takeaways

  • The distributional hypothesis states that word meaning is determined by co-occurrence patterns
  • Raw co-occurrence counts are dominated by frequent words; PMI corrects for expected co-occurrence
  • SVD of the PPMI matrix (LSA) gives optimal low-rank word embeddings by Eckart-Young
  • Word2vec SGNS implicitly factorizes a shifted PMI matrix (Levy and Goldberg, 2014)
  • GloVe explicitly factorizes log co-occurrence with weighted least squares
  • Static embeddings capture relatedness but not relation type, polysemy, or compositionality
  • Contextual models (transformers) extend distributional semantics to token-level representations

Exercises

Exercise (Core)

Problem

Consider a tiny corpus with vocabulary $V = \{a, b, c\}$ and the following bigram counts (window size 1):

      a    b    c
a     0   10    2
b    10    0    8
c     2    8    0

The total count is $|D| = 40$, with marginals $\#(a) = 12$, $\#(b) = 18$, $\#(c) = 10$ (summing the full symmetric matrix, so each co-occurrence is counted once per direction). Compute $\text{PMI}(a, b)$ and $\text{PMI}(a, c)$. Which pair is more associated?

Exercise (Core)

Problem

Explain why the Levy-Goldberg result implies that increasing the number of negative samples kk in word2vec is equivalent to applying a stronger threshold on the PMI matrix. What does this mean for the behavior of the resulting embeddings?

Exercise (Advanced)

Problem

Let $M^+ \in \mathbb{R}^{|V| \times |V|}$ be the symmetric PPMI matrix. Suppose you compute the rank-$d$ truncated SVD $M^+ \approx U_d \Sigma_d V_d^\top$ and form word vectors $W = U_d \Sigma_d^{1/2}$. Show that the cosine similarity between word vectors $w_i$ and $w_j$ in this embedding is related to the inner product of rows $i$ and $j$ of the best rank-$d$ approximation of $M^+$, and explain why this connection justifies using cosine similarity to measure semantic relatedness.

References

Canonical:

  • Harris, "Distributional Structure" (1954), the original distributional hypothesis
  • Firth, Studies in Linguistic Analysis (1957), Chapter 1, "a word is characterized by the company it keeps"
  • Deerwester et al., "Indexing by Latent Semantic Analysis" (1990), the LSA paper: SVD applied to term-document matrices
  • Church & Hanks, "Word Association Norms, Mutual Information, and Lexicography" (1990), PMI for lexical association

Unifying Results:

  • Levy & Goldberg, "Neural Word Embedding as Implicit Matrix Factorization" (NeurIPS 2014), Sections 2-4, the SGNS-PMI equivalence theorem
  • Pennington, Socher, Manning, "GloVe: Global Vectors for Word Representation" (EMNLP 2014), Sections 2-3

Textbooks and Surveys:

  • Jurafsky & Martin, Speech and Language Processing (3rd ed., 2024), Chapters 6-7
  • Turney & Pantel, "From Frequency to Meaning: Vector Space Models of Semantics" (JAIR, 2010), Sections 1-5

Next Topics

  • Word embeddings: word2vec, GloVe, and fastText in detail, training procedures, and evaluation
  • Transformer architecture: contextual representations that extend distributional semantics to the token level

