
ML Methods

Word Embeddings

Dense vector representations of words: Word2Vec (skip-gram, CBOW), negative sampling, GloVe, the distributional hypothesis, and why embeddings transformed NLP from sparse features to learned representations.


Why This Matters

Before word embeddings, NLP systems represented words as one-hot vectors --- sparse, high-dimensional, and devoid of semantic content. The word "king" and the word "queen" were as far apart as "king" and "refrigerator." Every NLP model had to discover semantic relationships from scratch.

Word embeddings changed this by learning dense, low-dimensional representations where semantic similarity is captured by geometric proximity. They showed that unsupervised learning on raw text could produce representations encoding rich linguistic structure --- a finding that foreshadowed the representation learning revolution leading to transformers and large language models.

Mental Model

The distributional hypothesis is the foundation: a word is characterized by the company it keeps. Words that appear in similar contexts should have similar representations. This core idea belongs to the field of distributional semantics, which formalizes meaning through co-occurrence statistics.

Word embedding methods operationalize this idea: train a model to predict context from words (or vice versa), and use the learned parameters as word representations. The prediction task is a pretext --- the real goal is the representations that emerge as a byproduct.

Think of each word as a point in $\mathbb{R}^d$ (typically $d = 100$ to $300$). Training positions the points so that words appearing in similar contexts end up nearby. The resulting geometry is meaningful: directions in the embedding space correspond to semantic relationships like gender, tense, and plurality.

Word2Vec: Skip-Gram and CBOW

Definition

Skip-Gram Model

The skip-gram model predicts context words from a center word. Given a center word $w_c$ and a context window of size $k$, the model maximizes the probability of observing the context words $w_{c-k}, \ldots, w_{c+k}$ (excluding $w_c$):

$$\max_\theta \sum_{t=1}^{T} \sum_{\substack{-k \leq j \leq k \\ j \neq 0}} \log P(w_{t+j} \mid w_t; \theta)$$

where $T$ is the corpus size and the conditional probability uses the softmax over the full vocabulary $V$:

$$P(w_o \mid w_c) = \frac{\exp(u_{w_o}^\top v_{w_c})}{\sum_{w \in V} \exp(u_w^\top v_{w_c})}$$

Here $v_w \in \mathbb{R}^d$ is the center embedding and $u_w \in \mathbb{R}^d$ is the context embedding for word $w$. Each word has two vectors; typically the center embeddings are used as the final word representations.
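The softmax above is easy to state concretely. A minimal NumPy sketch with a toy vocabulary and random (untrained) vectors, just to make the computation explicit --- the sizes and variable names here are illustrative, not from any real model:

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 8, 4                  # toy vocabulary size and embedding dimension
v = rng.normal(size=(V, d))  # center embeddings v_w, one row per word
u = rng.normal(size=(V, d))  # context embeddings u_w

def p_context_given_center(center: int) -> np.ndarray:
    """Full-softmax P(w_o | w_c): normalize exp(u_w^T v_c) over the vocabulary."""
    scores = u @ v[center]   # u_w^T v_{w_c} for every word w
    scores -= scores.max()   # subtract the max for numerical stability
    e = np.exp(scores)
    return e / e.sum()

probs = p_context_given_center(3)
print(abs(probs.sum() - 1.0) < 1e-12)  # True: a valid probability distribution
```

Note that the denominator touches every row of `u` --- this is exactly the $O(|V|)$ cost per prediction that negative sampling (below) avoids.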

Definition

CBOW (Continuous Bag of Words)

CBOW predicts the center word from its context. Given context words $w_{c-k}, \ldots, w_{c+k}$ (excluding $w_c$), the model predicts $w_c$:

$$P(w_c \mid \text{context}) = \frac{\exp(u_{w_c}^\top \bar{v})}{\sum_{w \in V} \exp(u_w^\top \bar{v})}$$

where $\bar{v} = \frac{1}{2k}\sum_{j \neq 0} v_{w_{c+j}}$ is the average of the context word embeddings.

CBOW is faster to train than skip-gram because it makes one prediction per context window instead of $2k$ predictions. Skip-gram tends to perform better on rare words because each occurrence produces multiple training examples.
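The difference in how the two models slice a corpus into training examples can be made concrete. A small sketch in plain Python (the helper names are hypothetical, not from any library):

```python
def skipgram_pairs(tokens, k=2):
    """Skip-gram: each position t yields up to 2k (center, context) pairs."""
    pairs = []
    for t in range(len(tokens)):
        for j in range(max(0, t - k), min(len(tokens), t + k + 1)):
            if j != t:
                pairs.append((tokens[t], tokens[j]))
    return pairs

def cbow_examples(tokens, k=2):
    """CBOW: each position t yields one (context window, center) example."""
    examples = []
    for t in range(len(tokens)):
        ctx = [tokens[j] for j in range(max(0, t - k), min(len(tokens), t + k + 1))
               if j != t]
        examples.append((ctx, tokens[t]))
    return examples

toks = "the quick brown fox jumps".split()
print(len(skipgram_pairs(toks)), len(cbow_examples(toks)))  # 14 5
```

The count gap is the point: a single occurrence of a rare word produces several skip-gram gradient updates, while in CBOW it is averaged into one context vector per window.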

Negative Sampling

The softmax denominator $\sum_{w \in V} \exp(u_w^\top v_{w_c})$ sums over the entire vocabulary, which can be hundreds of thousands of words. Computing it at every training step is prohibitively expensive.

Definition

Negative Sampling

Negative sampling replaces the full softmax objective with a binary classification task. For each positive pair $(w_c, w_o)$ (center word and actual context word), sample $K$ negative words $w_1^-, \ldots, w_K^-$ from a noise distribution $P_n(w)$ (typically $P_n(w) \propto \text{count}(w)^{3/4}$). The objective becomes:

$$\log \sigma(u_{w_o}^\top v_{w_c}) + \sum_{k=1}^{K} \mathbb{E}_{w_k^- \sim P_n}\left[\log \sigma(-u_{w_k^-}^\top v_{w_c})\right]$$

where $\sigma(x) = 1/(1 + e^{-x})$ is the sigmoid function. This maximizes the dot product between the center word and its actual context while minimizing it with random negative samples.

Typical values: $K = 5$ to $15$ for small datasets, $K = 2$ to $5$ for large datasets. The $3/4$ power in the noise distribution upweights rare words relative to their frequency, preventing common words from dominating.
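Both pieces --- the $3/4$-power noise distribution and the per-pair objective --- are short to write down. A NumPy sketch with toy counts and random (untrained) vectors; the negated objective is the loss one would minimize:

```python
import numpy as np

rng = np.random.default_rng(1)
counts = np.array([1000.0, 200.0, 50.0, 10.0, 5.0])  # toy unigram counts
p_noise = counts ** 0.75
p_noise /= p_noise.sum()          # P_n(w) proportional to count(w)^{3/4}

def sgns_loss(v_c, u_o, u_neg):
    """Negated negative-sampling objective for one (center, context) pair."""
    sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
    pos = np.log(sigmoid(u_o @ v_c))             # pull the real context closer
    neg = np.log(sigmoid(-(u_neg @ v_c))).sum()  # push K sampled negatives away
    return -(pos + neg)

d, K = 4, 5
v_c, u_o = rng.normal(size=d), rng.normal(size=d)
neg_idx = rng.choice(len(counts), size=K, p=p_noise)  # draw K negative word ids
u_neg = rng.normal(size=(K, d))
print(p_noise[-1] / p_noise[0] > counts[-1] / counts[0])  # True: rare word upweighted
```

The final line shows the effect of the $3/4$ power: the rarest word's share of the noise distribution is larger than its share of raw frequency.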

Negative sampling is not just a computational trick. Levy and Goldberg (2014) showed that the skip-gram with negative sampling objective implicitly factorizes the pointwise mutual information (PMI) matrix of word co-occurrences, shifted by $\log K$. This connects the prediction-based approach of Word2Vec to the count-based approach of GloVe.

GloVe: Global Vectors

Definition

GloVe (Global Vectors for Word Representation)

GloVe directly factorizes the word co-occurrence matrix. Let $X_{ij}$ be the number of times word $j$ appears in the context of word $i$. GloVe minimizes:

$$\sum_{i,j=1}^{|V|} f(X_{ij})\left(w_i^\top \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij}\right)^2$$

where $w_i, \tilde{w}_j \in \mathbb{R}^d$ are word and context vectors, $b_i, \tilde{b}_j$ are bias terms, and $f$ is a weighting function that downweights very frequent co-occurrences:

$$f(x) = \begin{cases} (x/x_{\max})^\alpha & \text{if } x < x_{\max} \\ 1 & \text{otherwise} \end{cases}$$

with typical values $x_{\max} = 100$, $\alpha = 3/4$.

GloVe uses the global co-occurrence statistics of the corpus (the full matrix $X$), while Word2Vec operates on local context windows. In practice, both methods produce embeddings of similar quality.
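The weighting function is worth internalizing: $f(0) = 0$, so pairs that never co-occur contribute nothing to the loss (and $\log 0$ is never evaluated), and $f$ saturates at $1$ so extremely frequent pairs cannot dominate. A direct transcription:

```python
def glove_weight(x: float, x_max: float = 100.0, alpha: float = 0.75) -> float:
    """GloVe weighting f(x): rises as (x/x_max)^alpha, then saturates at 1."""
    return (x / x_max) ** alpha if x < x_max else 1.0

# f(0) = 0 (absent pairs are skipped); f caps at 1 for counts >= x_max
print(glove_weight(0.0), glove_weight(10.0), glove_weight(100.0), glove_weight(5000.0))
```

Note the continuity at $x_{\max}$: $(x_{\max}/x_{\max})^\alpha = 1$ matches the capped value exactly.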

Main Theorems

Proposition

Skip-Gram with Negative Sampling Factorizes the PMI Matrix

Statement

The skip-gram model with negative sampling (SGNS), in the limit of a large corpus, has the property that at the global optimum:

$$u_w^\top v_c = \text{PMI}(w, c) - \log K$$

where $\text{PMI}(w, c) = \log \frac{P(w, c)}{P(w)P(c)}$ is the pointwise mutual information of the word-context pair, and $K$ is the number of negative samples.

In matrix form: the optimal embeddings satisfy $UV^\top = M - \log K$ where $M_{ij} = \text{PMI}(w_i, c_j)$.

Intuition

The skip-gram model is implicitly performing a matrix factorization of the PMI matrix, shifted by a constant that depends on the number of negative samples. Words that co-occur more often than expected by chance (positive PMI) get embeddings with positive dot product. Words that co-occur less often than expected (negative PMI) get embeddings with negative dot product.

This connection unifies the "prediction-based" view (Word2Vec) with the "count-based" view (GloVe, PMI): both are doing matrix factorization of co-occurrence statistics, just with different weighting schemes.

Proof Sketch

For a fixed word-context pair $(w, c)$, the SGNS objective for that pair (averaged over the corpus and negative samples) is:

$$\#(w, c) \cdot \log \sigma(u_c^\top v_w) + K \cdot \#(w) \cdot P_n(c) \cdot \log \sigma(-u_c^\top v_w)$$

Setting the derivative with respect to $u_c^\top v_w$ to zero:

$$\#(w, c) \cdot \sigma(-u_c^\top v_w) = K \cdot \#(w) \cdot P_n(c) \cdot \sigma(u_c^\top v_w)$$

Using $P_n(c) \approx \#(c)/|D|$ (unigram distribution) and solving:

$$u_c^\top v_w = \log \frac{\#(w, c) \cdot |D|}{\#(w) \cdot \#(c)} - \log K = \text{PMI}(w, c) - \log K$$
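The stationarity condition can be checked numerically. A small sketch with made-up counts $\#(w,c)$, $\#(w)$, $\#(c)$, corpus size $|D|$, and $K$ negatives, verifying that $x = \text{PMI}(w,c) - \log K$ balances the two sides of the derivative equation:

```python
import math

# Hypothetical counts: #(w,c), #(w), #(c), corpus size |D|, negatives K
n_wc, n_w, n_c, D, K = 40.0, 500.0, 300.0, 100_000.0, 5.0

pmi = math.log(n_wc * D / (n_w * n_c))
x = pmi - math.log(K)  # claimed optimum of u_c^T v_w

sigmoid = lambda t: 1.0 / (1.0 + math.exp(-t))
lhs = n_wc * sigmoid(-x)                # #(w,c) * sigma(-x)
rhs = K * n_w * (n_c / D) * sigmoid(x)  # K * #(w) * P_n(c) * sigma(x)
print(abs(lhs - rhs) < 1e-9)  # True: the derivative vanishes at x = PMI - log K
```

Algebraically this is just $\sigma(x) = a/(a+b)$ with $a = \#(w,c)$ and $b = K\,\#(w)\,\#(c)/|D|$, i.e. $x = \log(a/b)$.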

Why It Matters

This result provides theoretical grounding for why Word2Vec works. It is not learning arbitrary correlations --- it is recovering the statistical structure of word co-occurrences in a low-rank form. The PMI matrix captures the essential information about word relationships, and the embedding dimension $d$ controls the rank of the approximation.

Failure Mode

The proof assumes the embedding dimension $d$ is large enough to exactly represent the PMI matrix, which it is not in practice ($d \approx 300$ while $|V| \approx 100{,}000$). In practice, SGNS finds a low-rank approximation to the shifted PMI matrix. Also, the noise distribution $P_n(w) \propto \text{count}(w)^{3/4}$ differs from the unigram distribution, which modifies the exact factorization relationship.
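The low-rank view is easy to demonstrate: build a PMI matrix from toy counts, shift it by $\log K$, and compare truncated-SVD reconstructions at increasing rank (truncated SVD gives the best approximation at each rank, standing in for what SGNS can express at dimension $d$). A NumPy sketch on random toy counts:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.poisson(5.0, size=(6, 6)) + 1.0     # toy co-occurrence counts (+1 avoids log 0)
D = X.sum()
p_w, p_c = X.sum(axis=1) / D, X.sum(axis=0) / D
pmi = np.log((X / D) / np.outer(p_w, p_c))  # PMI(w_i, c_j)
shifted = pmi - np.log(5)                   # shifted PMI matrix, K = 5

U, s, Vt = np.linalg.svd(shifted)
errs = [np.linalg.norm(shifted - U[:, :r] @ np.diag(s[:r]) @ Vt[:r])
        for r in (1, 3, 6)]                 # reconstruction error at rank d = 1, 3, 6
print(errs[0] > errs[1] > errs[2])  # True: larger d fits better; full rank is exact
```

The rank-6 error is zero up to floating point; at realistic scale ($d \approx 300 \ll |V|$), embeddings keep only the dominant structure of the shifted PMI matrix.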

Word Analogies and Embedding Geometry

The most striking property of word embeddings is that they encode semantic relationships as vector arithmetic:

$$v_{\text{king}} - v_{\text{man}} + v_{\text{woman}} \approx v_{\text{queen}}$$

This works because the direction $v_{\text{king}} - v_{\text{man}}$ captures the "royalty" component, and adding it to $v_{\text{woman}}$ yields the female royal. More generally, if a semantic relationship $R$ maps word $a$ to word $b$ and word $c$ to word $d$, then $v_b - v_a \approx v_d - v_c$.

This is not a built-in feature --- it emerges from the training process. The explanation: if the PMI matrix has the structure that shifting from "man" to "woman" affects co-occurrence patterns the same way as shifting from "king" to "queen," then the optimal embeddings will encode this shift as a consistent vector direction.

Canonical Examples

Example

Analogy completion

Given the query "man:king :: woman:?", find the word $w$ that maximizes:

$$\cos(v_w, \; v_{\text{king}} - v_{\text{man}} + v_{\text{woman}})$$

On standard embedding models trained on large corpora, this returns "queen" with high accuracy. Other analogy types that work: Paris:France :: Berlin:? (Germany), walking:walked :: swimming:? (swam).

Analogies work best for frequent words with clear relationships. They fail for rare words, ambiguous words, and relationships that do not correspond to linear subspaces in the embedding space.
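One practical detail: standard analogy evaluation excludes the three query words from the candidate set, since the nearest vector to $v_b - v_a + v_c$ is otherwise often $v_b$ or $v_c$ itself. A hand-built toy example where the gender and royalty directions are explicit by construction (these vectors are fabricated for illustration, not trained):

```python
import numpy as np

# Hand-crafted toy vectors; axes are roughly (male, female, royalty)
emb = {
    "man":   np.array([1.0, 0.0, 0.2]),
    "woman": np.array([0.0, 1.0, 0.2]),
    "king":  np.array([1.0, 0.0, 1.0]),
    "queen": np.array([0.0, 1.0, 1.0]),
    "apple": np.array([0.3, 0.3, -0.5]),
}

def analogy(a: str, b: str, c: str) -> str:
    """a:b :: c:? -- argmax over cos(v_w, v_b - v_a + v_c), excluding a, b, c."""
    target = emb[b] - emb[a] + emb[c]
    cos = lambda x, y: x @ y / (np.linalg.norm(x) * np.linalg.norm(y))
    return max((w for w in emb if w not in {a, b, c}),
               key=lambda w: cos(emb[w], target))

print(analogy("man", "king", "woman"))  # queen
```

With trained embeddings the match is only approximate; in this toy construction $v_{\text{king}} - v_{\text{man}} + v_{\text{woman}}$ lands on $v_{\text{queen}}$ exactly.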

Example

Nearest neighbors reveal semantic clusters

The nearest neighbors of "python" (by cosine similarity) in a Word2Vec model trained on Wikipedia include both the programming language sense (java, perl, ruby) and the animal sense (snake, reptile, viper). This illustrates a fundamental limitation of static word embeddings: each word gets a single vector regardless of the number of senses. Contextual embeddings (ELMo, BERT, GPT) address this by computing a different representation for each occurrence.

Common Confusions

Watch Out

Word2Vec is not deep learning

Word2Vec is a shallow model: one hidden layer for skip-gram, no hidden layers for CBOW (just an averaging operation). It does not use backpropagation through multiple layers. Its power comes from the training objective and the massive scale of data, not from depth. The terminology "neural word embeddings" is somewhat misleading.

Watch Out

Skip-gram and CBOW learn different things

Skip-gram predicts context from center word; CBOW predicts center word from context. They are not equivalent. Skip-gram gives each word-context pair equal weight, which helps rare words. CBOW averages context vectors, which smooths noise and trains faster. For most applications, skip-gram with negative sampling is the default choice.

Watch Out

Static embeddings have been largely superseded

Word2Vec and GloVe produce static embeddings: one vector per word type, regardless of context. Modern NLP uses contextual embeddings from transformers (BERT, GPT), where each word token gets a context-dependent representation. Static embeddings are still useful as a conceptual foundation and for resource-constrained settings. They were state-of-the-art for most NLP tasks in the pre-transformer era (2013-2017), displaced by BERT (Devlin et al. 2019) and subsequent contextual embeddings.

Summary

  • Distributional hypothesis: words are defined by their context
  • Skip-gram: predict context words from center word
  • CBOW: predict center word from averaged context vectors
  • Negative sampling approximates the softmax; typically $K = 5$ to $15$ negatives
  • SGNS implicitly factorizes the shifted PMI matrix: $u^\top v = \text{PMI} - \log K$
  • GloVe directly factorizes the co-occurrence matrix with weighted least squares
  • Word analogies ($v_{\text{king}} - v_{\text{man}} + v_{\text{woman}} \approx v_{\text{queen}}$) emerge from the geometry
  • Static embeddings give one vector per word type --- no sense disambiguation
  • Word embeddings were the bridge from sparse NLP to representation learning

Exercises

ExerciseCore

Problem

In the skip-gram model with vocabulary size $|V| = 50{,}000$ and embedding dimension $d = 300$, how many parameters does the model have? Why is the full softmax prohibitively expensive, and how does negative sampling with $K = 10$ reduce the cost per training example?

ExerciseAdvanced

Problem

Show that the SGNS objective at its optimum satisfies $u_c^\top v_w = \text{PMI}(w, c) - \log K$. Start from the per-pair objective, take the derivative, and solve.

ExerciseAdvanced

Problem

Word embeddings are known to encode societal biases present in training data (e.g., gender stereotypes in occupation words). Explain how this arises from the distributional hypothesis and the PMI factorization, and why debiasing by removing a "gender direction" from the embedding space is a mathematically principled but practically incomplete solution.

References

Canonical:

  • Mikolov et al., "Efficient Estimation of Word Representations in Vector Space" (2013) --- Word2Vec
  • Mikolov et al., "Distributed Representations of Words and Phrases and their Compositionality" (NeurIPS 2013) --- negative sampling
  • Pennington, Socher, Manning, "GloVe: Global Vectors for Word Representation" (EMNLP 2014)

Current:

  • Levy & Goldberg, "Neural Word Embedding as Implicit Matrix Factorization" (NeurIPS 2014) --- the PMI factorization result
  • Bolukbasi et al., "Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings" (NeurIPS 2016)


Last reviewed: April 2026
