ML Methods
Word Embeddings
Dense vector representations of words: Word2Vec (skip-gram, CBOW), negative sampling, GloVe, the distributional hypothesis, and why embeddings transformed NLP from sparse features to learned representations.
Why This Matters
Before word embeddings, NLP systems represented words as one-hot vectors --- sparse, high-dimensional, and devoid of semantic content. The word "king" and the word "queen" were as far apart as "king" and "refrigerator." Every NLP model had to discover semantic relationships from scratch.
Word embeddings changed this by learning dense, low-dimensional representations where semantic similarity is captured by geometric proximity. They showed that unsupervised learning on raw text could produce representations encoding rich linguistic structure --- a finding that foreshadowed the representation learning revolution leading to transformers and large language models.
Mental Model
The distributional hypothesis is the foundation: a word is characterized by the company it keeps. Words that appear in similar contexts should have similar representations. This core idea belongs to the field of distributional semantics, which formalizes meaning through co-occurrence statistics.
Word embedding methods operationalize this idea: train a model to predict context from words (or vice versa), and use the learned parameters as word representations. The prediction task is a pretext --- the real goal is the representations that emerge as a byproduct.
Think of each word as a point in $\mathbb{R}^d$ (typically $d = 100$ to $300$). Training positions the points so that words appearing in similar contexts end up nearby. The resulting geometry captures semantic structure: directions in the embedding space correspond to attributes like gender, tense, and plurality.
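To make the distributional hypothesis concrete, here is a minimal sketch of the raw statistic everything below is built on: counting which words appear near which others within a symmetric window. The function name and toy corpus are illustrative, not from any library.

```python
from collections import Counter

def cooccurrence_counts(tokens, window=2):
    """Count (center, context) pairs within a symmetric window."""
    counts = Counter()
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                counts[(center, tokens[j])] += 1
    return counts

corpus = "the king rules the realm and the queen rules the court".split()
counts = cooccurrence_counts(corpus, window=2)
# both "king" and "queen" co-occur with "rules" -- similar contexts
print(counts[("rules", "king")], counts[("rules", "queen")])
```

Word2Vec never materializes this table (it streams windows), while GloVe aggregates exactly these counts into its matrix $X$.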
Word2Vec: Skip-Gram and CBOW
Skip-Gram Model
The skip-gram model predicts context words from a center word. Given a center word $w_t$ and a context window of size $m$, the model maximizes the probability of observing the context words $w_{t+j}$ for $-m \le j \le m$ (excluding $j = 0$):

$$\frac{1}{T} \sum_{t=1}^{T} \sum_{\substack{-m \le j \le m \\ j \neq 0}} \log p(w_{t+j} \mid w_t)$$

where $T$ is the corpus size and the conditional probability $p(o \mid c)$ uses the softmax over the full vocabulary $V$:

$$p(o \mid c) = \frac{\exp(u_o^\top v_c)}{\sum_{w \in V} \exp(u_w^\top v_c)}$$

Here $v_c$ is the center embedding for word $c$ and $u_o$ is the context embedding for word $o$. Each word has two vectors; typically the center embeddings are used as the final word representations.
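The softmax above can be sketched directly in numpy; the vocabulary size, dimension, and random initialization below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 1000, 50                          # assumed vocabulary size, embedding dim
v = rng.normal(scale=0.1, size=(V, d))   # center embeddings v_c
u = rng.normal(scale=0.1, size=(V, d))   # context embeddings u_o

def p_context_given_center(o, c):
    """Softmax probability p(o | c) over the full vocabulary."""
    scores = u @ v[c]        # dot product of v_c with every context vector
    scores -= scores.max()   # stabilize the exponentials
    exp = np.exp(scores)
    return exp[o] / exp.sum()

p = p_context_given_center(o=7, c=3)
```

Note that a single probability still requires the full `u @ v[c]` product over all $|V|$ rows -- exactly the cost that negative sampling (below) avoids.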
CBOW (Continuous Bag of Words)
CBOW predicts the center word from its context. Given context words $w_{t-m}, \ldots, w_{t+m}$ (excluding $w_t$), the model predicts $w_t$:

$$p(w_t \mid \text{context}) = \frac{\exp(u_{w_t}^\top \bar{v})}{\sum_{w \in V} \exp(u_w^\top \bar{v})}$$

where $\bar{v} = \frac{1}{2m} \sum_{j \neq 0} v_{w_{t+j}}$ is the average of the context word embeddings.

CBOW is faster to train than skip-gram because it makes one prediction per context window instead of $2m$ predictions. Skip-gram tends to perform better on rare words because each occurrence produces multiple training examples.
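The averaging step that distinguishes CBOW from skip-gram is one line of numpy. A hedged sketch, with assumed sizes and random vectors standing in for trained embeddings:

```python
import numpy as np

rng = np.random.default_rng(1)
V, d = 1000, 50                          # assumed vocabulary size, embedding dim
u = rng.normal(scale=0.1, size=(V, d))   # output (center-word) embeddings
v = rng.normal(scale=0.1, size=(V, d))   # input (context) embeddings

def cbow_logits(context_ids):
    """Average the context embeddings, then score every candidate center word."""
    v_bar = v[context_ids].mean(axis=0)  # the averaged context vector
    return u @ v_bar                     # one logit per vocabulary word

logits = cbow_logits([4, 8, 15, 16])     # hypothetical context word ids
predicted = int(np.argmax(logits))       # CBOW's guess for the center word
```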
Negative Sampling
The softmax denominator sums over the entire vocabulary, which can be hundreds of thousands of words. Computing this for every training step is prohibitively expensive.
Negative Sampling
Negative sampling replaces the full softmax objective with a binary classification task. For each positive pair (center word $c$ and actual context word $o$), sample $k$ negative words from a noise distribution $P_n(w)$ (typically $P_n(w) \propto U(w)^{3/4}$, where $U$ is the unigram distribution). The objective becomes:

$$\log \sigma(u_o^\top v_c) + \sum_{i=1}^{k} \mathbb{E}_{w_i \sim P_n}\left[\log \sigma(-u_{w_i}^\top v_c)\right]$$

where $\sigma(x) = \frac{1}{1 + e^{-x}}$ is the sigmoid function. This maximizes the dot product between the center word and its actual context while minimizing it with random negative samples.

Typical values: $k = 5$ to $20$ for small datasets, $k = 2$ to $5$ for large datasets. The $3/4$ power in the noise distribution upweights rare words relative to their frequency, preventing common words from dominating.
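The binary-classification objective and the $3/4$-power noise distribution can be sketched as follows; the toy counts and random vectors are assumptions for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_loss(v_c, u_o, u_neg):
    """Negative log-likelihood of the SGNS objective for one window position:
    the real (center, context) pair is labeled 1, k noise words labeled 0."""
    pos = np.log(sigmoid(u_o @ v_c))             # pull true context closer
    neg = np.log(sigmoid(-(u_neg @ v_c))).sum()  # push k noise words away
    return -(pos + neg)

rng = np.random.default_rng(0)
d, k = 50, 5
v_c = rng.normal(scale=0.1, size=d)         # center embedding
u_o = rng.normal(scale=0.1, size=d)         # true context embedding
u_neg = rng.normal(scale=0.1, size=(k, d))  # k sampled noise embeddings
loss = sgns_loss(v_c, u_o, u_neg)

# noise distribution: unigram counts raised to the 3/4 power, then normalized
counts = np.array([100.0, 50.0, 10.0, 5.0, 1.0])  # toy word frequencies
p_noise = counts**0.75 / (counts**0.75).sum()
```

Raising counts to the $3/4$ power flattens the distribution: the most frequent word's sampling probability drops below its raw frequency share, so common words are sampled relatively less often.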
Negative sampling is not just a computational trick. Levy and Goldberg (2014) showed that the skip-gram with negative sampling objective implicitly factorizes the pointwise mutual information (PMI) matrix of word co-occurrences, shifted by $\log k$. This connects the prediction-based approach of Word2Vec to the count-based approach of GloVe.
GloVe: Global Vectors
GloVe (Global Vectors for Word Representation)
GloVe directly factorizes the word co-occurrence matrix. Let $X_{ij}$ be the number of times word $j$ appears in the context of word $i$. GloVe minimizes:

$$J = \sum_{i,j=1}^{|V|} f(X_{ij}) \left( w_i^\top \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2$$

where $w_i, \tilde{w}_j$ are word and context vectors, $b_i, \tilde{b}_j$ are bias terms, and $f$ is a weighting function that downweights very frequent co-occurrences:

$$f(x) = \begin{cases} (x / x_{\max})^{\alpha} & \text{if } x < x_{\max} \\ 1 & \text{otherwise} \end{cases}$$

with typical values $x_{\max} = 100$, $\alpha = 3/4$.
GloVe uses the global co-occurrence statistics of the corpus (the full matrix ), while Word2Vec operates on local context windows. In practice, both methods produce embeddings of similar quality.
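The weighted least-squares objective can be sketched in a few lines of numpy; the toy Poisson co-occurrence matrix and random initialization are assumptions, not a trained model.

```python
import numpy as np

def glove_weight(x, x_max=100.0, alpha=0.75):
    """Weighting f(x): downweights rare pairs, caps very frequent ones at 1."""
    return np.where(x < x_max, (x / x_max)**alpha, 1.0)

def glove_loss(W, W_tilde, b, b_tilde, X):
    """Weighted least-squares objective over nonzero co-occurrence counts."""
    i, j = np.nonzero(X)  # log X_ij is only defined where X_ij > 0
    pred = (W[i] * W_tilde[j]).sum(axis=1) + b[i] + b_tilde[j]
    err = pred - np.log(X[i, j])
    return (glove_weight(X[i, j]) * err**2).sum()

rng = np.random.default_rng(0)
V, d = 20, 8
X = rng.poisson(2.0, size=(V, V)).astype(float)  # toy co-occurrence matrix
W = rng.normal(scale=0.1, size=(V, d))           # word vectors
W_tilde = rng.normal(scale=0.1, size=(V, d))     # context vectors
b, b_tilde = np.zeros(V), np.zeros(V)            # bias terms
loss = glove_loss(W, W_tilde, b, b_tilde, X)
```

Restricting the sum to nonzero entries is also why GloVe scales well: the loss touches only observed co-occurrences, not all $|V|^2$ cells.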
Main Theorems
Skip-Gram with Negative Sampling Factorizes the PMI Matrix
Statement
The skip-gram model with negative sampling (SGNS), in the limit of a large corpus, has the property that at the global optimum:

$$v_w^\top u_c = \mathrm{PMI}(w, c) - \log k$$

where $\mathrm{PMI}(w, c) = \log \frac{p(w, c)}{p(w)\, p(c)}$ is the pointwise mutual information of the word-context pair, and $k$ is the number of negative samples.

In matrix form: the optimal embeddings satisfy $W U^\top = M$, where the rows of $W$ are the center vectors $v_w$, the rows of $U$ are the context vectors $u_c$, and $M_{wc} = \mathrm{PMI}(w, c) - \log k$.
Intuition
The skip-gram model is implicitly performing a matrix factorization of the PMI matrix, shifted by a constant that depends on the number of negative samples. Words that co-occur more often than expected by chance (positive PMI) get embeddings with positive dot product. Words that co-occur less often than expected (negative PMI) get embeddings with negative dot product.
This connection unifies the "prediction-based" view (Word2Vec) with the "count-based" view (GloVe, PMI): both are doing matrix factorization of co-occurrence statistics, just with different weighting schemes.
Proof Sketch
For a fixed word-context pair $(w, c)$, let $x = v_w^\top u_c$. The SGNS objective for that pair (averaged over the corpus and negative samples) is:

$$\ell(x) = p(w, c) \log \sigma(x) + k\, p(w)\, P_n(c) \log \sigma(-x)$$

Setting the derivative with respect to $x$ to zero:

$$p(w, c)\, \sigma(-x) - k\, p(w)\, P_n(c)\, \sigma(x) = 0$$

Using $\sigma(-x) = e^{-x} \sigma(x)$ and $P_n(c) = p(c)$ (unigram distribution) and solving:

$$x = \log \frac{p(w, c)}{p(w)\, p(c)} - \log k = \mathrm{PMI}(w, c) - \log k$$
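The optimum can be checked numerically: maximize the per-pair objective over the scalar $x$ on a fine grid and compare against $\mathrm{PMI}(w, c) - \log k$. The probabilities below are assumed toy values for a single pair.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# assumed toy statistics for one (w, c) pair
p_wc, p_w, p_c, k = 0.02, 0.05, 0.1, 5

def objective(x):
    """Per-pair SGNS objective as a function of x = v_w . u_c."""
    return p_wc * np.log(sigmoid(x)) + k * p_w * p_c * np.log(sigmoid(-x))

xs = np.linspace(-10, 10, 200001)                    # grid spacing 1e-4
x_star = xs[np.argmax(objective(xs))]                # numerical maximizer
pmi_shift = np.log(p_wc / (p_w * p_c)) - np.log(k)   # predicted optimum

print(x_star, pmi_shift)  # should agree to within the grid spacing
```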
Why It Matters
This result provides theoretical grounding for why Word2Vec works. It is not learning arbitrary correlations --- it is recovering the statistical structure of word co-occurrences in a low-rank form. The PMI matrix captures the essential information about word relationships, and the embedding dimension controls the rank of the approximation.
Failure Mode
The proof assumes the embedding dimension is large enough to exactly represent the PMI matrix, which it is not in practice ($d \approx 300$ while $|V| \approx 10^5$ or more). In practice, SGNS finds a low-rank approximation to the shifted PMI matrix. Also, the noise distribution $U(w)^{3/4}$ differs from the unigram distribution, which modifies the exact factorization relationship.
Word Analogies and Embedding Geometry
The most striking property of word embeddings is that they encode semantic relationships as vector arithmetic:

$$v_{\text{king}} - v_{\text{man}} + v_{\text{woman}} \approx v_{\text{queen}}$$

This works because the direction $v_{\text{king}} - v_{\text{man}}$ captures the "royalty" component, and adding it to $v_{\text{woman}}$ yields the female royal. More generally, if a semantic relationship maps word $a$ to word $a'$ and word $b$ to word $b'$, then $v_{a'} - v_a \approx v_{b'} - v_b$.
This is not a built-in feature --- it emerges from the training process. The explanation: if the PMI matrix has the structure that shifting from "man" to "woman" affects co-occurrence patterns the same way as shifting from "king" to "queen," then the optimal embeddings will encode this shift as a consistent vector direction.
Canonical Examples
Analogy completion
Given the query "man:king :: woman:?", find the word $w^*$ that maximizes:

$$w^* = \arg\max_{w \in V \setminus \{\text{man}, \text{king}, \text{woman}\}} \cos\left(v_w,\; v_{\text{king}} - v_{\text{man}} + v_{\text{woman}}\right)$$
On standard embedding models trained on large corpora, this returns "queen" with high accuracy. Other analogy types that work: Paris:France :: Berlin:? (Germany), walking:walked :: swimming:? (swam).
Analogies work best for frequent words with clear relationships. They fail for rare words, ambiguous words, and relationships that do not correspond to linear subspaces in the embedding space.
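A minimal sketch of analogy completion, using hand-made 3-dimensional toy vectors (the embeddings below are illustrative assumptions, not trained):

```python
import numpy as np

# toy embeddings: last coordinate ~ "female", first two ~ "royalty"
emb = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.8, 0.9]),
    "man":   np.array([0.1, 0.2, 0.1]),
    "woman": np.array([0.1, 0.2, 0.9]),
    "refrigerator": np.array([0.5, -0.9, 0.0]),
}

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def analogy(a, b, c):
    """Answer a:b :: c:? by maximizing cosine with v_b - v_a + v_c,
    excluding the three query words themselves (standard practice)."""
    target = emb[b] - emb[a] + emb[c]
    candidates = (w for w in emb if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(emb[w], target))

print(analogy("man", "king", "woman"))  # -> "queen" on these toy vectors
```

Excluding the query words matters in real models too: the raw maximizer of the cosine expression is often one of the inputs (typically $b$ or $c$).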
Nearest neighbors reveal semantic clusters
The nearest neighbors of "python" (by cosine similarity) in a Word2Vec model trained on Wikipedia include both the programming language sense (java, perl, ruby) and the animal sense (snake, reptile, viper). This illustrates a fundamental limitation of static word embeddings: each word gets a single vector regardless of the number of senses. Contextual embeddings (ELMo, BERT, GPT) address this by computing a different representation for each occurrence.
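Nearest-neighbor lookup by cosine similarity is a single matrix-vector product once the rows are unit-normalized. A sketch with a random matrix standing in for trained embeddings:

```python
import numpy as np

def nearest_neighbors(E, query_idx, topn=3):
    """Rank all rows of embedding matrix E by cosine similarity to one row."""
    En = E / np.linalg.norm(E, axis=1, keepdims=True)  # unit-normalize rows
    sims = En @ En[query_idx]                          # cosine with every word
    order = np.argsort(-sims)                          # descending similarity
    return [int(i) for i in order if i != query_idx][:topn]

rng = np.random.default_rng(0)
E = rng.normal(size=(1000, 50))   # stand-in for trained embeddings
top = nearest_neighbors(E, query_idx=42)
```

In a real model, both senses of "python" would surface in `top` because a single row of `E` must serve every occurrence of the word.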
Common Confusions
Word2Vec is not deep learning
Word2Vec is a shallow model: a single linear projection (the embedding lookup) followed by an output layer, with no nonlinear hidden layers --- CBOW adds only an averaging operation. It does not use backpropagation through a deep stack. Its power comes from the training objective and the massive scale of data, not from depth. The terminology "neural word embeddings" is somewhat misleading.
Skip-gram and CBOW learn different things
Skip-gram predicts context from center word; CBOW predicts center word from context. They are not equivalent. Skip-gram gives each word-context pair equal weight, which helps rare words. CBOW averages context vectors, which smooths noise and trains faster. For most applications, skip-gram with negative sampling is the default choice.
Static embeddings have been largely superseded
Word2Vec and GloVe produce static embeddings: one vector per word type, regardless of context. Modern NLP uses contextual embeddings from transformers (BERT, GPT), where each word token gets a context-dependent representation. Static embeddings are still useful as a conceptual foundation and for resource-constrained settings. They were state-of-the-art for most NLP tasks in the pre-transformer era (2013-2017), displaced by BERT (Devlin et al. 2019) and subsequent contextual embeddings.
Summary
- Distributional hypothesis: words are defined by their context
- Skip-gram: predict context words from center word
- CBOW: predict center word from averaged context vectors
- Negative sampling approximates the softmax; typically $k = 2$ to $20$ negatives
- SGNS implicitly factorizes the shifted PMI matrix: $v_w^\top u_c = \mathrm{PMI}(w, c) - \log k$
- GloVe directly factorizes the co-occurrence matrix with weighted least squares
- Word analogies ($v_{\text{king}} - v_{\text{man}} + v_{\text{woman}} \approx v_{\text{queen}}$) emerge from the geometry
- Static embeddings give one vector per word type --- no sense disambiguation
- Word embeddings were the bridge from sparse NLP to representation learning
Exercises
Problem
In the skip-gram model with vocabulary size $|V|$ and embedding dimension $d$, how many parameters does the model have? Why is the full softmax prohibitively expensive, and how does negative sampling with $k$ negatives reduce the cost per training example?
Problem
Show that the SGNS objective at its optimum satisfies $v_w^\top u_c = \mathrm{PMI}(w, c) - \log k$. Start from the per-pair objective, take the derivative, and solve.
Problem
Word embeddings are known to encode societal biases present in training data (e.g., gender stereotypes in occupation words). Explain how this arises from the distributional hypothesis and the PMI factorization, and why debiasing by removing a "gender direction" from the embedding space is a mathematically principled but practically incomplete solution.
References
Canonical:
- Mikolov et al., "Efficient Estimation of Word Representations in Vector Space" (2013) --- Word2Vec
- Mikolov et al., "Distributed Representations of Words and Phrases and their Compositionality" (NeurIPS 2013) --- negative sampling
- Pennington, Socher, Manning, "GloVe: Global Vectors for Word Representation" (EMNLP 2014)
Current:
- Levy & Goldberg, "Neural Word Embedding as Implicit Matrix Factorization" (NeurIPS 2014) --- the PMI factorization result
- Bolukbasi et al., "Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings" (NeurIPS 2016)
Next Topics
From word embeddings, the natural next steps are:
- Attention mechanism theory: the mechanism that replaced static embeddings with contextual representations
- Transformer architecture: the architecture that scaled contextual embeddings to modern LLMs
Last reviewed: April 2026