ML Methods
Word Embeddings
Dense vector representations of words: Word2Vec (skip-gram, CBOW), negative sampling, GloVe, the distributional hypothesis, and why embeddings transformed NLP from sparse features to learned representations.
Why This Matters
Before word embeddings, NLP systems represented words as one-hot vectors --- sparse, high-dimensional, and devoid of semantic content. The word "king" and the word "queen" were as far apart as "king" and "refrigerator." Every NLP model had to discover semantic relationships from scratch.
Word embeddings changed this by learning dense, low-dimensional representations where semantic similarity is captured by geometric proximity. They showed that unsupervised learning on raw text could produce representations encoding rich linguistic structure --- a finding that foreshadowed the representation learning revolution leading to transformers and large language models.
Mental Model
The distributional hypothesis is the foundation: a word is characterized by the company it keeps. Words that appear in similar contexts should have similar representations. This core idea belongs to the field of distributional semantics, which formalizes meaning through co-occurrence statistics.
Word embedding methods operationalize this idea: train a model to predict context from words (or vice versa), and use the learned parameters as word representations. The prediction task is a pretext --- the real goal is the representations that emerge as a byproduct.
Think of each word as a point in $\mathbb{R}^d$ (typically $d = 100$ to $300$). Training positions the points so that words appearing in similar contexts end up nearby. The resulting geometry captures semantic structure: directions in the embedding space correspond to attributes like gender, tense, and plurality.
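To make the distributional hypothesis concrete, here is a minimal sketch of the raw statistic everything below is built on: counting which words appear near which others within a symmetric window. The function name and toy corpus are illustrative, not from any library.

```python
from collections import Counter

def cooccurrence_counts(tokens, window=2):
    """Count (center, context) pairs within a symmetric window."""
    counts = Counter()
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                counts[(center, tokens[j])] += 1
    return counts

corpus = "the king rules the realm and the queen rules the court".split()
counts = cooccurrence_counts(corpus, window=2)
# both "king" and "queen" co-occur with "rules" -- similar contexts
print(counts[("rules", "king")], counts[("rules", "queen")])
```

Word2Vec never materializes this table (it streams windows), while GloVe aggregates exactly these counts into its matrix $X$.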
Word2Vec: Skip-Gram and CBOW
Skip-Gram Model
The skip-gram model predicts context words from a center word. Given a center word $w_t$ and a context window of size $m$, the model maximizes the probability of observing the context words $w_{t+j}$ for $-m \le j \le m$ (excluding $j = 0$):

$$\frac{1}{T} \sum_{t=1}^{T} \sum_{\substack{-m \le j \le m \\ j \neq 0}} \log p(w_{t+j} \mid w_t)$$

where $T$ is the corpus size and the conditional probability $p(o \mid c)$ uses the softmax over the full vocabulary $V$:

$$p(o \mid c) = \frac{\exp(u_o^\top v_c)}{\sum_{w \in V} \exp(u_w^\top v_c)}$$

Here $v_c$ is the center embedding for word $c$ and $u_o$ is the context embedding for word $o$. Each word has two vectors; typically the center embeddings are used as the final word representations.
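The softmax above can be sketched directly in numpy; the vocabulary size, dimension, and random initialization below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 1000, 50                          # assumed vocabulary size, embedding dim
v = rng.normal(scale=0.1, size=(V, d))   # center embeddings v_c
u = rng.normal(scale=0.1, size=(V, d))   # context embeddings u_o

def p_context_given_center(o, c):
    """Softmax probability p(o | c) over the full vocabulary."""
    scores = u @ v[c]        # dot product of v_c with every context vector
    scores -= scores.max()   # stabilize the exponentials
    exp = np.exp(scores)
    return exp[o] / exp.sum()

p = p_context_given_center(o=7, c=3)
```

Note that a single probability still requires the full `u @ v[c]` product over all $|V|$ rows -- exactly the cost that negative sampling (below) avoids.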
CBOW (Continuous Bag of Words)
CBOW predicts the center word from its context. Given context words $w_{t-m}, \ldots, w_{t+m}$ (excluding $w_t$), the model predicts $w_t$:

$$p(w_t \mid \text{context}) = \frac{\exp(u_{w_t}^\top \bar{v})}{\sum_{w \in V} \exp(u_w^\top \bar{v})}$$

where $\bar{v} = \frac{1}{2m} \sum_{j \neq 0} v_{w_{t+j}}$ is the average of the context word embeddings.

CBOW is faster to train than skip-gram because it makes one prediction per context window instead of $2m$ predictions. Skip-gram tends to perform better on rare words because each occurrence produces multiple training examples.
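The averaging step that distinguishes CBOW from skip-gram is one line of numpy. A hedged sketch, with assumed sizes and random vectors standing in for trained embeddings:

```python
import numpy as np

rng = np.random.default_rng(1)
V, d = 1000, 50                          # assumed vocabulary size, embedding dim
u = rng.normal(scale=0.1, size=(V, d))   # output (center-word) embeddings
v = rng.normal(scale=0.1, size=(V, d))   # input (context) embeddings

def cbow_logits(context_ids):
    """Average the context embeddings, then score every candidate center word."""
    v_bar = v[context_ids].mean(axis=0)  # the averaged context vector
    return u @ v_bar                     # one logit per vocabulary word

logits = cbow_logits([4, 8, 15, 16])     # hypothetical context word ids
predicted = int(np.argmax(logits))       # CBOW's guess for the center word
```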
Negative Sampling
The softmax denominator sums over the entire vocabulary, which can be hundreds of thousands of words. Computing this for every training step is prohibitively expensive.
Negative Sampling
Negative sampling replaces the full softmax objective with a binary classification task. For each positive pair (center word $c$ and actual context word $o$), sample $k$ negative words from a noise distribution $P_n(w)$ (typically $P_n(w) \propto U(w)^{3/4}$, where $U$ is the unigram distribution). The objective becomes:

$$\log \sigma(u_o^\top v_c) + \sum_{i=1}^{k} \mathbb{E}_{w_i \sim P_n}\left[\log \sigma(-u_{w_i}^\top v_c)\right]$$

where $\sigma(x) = \frac{1}{1 + e^{-x}}$ is the sigmoid function. This maximizes the dot product between the center word and its actual context while minimizing it with random negative samples.

Typical values: $k = 5$ to $20$ for small datasets, $k = 2$ to $5$ for large datasets. The $3/4$ power in the noise distribution upweights rare words relative to their frequency, preventing common words from dominating.
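The binary-classification objective and the $3/4$-power noise distribution can be sketched as follows; the toy counts and random vectors are assumptions for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_loss(v_c, u_o, u_neg):
    """Negative log-likelihood of the SGNS objective for one window position:
    the real (center, context) pair is labeled 1, k noise words labeled 0."""
    pos = np.log(sigmoid(u_o @ v_c))             # pull true context closer
    neg = np.log(sigmoid(-(u_neg @ v_c))).sum()  # push k noise words away
    return -(pos + neg)

rng = np.random.default_rng(0)
d, k = 50, 5
v_c = rng.normal(scale=0.1, size=d)         # center embedding
u_o = rng.normal(scale=0.1, size=d)         # true context embedding
u_neg = rng.normal(scale=0.1, size=(k, d))  # k sampled noise embeddings
loss = sgns_loss(v_c, u_o, u_neg)

# noise distribution: unigram counts raised to the 3/4 power, then normalized
counts = np.array([100.0, 50.0, 10.0, 5.0, 1.0])  # toy word frequencies
p_noise = counts**0.75 / (counts**0.75).sum()
```

Raising counts to the $3/4$ power flattens the distribution: the most frequent word's sampling probability drops below its raw frequency share, so common words are sampled relatively less often.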
Negative sampling is not just a computational trick. Levy and Goldberg (2014) showed that the skip-gram with negative sampling objective implicitly factorizes the pointwise mutual information (PMI) matrix of word co-occurrences, shifted by $\log k$. This connects the prediction-based approach of Word2Vec to the count-based approach of GloVe.
GloVe: Global Vectors
GloVe (Global Vectors for Word Representation)
GloVe directly factorizes the word co-occurrence matrix. Let $X_{ij}$ be the number of times word $j$ appears in the context of word $i$. GloVe minimizes:

$$J = \sum_{i,j=1}^{|V|} f(X_{ij}) \left( w_i^\top \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2$$

where $w_i, \tilde{w}_j$ are word and context vectors, $b_i, \tilde{b}_j$ are bias terms, and $f$ is a weighting function that downweights very frequent co-occurrences:

$$f(x) = \begin{cases} (x / x_{\max})^{\alpha} & \text{if } x < x_{\max} \\ 1 & \text{otherwise} \end{cases}$$

with typical values $x_{\max} = 100$, $\alpha = 3/4$.
GloVe uses the global co-occurrence statistics of the corpus (the full matrix ), while Word2Vec operates on local context windows. In practice, both methods produce embeddings of similar quality.
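The weighted least-squares objective can be sketched in a few lines of numpy; the toy Poisson co-occurrence matrix and random initialization are assumptions, not a trained model.

```python
import numpy as np

def glove_weight(x, x_max=100.0, alpha=0.75):
    """Weighting f(x): downweights rare pairs, caps very frequent ones at 1."""
    return np.where(x < x_max, (x / x_max)**alpha, 1.0)

def glove_loss(W, W_tilde, b, b_tilde, X):
    """Weighted least-squares objective over nonzero co-occurrence counts."""
    i, j = np.nonzero(X)  # log X_ij is only defined where X_ij > 0
    pred = (W[i] * W_tilde[j]).sum(axis=1) + b[i] + b_tilde[j]
    err = pred - np.log(X[i, j])
    return (glove_weight(X[i, j]) * err**2).sum()

rng = np.random.default_rng(0)
V, d = 20, 8
X = rng.poisson(2.0, size=(V, V)).astype(float)  # toy co-occurrence matrix
W = rng.normal(scale=0.1, size=(V, d))           # word vectors
W_tilde = rng.normal(scale=0.1, size=(V, d))     # context vectors
b, b_tilde = np.zeros(V), np.zeros(V)            # bias terms
loss = glove_loss(W, W_tilde, b, b_tilde, X)
```

Restricting the sum to nonzero entries is also why GloVe scales well: the loss touches only observed co-occurrences, not all $|V|^2$ cells.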
Main Theorems
Skip-Gram with Negative Sampling Factorizes the PMI Matrix
Statement
The skip-gram model with negative sampling (SGNS), in the limit of a large corpus, has the property that at the global optimum:

$$v_w^\top u_c = \mathrm{PMI}(w, c) - \log k$$

where $\mathrm{PMI}(w, c) = \log \frac{p(w, c)}{p(w)\, p(c)}$ is the pointwise mutual information of the word-context pair, and $k$ is the number of negative samples.

In matrix form: the optimal embeddings satisfy $W U^\top = M$, where the rows of $W$ are the center vectors $v_w$, the rows of $U$ are the context vectors $u_c$, and $M_{wc} = \mathrm{PMI}(w, c) - \log k$.
Intuition
The skip-gram model is implicitly performing a matrix factorization of the PMI matrix, shifted by a constant that depends on the number of negative samples. Words that co-occur more often than expected by chance (positive PMI) get embeddings with positive dot product. Words that co-occur less often than expected (negative PMI) get embeddings with negative dot product.
This connection unifies the "prediction-based" view (Word2Vec) with the "count-based" view (GloVe, PMI): both are doing matrix factorization of co-occurrence statistics, just with different weighting schemes.
Proof Sketch
For a fixed word-context pair $(w, c)$, let $x = v_w^\top u_c$. The SGNS objective for that pair (averaged over the corpus and negative samples) is:

$$\ell(x) = p(w, c) \log \sigma(x) + k\, p(w)\, P_n(c) \log \sigma(-x)$$

Setting the derivative with respect to $x$ to zero:

$$p(w, c)\, \sigma(-x) - k\, p(w)\, P_n(c)\, \sigma(x) = 0$$

Using $\sigma(-x) = e^{-x} \sigma(x)$ and $P_n(c) = p(c)$ (unigram distribution) and solving:

$$x = \log \frac{p(w, c)}{p(w)\, p(c)} - \log k = \mathrm{PMI}(w, c) - \log k$$
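The optimum can be checked numerically: maximize the per-pair objective over the scalar $x$ on a fine grid and compare against $\mathrm{PMI}(w, c) - \log k$. The probabilities below are assumed toy values for a single pair.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# assumed toy statistics for one (w, c) pair
p_wc, p_w, p_c, k = 0.02, 0.05, 0.1, 5

def objective(x):
    """Per-pair SGNS objective as a function of x = v_w . u_c."""
    return p_wc * np.log(sigmoid(x)) + k * p_w * p_c * np.log(sigmoid(-x))

xs = np.linspace(-10, 10, 200001)                    # grid spacing 1e-4
x_star = xs[np.argmax(objective(xs))]                # numerical maximizer
pmi_shift = np.log(p_wc / (p_w * p_c)) - np.log(k)   # predicted optimum

print(x_star, pmi_shift)  # should agree to within the grid spacing
```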
Why It Matters
This result provides theoretical grounding for why Word2Vec works. It is not learning arbitrary correlations --- it is recovering the statistical structure of word co-occurrences in a low-rank form. The PMI matrix captures the essential information about word relationships, and the embedding dimension controls the rank of the approximation.
Failure Mode
The proof assumes the embedding dimension is large enough to exactly represent the PMI matrix, which it is not in practice ($d \approx 300$ while $|V| \approx 10^5$ or more). In practice, SGNS finds a low-rank approximation to the shifted PMI matrix. Also, the noise distribution $U(w)^{3/4}$ differs from the unigram distribution, which modifies the exact factorization relationship.
Word Analogies and Embedding Geometry
The most striking property of word embeddings is that they encode semantic relationships as vector arithmetic:

$$v_{\text{king}} - v_{\text{man}} + v_{\text{woman}} \approx v_{\text{queen}}$$

This works because the direction $v_{\text{king}} - v_{\text{man}}$ captures the "royalty" component, and adding it to $v_{\text{woman}}$ yields the female royal. More generally, if a semantic relationship maps word $a$ to word $a'$ and word $b$ to word $b'$, then $v_{a'} - v_a \approx v_{b'} - v_b$.
This is not a built-in feature --- it emerges from the training process. The explanation: if the PMI matrix has the structure that shifting from "man" to "woman" affects co-occurrence patterns the same way as shifting from "king" to "queen," then the optimal embeddings will encode this shift as a consistent vector direction.
Canonical Examples
Analogy completion
Given the query "man:king :: woman:?", find the word $w^*$ that maximizes:

$$w^* = \arg\max_{w \in V \setminus \{\text{man}, \text{king}, \text{woman}\}} \cos\left(v_w,\; v_{\text{king}} - v_{\text{man}} + v_{\text{woman}}\right)$$
On standard embedding models trained on large corpora, this returns "queen" with high accuracy. Other analogy types that work: Paris:France :: Berlin:? (Germany), walking:walked :: swimming:? (swam).
Analogies work best for frequent words with clear relationships. They fail for rare words, ambiguous words, and relationships that do not correspond to linear subspaces in the embedding space.
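A minimal sketch of analogy completion, using hand-made 3-dimensional toy vectors (the embeddings below are illustrative assumptions, not trained):

```python
import numpy as np

# toy embeddings: last coordinate ~ "female", first two ~ "royalty"
emb = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.8, 0.9]),
    "man":   np.array([0.1, 0.2, 0.1]),
    "woman": np.array([0.1, 0.2, 0.9]),
    "refrigerator": np.array([0.5, -0.9, 0.0]),
}

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def analogy(a, b, c):
    """Answer a:b :: c:? by maximizing cosine with v_b - v_a + v_c,
    excluding the three query words themselves (standard practice)."""
    target = emb[b] - emb[a] + emb[c]
    candidates = (w for w in emb if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(emb[w], target))

print(analogy("man", "king", "woman"))  # -> "queen" on these toy vectors
```

Excluding the query words matters in real models too: the raw maximizer of the cosine expression is often one of the inputs (typically $b$ or $c$).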
Nearest neighbors reveal semantic clusters
The nearest neighbors of "python" (by cosine similarity) in a Word2Vec model trained on Wikipedia include both the programming language sense (java, perl, ruby) and the animal sense (snake, reptile, viper). This illustrates a fundamental limitation of static word embeddings: each word gets a single vector regardless of the number of senses. Contextual embeddings (ELMo, BERT, GPT) address this by computing a different representation for each occurrence.
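Nearest-neighbor lookup by cosine similarity is a single matrix-vector product once the rows are unit-normalized. A sketch with a random matrix standing in for trained embeddings:

```python
import numpy as np

def nearest_neighbors(E, query_idx, topn=3):
    """Rank all rows of embedding matrix E by cosine similarity to one row."""
    En = E / np.linalg.norm(E, axis=1, keepdims=True)  # unit-normalize rows
    sims = En @ En[query_idx]                          # cosine with every word
    order = np.argsort(-sims)                          # descending similarity
    return [int(i) for i in order if i != query_idx][:topn]

rng = np.random.default_rng(0)
E = rng.normal(size=(1000, 50))   # stand-in for trained embeddings
top = nearest_neighbors(E, query_idx=42)
```

In a real model, both senses of "python" would surface in `top` because a single row of `E` must serve every occurrence of the word.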
Common Confusions
Word2Vec is not deep learning
Word2Vec is a shallow model: a single linear projection (the embedding lookup) followed by an output layer, with no nonlinear hidden layers --- CBOW adds only an averaging operation. It does not use backpropagation through a deep stack. Its power comes from the training objective and the massive scale of data, not from depth. The terminology "neural word embeddings" is somewhat misleading.
Skip-gram and CBOW learn different things
Skip-gram predicts context from center word; CBOW predicts center word from context. They are not equivalent. Skip-gram gives each word-context pair equal weight, which helps rare words. CBOW averages context vectors, which smooths noise and trains faster. For most applications, skip-gram with negative sampling is the default choice.
Static embeddings have been largely superseded
Word2Vec and GloVe produce static embeddings: one vector per word type, regardless of context. Modern NLP uses contextual embeddings from transformers (BERT, GPT), where each word token gets a context-dependent representation. Static embeddings are still useful as a conceptual foundation and for resource-constrained settings. They were state-of-the-art for most NLP tasks in the pre-transformer era (2013-2017), displaced by BERT (Devlin et al. 2019) and subsequent contextual embeddings.
Summary
- Distributional hypothesis: words are defined by their context
- Skip-gram: predict context words from center word
- CBOW: predict center word from averaged context vectors
- Negative sampling approximates the softmax; typically $k = 2$ to $20$ negatives
- SGNS implicitly factorizes the shifted PMI matrix: $v_w^\top u_c = \mathrm{PMI}(w, c) - \log k$
- GloVe directly factorizes the co-occurrence matrix with weighted least squares
- Word analogies ($v_{\text{king}} - v_{\text{man}} + v_{\text{woman}} \approx v_{\text{queen}}$) emerge from the geometry
- Static embeddings give one vector per word type --- no sense disambiguation
- Word embeddings were the bridge from sparse NLP to representation learning
Exercises
Problem
In the skip-gram model with vocabulary size $|V|$ and embedding dimension $d$, how many parameters does the model have? Why is the full softmax prohibitively expensive, and how does negative sampling with $k$ negatives reduce the cost per training example?
Problem
Show that the SGNS objective at its optimum satisfies $v_w^\top u_c = \mathrm{PMI}(w, c) - \log k$. Start from the per-pair objective, take the derivative, and solve.
Problem
Word embeddings are known to encode societal biases present in training data (e.g., gender stereotypes in occupation words). Explain how this arises from the distributional hypothesis and the PMI factorization, and why debiasing by removing a "gender direction" from the embedding space is a mathematically principled but practically incomplete solution.
References
Canonical:
- Mikolov et al., "Efficient Estimation of Word Representations in Vector Space" (2013) --- Word2Vec
- Mikolov et al., "Distributed Representations of Words and Phrases and their Compositionality" (NeurIPS 2013) --- negative sampling
- Pennington, Socher, Manning, "GloVe: Global Vectors for Word Representation" (EMNLP 2014)
Current:
- Levy & Goldberg, "Neural Word Embedding as Implicit Matrix Factorization" (NeurIPS 2014) --- the PMI factorization result
- Bolukbasi et al., "Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings" (NeurIPS 2016)
Next Topics
From word embeddings, the natural next steps are:
- Attention mechanism theory: the mechanism that replaced static embeddings with contextual representations
- Transformer architecture: the architecture that scaled contextual embeddings to modern LLMs
Last reviewed: April 2026