
ML Methods

NMF (Nonnegative Matrix Factorization)

Factor V into W*H with all entries nonnegative: parts-based additive representation, multiplicative update rules, and applications to topic modeling and image decomposition.


Why This Matters

Given a data matrix $V \in \mathbb{R}_{\geq 0}^{n \times p}$ where all entries are nonnegative (e.g., word counts, pixel intensities, gene expression levels), NMF finds a low-rank factorization $V \approx WH$ where both factors are also nonnegative: $W \in \mathbb{R}_{\geq 0}^{n \times r}$ and $H \in \mathbb{R}_{\geq 0}^{r \times p}$.

The nonnegativity constraint changes the nature of the factorization. PCA allows negative values, so principal components can cancel each other: a face might be represented as "average face minus nose plus eyes." NMF forces an additive, parts-based representation: a face is represented as "some amount of nose + some amount of eyes + some amount of mouth." Each component adds to the reconstruction; nothing subtracts.

This parts-based interpretation makes NMF particularly useful for topic modeling (each topic is a nonnegative combination of words), image analysis (each component is a recognizable part), and genomics (each factor represents a biological process with nonnegative expression).
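In practice the factorization takes a few lines. The sketch below assumes scikit-learn is available; the matrix and the choice of rank $r = 2$ are illustrative, not from the text.

```python
# Minimal NMF sketch, assuming scikit-learn; V and r=2 are illustrative.
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
V = rng.random((6, 4))          # nonnegative data matrix (n=6, p=4)

model = NMF(n_components=2, init="random", random_state=0, max_iter=500)
W = model.fit_transform(V)      # n x r, nonnegative
H = model.components_           # r x p, nonnegative

print(W.shape, H.shape)         # (6, 2) (2, 4)
print((W >= 0).all() and (H >= 0).all())  # True
```

Both factors come back nonnegative by construction, so each row of $V$ is reconstructed as an additive combination of the rows of $H$.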

Mental Model

Think of NMF as decomposing a recipe into ingredients. A dish is a nonnegative combination of ingredients (you cannot add negative flour). Each row of $H$ is an "ingredient" (a basis pattern), and each row of $W$ specifies how much of each ingredient goes into the corresponding observation. Every observation is built by adding ingredients together, never subtracting.

Formal Setup

Definition

Nonnegative Matrix Factorization

Given $V \in \mathbb{R}_{\geq 0}^{n \times p}$ and a target rank $r$, NMF finds:

$$\min_{W, H \geq 0} \|V - WH\|_F^2$$

where $W \in \mathbb{R}_{\geq 0}^{n \times r}$, $H \in \mathbb{R}_{\geq 0}^{r \times p}$, and $\|\cdot\|_F$ is the Frobenius norm. The constraint $W, H \geq 0$ means all entries are nonnegative.

Definition

Divergence-Based NMF

An alternative objective uses generalized KL divergence instead of squared error:

$$D_{\text{KL}}(V \| WH) = \sum_{ij} \left( V_{ij} \log \frac{V_{ij}}{(WH)_{ij}} - V_{ij} + (WH)_{ij} \right)$$

This objective is appropriate when the data has Poisson-like statistics (e.g., count data). The choice of objective affects the update rules but not the structural properties of the factorization.
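As a sanity check, the divergence above can be computed directly from the definition. This is a minimal numpy sketch; the function name `gen_kl` and the `eps` guard against $\log 0$ are illustrative choices, not part of the text.

```python
# Generalized KL divergence D(V || WH) from the definition above.
# gen_kl and the eps guard are illustrative assumptions of this sketch.
import numpy as np

def gen_kl(V, WH, eps=1e-12):
    """Generalized KL divergence; eps guards against log(0)."""
    return np.sum(V * np.log((V + eps) / (WH + eps)) - V + WH)

rng = np.random.default_rng(0)
V = rng.random((5, 3)) + 0.1
W = rng.random((5, 2)) + 0.1
H = rng.random((2, 3)) + 0.1

print(gen_kl(V, W @ H))               # nonnegative; zero iff WH matches V
print(np.isclose(gen_kl(V, V), 0.0))  # True
```

Unlike standard KL, the generalized form does not require rows to be normalized distributions, which is why it suits raw count matrices.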

Multiplicative Update Rules

Lee and Seung (1999, 2001) introduced multiplicative update rules that maintain nonnegativity automatically. For the Frobenius norm objective:

$$W_{ik} \leftarrow W_{ik} \cdot \frac{(VH^\top)_{ik}}{(WHH^\top)_{ik}}, \quad H_{kj} \leftarrow H_{kj} \cdot \frac{(W^\top V)_{kj}}{(W^\top WH)_{kj}}$$

These updates preserve nonnegativity automatically: since $V$, $W$, and $H$ are all nonnegative, the numerators and denominators are nonnegative. Multiplying a nonnegative value by a nonnegative ratio keeps it nonnegative. No projection step is needed.
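The update rules translate almost verbatim into numpy. The following is a minimal sketch, not a production implementation; the small `eps` added to each denominator to avoid division by zero is an assumption of this sketch.

```python
# Lee-Seung multiplicative updates for the Frobenius objective.
# Minimal sketch; eps guarding division by zero is an added assumption.
import numpy as np

def nmf_multiplicative(V, r, n_iter=200, seed=0, eps=1e-10):
    rng = np.random.default_rng(seed)
    n, p = V.shape
    W = rng.random((n, r)) + 0.1   # positive init avoids zero-locking
    H = rng.random((r, p)) + 0.1
    losses = []
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)   # H update
        W *= (V @ H.T) / (W @ H @ H.T + eps)   # W update
        losses.append(np.linalg.norm(V - W @ H, "fro") ** 2)
    return W, H, losses

rng = np.random.default_rng(1)
V = rng.random((8, 6))
W, H, losses = nmf_multiplicative(V, r=3)
# The objective is non-increasing across iterations:
print(all(b <= a + 1e-9 for a, b in zip(losses, losses[1:])))  # True
```

Note that no learning rate appears anywhere: the step size is implicit in the ratio, which is the point of the multiplicative form.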

Main Theorems

Theorem

Monotonic Convergence of Multiplicative Updates

Statement

The Frobenius norm objective is non-increasing under the multiplicative updates for $W$ and $H$:

$$\|V - W^{(t+1)}H^{(t+1)}\|_F^2 \leq \|V - W^{(t)}H^{(t)}\|_F^2$$

for every iteration $t$. The objective values converge, and any limit point of the sequence $(W^{(t)}, H^{(t)})$ is a fixed point of the update equations; under mild additional conditions, such limit points are stationary points of the constrained optimization problem.

Intuition

Each multiplicative update can be derived as a majorization-minimization (MM) step. The update constructs an auxiliary function that upper-bounds the objective and touches it at the current point. Minimizing the auxiliary function (which has a closed-form solution) is guaranteed to decrease the original objective. The multiplicative form arises from the specific choice of auxiliary function.

Proof Sketch

Define an auxiliary function $G(H, H^{(t)})$ that satisfies $G(H, H^{(t)}) \geq \|V - WH\|_F^2$ with equality when $H = H^{(t)}$. For the Frobenius objective, $G$ is a quadratic expansion around $H^{(t)}$ in which the Hessian $W^\top W$ is replaced by a diagonal matrix with entries $(W^\top W H^{(t)})_{kj} / H^{(t)}_{kj}$; this diagonal matrix dominates $W^\top W$, so $G$ upper-bounds the objective. Minimizing $G$ over $H$ has a closed-form solution, which is exactly the multiplicative update rule. Monotonic decrease of the objective then follows from the MM framework. (For the KL objective, the bound instead comes from Jensen's inequality applied to the convex function $-\log$.)

Why It Matters

Monotonic convergence guarantees that the algorithm does not oscillate or diverge. Combined with the automatic maintenance of nonnegativity, the multiplicative updates make NMF simple to implement and reliable in practice. No step-size tuning is required (unlike gradient descent).

Failure Mode

Convergence is to a stationary point, not a global minimum. The NMF objective is nonconvex (bilinear in $W$ and $H$), so different initializations lead to different solutions. Multiple restarts with random initialization are standard practice. Additionally, if any entry of $W$ or $H$ reaches exactly zero, it stays at zero forever (the multiplicative update cannot escape zero). This can cause premature convergence to suboptimal solutions.
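The zero-locking behavior is easy to demonstrate: because the update multiplies each entry by a finite ratio, an entry that is exactly zero is multiplied by that ratio and stays zero. A minimal numpy sketch (the matrices and the forced zero are illustrative):

```python
# Zero-locking demo: an entry of W forced to exactly zero never escapes,
# since the multiplicative update scales it by a finite ratio.
import numpy as np

rng = np.random.default_rng(0)
V = rng.random((5, 4))
W = rng.random((5, 2))
H = rng.random((2, 4))
W[0, 0] = 0.0                     # force one entry to exactly zero

eps = 1e-10
for _ in range(50):
    H *= (W.T @ V) / (W.T @ W @ H + eps)
    W *= (V @ H.T) / (W @ H @ H.T + eps)

print(W[0, 0])  # 0.0 -- the entry stays locked at zero
```

This is why positive (rather than merely nonnegative) random initialization is the usual recommendation.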

NMF vs. PCA

| Property | PCA | NMF |
| --- | --- | --- |
| Sign constraint | None (components can be negative) | All nonnegative |
| Interpretation | Orthogonal directions of variance | Additive parts |
| Uniqueness | Unique (up to sign/rotation) | Not unique in general |
| Objective | Maximize variance (eigenvalue problem) | Minimize reconstruction error (iterative) |
| Components | Orthogonal | Not necessarily orthogonal |
| Cancellation | Components can cancel each other | Components only add |

The key difference: PCA finds directions of maximum variance and produces components that are globally ordered by explained variance. NMF finds nonnegative parts that are locally interpretable. PCA components often mix positive and negative contributions, making individual components hard to interpret. NMF components are always additive and typically correspond to recognizable parts or patterns.
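The sign difference can be checked directly. This sketch assumes scikit-learn; the data is random and illustrative. Even on all-nonnegative data, PCA's second and later components must contain negative entries (they are orthogonal to the first), while NMF components never do.

```python
# Contrast of component signs on nonnegative data, assuming scikit-learn.
import numpy as np
from sklearn.decomposition import PCA, NMF

rng = np.random.default_rng(0)
V = rng.random((30, 10))        # nonnegative data

pca = PCA(n_components=3).fit(V)
nmf = NMF(n_components=3, init="random", random_state=0,
          max_iter=500).fit(V)

print((pca.components_ < 0).any())   # True: PCA components mix signs
print((nmf.components_ >= 0).all())  # True: NMF components are nonnegative
```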

Applications

Topic modeling: $V$ is a document-term matrix (documents $\times$ words). $W$ is a document-topic matrix. $H$ is a topic-word matrix. Each topic (row of $H$) is a nonnegative distribution over words. Each document (row of $W$) is a nonnegative mixture of topics. NMF discovers topics without negative word weights, producing directly interpretable results.
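A toy version of this setup, assuming scikit-learn; the 6-document, 8-word count matrix and the two implied topics are invented for illustration. The KL objective (`beta_loss="kullback-leibler"` with `solver="mu"`) matches the count-data motivation above.

```python
# Toy topic-model sketch, assuming scikit-learn; the count matrix and
# its two implied topics are illustrative assumptions.
import numpy as np
from sklearn.decomposition import NMF

V = np.array([
    [4, 3, 0, 0, 1, 0, 2, 0],   # documents weighted toward words 0,1,6
    [5, 2, 1, 0, 0, 0, 3, 0],
    [3, 4, 0, 1, 0, 0, 2, 1],
    [0, 0, 4, 5, 0, 3, 0, 2],   # documents weighted toward words 2,3,5,7
    [0, 1, 3, 4, 1, 2, 0, 3],
    [1, 0, 5, 3, 0, 4, 0, 2],
], dtype=float)

model = NMF(n_components=2, solver="mu", beta_loss="kullback-leibler",
            init="nndsvda", random_state=0, max_iter=1000)
W = model.fit_transform(V)   # document-topic weights (6 x 2)
H = model.components_        # topic-word weights   (2 x 8)
print(W.shape, H.shape)      # (6, 2) (2, 8)
```

Rows of `H` play the role of topics; rows of `W` give each document's nonnegative mixture over them.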

Image decomposition: $V$ is a matrix of face images (each row is a flattened image). NMF learns parts: noses, eyes, mouths, and other facial features as nonnegative basis images. Reconstruction is an additive overlay of these parts.

Gene expression: $V$ is a genes-by-samples matrix. NMF identifies metagenes (groups of co-expressed genes) and their activation levels across samples. The nonnegativity constraint is natural because gene expression levels are nonnegative.

Watch Out

NMF does not always produce parts-based representations

The parts-based property depends on the data and the rank $r$. If $r$ is too large, the factors may not be interpretable. If the data does not have a natural parts-based structure, NMF will still factorize it, but the components may not correspond to meaningful parts. The parts-based interpretation is an empirical observation, not a guarantee.

Watch Out

NMF solutions are not unique

Unlike PCA (where eigenvectors are unique up to sign), NMF solutions depend on initialization. For any solution $(W, H)$, the pair $(WD, D^{-1}H)$ is also a solution for any diagonal matrix $D$ with strictly positive diagonal entries. Additional constraints (sparsity, orthogonality) can improve uniqueness, but the basic NMF problem has multiple global optima.
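The scaling ambiguity is a one-line check in numpy (the matrices and the diagonal of $D$ are illustrative):

```python
# Scaling ambiguity: (W D, D^{-1} H) reconstructs the same product
# as (W, H) for any positive diagonal D. Illustrative matrices.
import numpy as np

rng = np.random.default_rng(0)
W = rng.random((5, 3))
H = rng.random((3, 4))
D = np.diag([2.0, 0.5, 3.0])          # any positive diagonal matrix

W2 = W @ D                            # still nonnegative
H2 = np.linalg.inv(D) @ H             # still nonnegative
print(np.allclose(W @ H, W2 @ H2))    # True: identical reconstruction
```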

Summary

  • NMF factors $V \approx WH$ with all entries nonnegative, producing additive parts-based representations
  • Multiplicative update rules maintain nonnegativity automatically and decrease the objective monotonically
  • The objective is nonconvex; different initializations give different solutions
  • NMF components only add, never subtract; PCA components can cancel
  • Applications: topic modeling, image parts decomposition, gene expression analysis
  • If an entry of $W$ or $H$ reaches zero, it stays at zero permanently

Exercises

ExerciseCore

Problem

A document-term matrix $V$ has 1000 documents and 5000 words. You run NMF with rank $r = 20$. What are the dimensions of $W$ and $H$, and what does each row of $H$ represent?

ExerciseAdvanced

Problem

The multiplicative update for $H$ is $H_{kj} \leftarrow H_{kj} \cdot (W^\top V)_{kj} / (W^\top W H)_{kj}$. Show that if $W$ is held fixed, this update is equivalent to a scaled gradient descent step with a specific step size. What is the step size?

ExerciseResearch

Problem

Standard NMF is nonconvex and has multiple local minima. Under what conditions on $V$ and $r$ is the NMF factorization unique (up to scaling and permutation)? State the separability condition and explain its geometric meaning.

References

Canonical:

  • Lee & Seung, "Learning the parts of objects by non-negative matrix factorization" (Nature, 1999)
  • Lee & Seung, "Algorithms for Non-negative Matrix Factorization" (NeurIPS, 2001)

Current:

  • Gillis, Nonnegative Matrix Factorization (SIAM, 2020)
  • Arora et al., "Computing a Nonnegative Matrix Factorization -- Provably" (STOC, 2012)
  • Hastie, Tibshirani & Friedman, The Elements of Statistical Learning (2009)
  • Bishop, Pattern Recognition and Machine Learning (2006)


Last reviewed: April 2026
