
ML Methods

NMF (Nonnegative Matrix Factorization)

Factor V into W*H with all entries nonnegative: parts-based additive representation, multiplicative update rules, and applications to topic modeling and image decomposition.


Why This Matters

Given a data matrix $V \in \mathbb{R}_{\geq 0}^{n \times p}$ where all entries are nonnegative (e.g., word counts, pixel intensities, gene expression levels), NMF finds a low-rank factorization $V \approx WH$ where both factors are also nonnegative: $W \in \mathbb{R}_{\geq 0}^{n \times r}$ and $H \in \mathbb{R}_{\geq 0}^{r \times p}$.

The nonnegativity constraint changes the nature of the factorization. PCA allows negative values, so principal components can cancel each other: a face might be represented as "average face minus nose plus eyes." NMF forces an additive, parts-based representation: a face is represented as "some amount of nose + some amount of eyes + some amount of mouth." Each component adds to the reconstruction; nothing subtracts.

This parts-based interpretation makes NMF particularly useful for topic modeling (each topic is a nonnegative combination of words), image analysis (each component is a recognizable part), and genomics (each factor represents a biological process with nonnegative expression).
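In practice the factorization takes a few lines. The sketch below assumes scikit-learn is available; the matrix and the choice of rank $r = 2$ are illustrative, not from the text.

```python
# Minimal NMF sketch, assuming scikit-learn; V and r=2 are illustrative.
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
V = rng.random((6, 4))          # nonnegative data matrix (n=6, p=4)

model = NMF(n_components=2, init="random", random_state=0, max_iter=500)
W = model.fit_transform(V)      # n x r, nonnegative
H = model.components_           # r x p, nonnegative

print(W.shape, H.shape)         # (6, 2) (2, 4)
print((W >= 0).all() and (H >= 0).all())  # True
```

Both factors come back nonnegative by construction, so each row of $V$ is reconstructed as an additive combination of the rows of $H$.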

Mental Model

Think of NMF as decomposing a recipe into ingredients. A dish is a nonnegative combination of ingredients (you cannot add negative flour). Each row of $H$ is an "ingredient" (a basis pattern), and each row of $W$ specifies how much of each ingredient goes into the corresponding observation. Every observation is built by adding ingredients together, never subtracting.

Formal Setup

Definition

Nonnegative Matrix Factorization

Given $V \in \mathbb{R}_{\geq 0}^{n \times p}$ and a target rank $r$, NMF finds:

$$\min_{W, H \geq 0} \|V - WH\|_F^2$$

where $W \in \mathbb{R}_{\geq 0}^{n \times r}$, $H \in \mathbb{R}_{\geq 0}^{r \times p}$, and $\|\cdot\|_F$ is the Frobenius norm. The constraint $W, H \geq 0$ means all entries are nonnegative.

Definition

Divergence-Based NMF

An alternative objective uses generalized KL divergence instead of squared error:

$$D_{\text{KL}}(V \| WH) = \sum_{ij} \left( V_{ij} \log \frac{V_{ij}}{(WH)_{ij}} - V_{ij} + (WH)_{ij} \right)$$

This objective is appropriate when the data has Poisson-like statistics (e.g., count data). The choice of objective affects the update rules but not the structural properties of the factorization.
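As a sanity check, the divergence above can be computed directly from the definition. This is a minimal numpy sketch; the function name `gen_kl` and the `eps` guard against $\log 0$ are illustrative choices, not part of the text.

```python
# Generalized KL divergence D(V || WH) from the definition above.
# gen_kl and the eps guard are illustrative assumptions of this sketch.
import numpy as np

def gen_kl(V, WH, eps=1e-12):
    """Generalized KL divergence; eps guards against log(0)."""
    return np.sum(V * np.log((V + eps) / (WH + eps)) - V + WH)

rng = np.random.default_rng(0)
V = rng.random((5, 3)) + 0.1
W = rng.random((5, 2)) + 0.1
H = rng.random((2, 3)) + 0.1

print(gen_kl(V, W @ H))               # nonnegative; zero iff WH matches V
print(np.isclose(gen_kl(V, V), 0.0))  # True
```

Unlike standard KL, the generalized form does not require rows to be normalized distributions, which is why it suits raw count matrices.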

Multiplicative Update Rules

Lee and Seung (1999, 2001) introduced multiplicative update rules that maintain nonnegativity automatically. For the Frobenius norm objective:

$$W_{ik} \leftarrow W_{ik} \cdot \frac{(VH^\top)_{ik}}{(WHH^\top)_{ik}}, \quad H_{kj} \leftarrow H_{kj} \cdot \frac{(W^\top V)_{kj}}{(W^\top WH)_{kj}}$$

These updates preserve nonnegativity automatically: since $V$, $W$, and $H$ are all nonnegative, the numerators and denominators are nonnegative. Multiplying a nonnegative value by a nonnegative ratio keeps it nonnegative. No projection step is needed.
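The update rules translate almost verbatim into numpy. The following is a minimal sketch, not a production implementation; the small `eps` added to each denominator to avoid division by zero is an assumption of this sketch.

```python
# Lee-Seung multiplicative updates for the Frobenius objective.
# Minimal sketch; eps guarding division by zero is an added assumption.
import numpy as np

def nmf_multiplicative(V, r, n_iter=200, seed=0, eps=1e-10):
    rng = np.random.default_rng(seed)
    n, p = V.shape
    W = rng.random((n, r)) + 0.1   # positive init avoids zero-locking
    H = rng.random((r, p)) + 0.1
    losses = []
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)   # H update
        W *= (V @ H.T) / (W @ H @ H.T + eps)   # W update
        losses.append(np.linalg.norm(V - W @ H, "fro") ** 2)
    return W, H, losses

rng = np.random.default_rng(1)
V = rng.random((8, 6))
W, H, losses = nmf_multiplicative(V, r=3)
# The objective is non-increasing across iterations:
print(all(b <= a + 1e-9 for a, b in zip(losses, losses[1:])))  # True
```

Note that no learning rate appears anywhere: the step size is implicit in the ratio, which is the point of the multiplicative form.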

Main Theorems

Theorem

Monotonic Convergence of Multiplicative Updates

Statement

The Frobenius norm objective is non-increasing under the multiplicative updates for $W$ and $H$:

$$\|V - W^{(t+1)}H^{(t+1)}\|_F^2 \leq \|V - W^{(t)}H^{(t)}\|_F^2$$

for every iteration $t$. The objective values converge, and any limit point of the sequence $(W^{(t)}, H^{(t)})$ is a fixed point of the update equations; under mild additional conditions, such limit points are stationary points of the constrained optimization problem.

Intuition

Each multiplicative update can be derived as a majorization-minimization (MM) step. The update constructs an auxiliary function that upper-bounds the objective and touches it at the current point. Minimizing the auxiliary function (which has a closed-form solution) is guaranteed to decrease the original objective. The multiplicative form arises from the specific choice of auxiliary function.

Proof Sketch

Define an auxiliary function $G(H, H^{(t)})$ that satisfies $G(H, H^{(t)}) \geq \|V - WH\|_F^2$ with equality when $H = H^{(t)}$. For the Frobenius objective, $G$ is a quadratic expansion around $H^{(t)}$ in which the Hessian $W^\top W$ is replaced by a diagonal matrix with entries $(W^\top W H^{(t)})_{kj} / H^{(t)}_{kj}$; this diagonal matrix dominates $W^\top W$, so $G$ upper-bounds the objective. Minimizing $G$ over $H$ has a closed-form solution, which is exactly the multiplicative update rule. Monotonic decrease of the objective then follows from the MM framework. (For the KL objective, the bound instead comes from Jensen's inequality applied to the convex function $-\log$.)

Why It Matters

Monotonic convergence guarantees that the algorithm does not oscillate or diverge. Combined with the automatic maintenance of nonnegativity, the multiplicative updates make NMF simple to implement and reliable in practice. No step-size tuning is required (unlike gradient descent).

Failure Mode

Convergence is to a stationary point, not a global minimum. The NMF objective is nonconvex (bilinear in $W$ and $H$), so different initializations lead to different solutions. Multiple restarts with random initialization are standard practice. Additionally, if any entry of $W$ or $H$ reaches exactly zero, it stays at zero forever (the multiplicative update cannot escape zero). This can cause premature convergence to suboptimal solutions.
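The zero-locking behavior is easy to demonstrate: because the update multiplies each entry by a finite ratio, an entry that is exactly zero is multiplied by that ratio and stays zero. A minimal numpy sketch (the matrices and the forced zero are illustrative):

```python
# Zero-locking demo: an entry of W forced to exactly zero never escapes,
# since the multiplicative update scales it by a finite ratio.
import numpy as np

rng = np.random.default_rng(0)
V = rng.random((5, 4))
W = rng.random((5, 2))
H = rng.random((2, 4))
W[0, 0] = 0.0                     # force one entry to exactly zero

eps = 1e-10
for _ in range(50):
    H *= (W.T @ V) / (W.T @ W @ H + eps)
    W *= (V @ H.T) / (W @ H @ H.T + eps)

print(W[0, 0])  # 0.0 -- the entry stays locked at zero
```

This is why positive (rather than merely nonnegative) random initialization is the usual recommendation.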

NMF vs. PCA

| Property | PCA | NMF |
| --- | --- | --- |
| Sign constraint | None (components can be negative) | All nonnegative |
| Interpretation | Orthogonal directions of variance | Additive parts |
| Uniqueness | Unique (up to sign/rotation) | Not unique in general |
| Objective | Maximize variance (eigenvalue problem) | Minimize reconstruction error (iterative) |
| Components | Orthogonal | Not necessarily orthogonal |
| Cancellation | Components can cancel each other | Components only add |

The key difference: PCA finds directions of maximum variance and produces components that are globally ordered by explained variance. NMF finds nonnegative parts that are locally interpretable. PCA components often mix positive and negative contributions, making individual components hard to interpret. NMF components are always additive and typically correspond to recognizable parts or patterns.
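The sign difference can be checked directly. This sketch assumes scikit-learn; the data is random and illustrative. Even on all-nonnegative data, PCA's second and later components must contain negative entries (they are orthogonal to the first), while NMF components never do.

```python
# Contrast of component signs on nonnegative data, assuming scikit-learn.
import numpy as np
from sklearn.decomposition import PCA, NMF

rng = np.random.default_rng(0)
V = rng.random((30, 10))        # nonnegative data

pca = PCA(n_components=3).fit(V)
nmf = NMF(n_components=3, init="random", random_state=0,
          max_iter=500).fit(V)

print((pca.components_ < 0).any())   # True: PCA components mix signs
print((nmf.components_ >= 0).all())  # True: NMF components are nonnegative
```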

Applications

Topic modeling: $V$ is a document-term matrix (documents $\times$ words). $W$ is a document-topic matrix. $H$ is a topic-word matrix. Each topic (row of $H$) is a nonnegative distribution over words. Each document (row of $W$) is a nonnegative mixture of topics. NMF discovers topics without negative word weights, producing directly interpretable results.
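A toy version of this setup, assuming scikit-learn; the 6-document, 8-word count matrix and the two implied topics are invented for illustration. The KL objective (`beta_loss="kullback-leibler"` with `solver="mu"`) matches the count-data motivation above.

```python
# Toy topic-model sketch, assuming scikit-learn; the count matrix and
# its two implied topics are illustrative assumptions.
import numpy as np
from sklearn.decomposition import NMF

V = np.array([
    [4, 3, 0, 0, 1, 0, 2, 0],   # documents weighted toward words 0,1,6
    [5, 2, 1, 0, 0, 0, 3, 0],
    [3, 4, 0, 1, 0, 0, 2, 1],
    [0, 0, 4, 5, 0, 3, 0, 2],   # documents weighted toward words 2,3,5,7
    [0, 1, 3, 4, 1, 2, 0, 3],
    [1, 0, 5, 3, 0, 4, 0, 2],
], dtype=float)

model = NMF(n_components=2, solver="mu", beta_loss="kullback-leibler",
            init="nndsvda", random_state=0, max_iter=1000)
W = model.fit_transform(V)   # document-topic weights (6 x 2)
H = model.components_        # topic-word weights   (2 x 8)
print(W.shape, H.shape)      # (6, 2) (2, 8)
```

Rows of `H` play the role of topics; rows of `W` give each document's nonnegative mixture over them.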

Image decomposition: $V$ is a matrix of face images (each row is a flattened image). NMF learns parts: noses, eyes, mouths, and other facial features as nonnegative basis images. Reconstruction is an additive overlay of these parts.

Gene expression: $V$ is a genes-by-samples matrix. NMF identifies metagenes (groups of co-expressed genes) and their activation levels across samples. The nonnegativity constraint is natural because gene expression levels are nonnegative.

Watch Out

NMF does not always produce parts-based representations

The parts-based property depends on the data and the rank $r$. If $r$ is too large, the factors may not be interpretable. If the data does not have a natural parts-based structure, NMF will still factorize it, but the components may not correspond to meaningful parts. The parts-based interpretation is an empirical observation, not a guarantee.

Watch Out

NMF solutions are not unique

Unlike PCA (where eigenvectors are unique up to sign), NMF solutions depend on initialization. For any solution $(W, H)$, the pair $(WD, D^{-1}H)$ is also a solution for any diagonal matrix $D$ with strictly positive diagonal entries. Additional constraints (sparsity, orthogonality) can improve uniqueness, but the basic NMF problem has multiple global optima.
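The scaling ambiguity is a one-line check in numpy (the matrices and the diagonal of $D$ are illustrative):

```python
# Scaling ambiguity: (W D, D^{-1} H) reconstructs the same product
# as (W, H) for any positive diagonal D. Illustrative matrices.
import numpy as np

rng = np.random.default_rng(0)
W = rng.random((5, 3))
H = rng.random((3, 4))
D = np.diag([2.0, 0.5, 3.0])          # any positive diagonal matrix

W2 = W @ D                            # still nonnegative
H2 = np.linalg.inv(D) @ H             # still nonnegative
print(np.allclose(W @ H, W2 @ H2))    # True: identical reconstruction
```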

Summary

  • NMF factors $V \approx WH$ with all entries nonnegative, producing additive parts-based representations
  • Multiplicative update rules maintain nonnegativity automatically and decrease the objective monotonically
  • The objective is nonconvex; different initializations give different solutions
  • NMF components only add, never subtract; PCA components can cancel
  • Applications: topic modeling, image parts decomposition, gene expression analysis
  • If an entry of $W$ or $H$ reaches zero, it stays at zero permanently

Exercises

ExerciseCore

Problem

A document-term matrix $V$ has 1000 documents and 5000 words. You run NMF with rank $r = 20$. What are the dimensions of $W$ and $H$, and what does each row of $H$ represent?

ExerciseAdvanced

Problem

The multiplicative update for $H$ is $H_{kj} \leftarrow H_{kj} \cdot (W^\top V)_{kj} / (W^\top W H)_{kj}$. Show that if $W$ is held fixed, this update is equivalent to a scaled gradient descent step with a specific step size. What is the step size?

ExerciseResearch

Problem

Standard NMF is nonconvex and has multiple local minima. Under what conditions on $V$ and $r$ is the NMF factorization unique (up to scaling and permutation)? State the separability condition and explain its geometric meaning.

References

Canonical:

  • Lee & Seung, "Learning the parts of objects by non-negative matrix factorization" (Nature, 1999)
  • Lee & Seung, "Algorithms for Non-negative Matrix Factorization" (NeurIPS, 2001)

Current:

  • Gillis, Nonnegative Matrix Factorization (SIAM, 2020)
  • Arora et al., "Computing a Nonnegative Matrix Factorization -- Provably" (STOC, 2012)
  • Hastie, Tibshirani & Friedman, The Elements of Statistical Learning (2009)
  • Bishop, Pattern Recognition and Machine Learning (2006)


Last reviewed: April 2026
