
ML Methods

Self-Organizing Maps

Kohonen networks: competitive learning on a grid that produces topology-preserving mappings from high-dimensional input to low-dimensional discrete maps.


Why This Matters

A self-organizing map (SOM) is an unsupervised learning algorithm that maps high-dimensional data onto a low-dimensional grid (usually 2D) while preserving the topology of the input space. Points that are close in the input space map to nearby neurons on the grid. Points that are far apart map to distant neurons.

SOMs were introduced by Kohonen in 1982 and remain useful for visualization and exploratory data analysis. They are a precursor to modern dimensionality reduction methods like t-SNE and UMAP. All three methods aim to preserve neighborhood structure, but they use different mathematical machinery.

Mental Model

Imagine a flexible elastic sheet with a grid of neurons on it. You throw the sheet into a high-dimensional data cloud. The SOM algorithm stretches and folds the sheet so that each neuron moves toward the data points it "captures," and nearby neurons on the sheet stay near each other in data space. After training, the sheet conforms to the shape of the data distribution while maintaining its grid topology.

Formal Setup

Let $\{x_1, \ldots, x_N\}$ be input data with $x_i \in \mathbb{R}^d$. The SOM consists of $M$ neurons arranged on a grid (typically 2D), each with a weight vector $w_j \in \mathbb{R}^d$ for $j = 1, \ldots, M$.

Definition

Best Matching Unit (BMU)

For an input $x$, the best matching unit (BMU) is the neuron whose weight vector is closest to $x$:

$$\text{BMU}(x) = \arg\min_{j} \|x - w_j\|^2$$

This is the competitive step: neurons compete to be the "winner" for each input. Only the winner and its neighbors are updated.
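In code, BMU selection is a single nearest-neighbor query over the weight matrix. A minimal NumPy sketch (array names and the toy weights are illustrative):

```python
import numpy as np

def find_bmu(x, weights):
    """Return the index of the best matching unit for input x.

    weights has shape (M, d): one d-dimensional weight vector per neuron.
    """
    distances = np.sum((weights - x) ** 2, axis=1)  # squared Euclidean distance
    return int(np.argmin(distances))

# Toy example: 4 neurons with weights at the corners of the unit square
weights = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
x = np.array([0.9, 0.1])
print(find_bmu(x, weights))  # prints 1: (1, 0) is closest to (0.9, 0.1)
```

Note that squared distance suffices for the argmin; taking the square root would not change the winner.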

Definition

SOM Update Rule

After finding the BMU $j^*$ for input $x$, update all weight vectors:

$$w_j \leftarrow w_j + \eta(t) \cdot h(j, j^*, t) \cdot (x - w_j)$$

where $\eta(t)$ is the learning rate (decreasing with iteration $t$), and $h(j, j^*, t)$ is the neighborhood function that decays with grid distance between neuron $j$ and the BMU $j^*$. A common choice is a Gaussian:

$$h(j, j^*, t) = \exp\!\left(-\frac{\|r_j - r_{j^*}\|^2}{2\sigma(t)^2}\right)$$

where $r_j$ denotes the grid position of neuron $j$ and $\sigma(t)$ is the neighborhood radius, which shrinks over time.
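The update rule and Gaussian neighborhood translate directly into a few lines of NumPy. A hedged sketch, with grid positions stored as an $(M, 2)$ array of coordinates (names are illustrative):

```python
import numpy as np

def neighborhood(grid_pos, bmu_pos, sigma):
    """Gaussian neighborhood weight h(j, j*, t), computed from grid distance."""
    d2 = np.sum((grid_pos - bmu_pos) ** 2, axis=-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def som_update(weights, grid_pos, x, eta, sigma):
    """One SOM update: find the BMU for x, then pull every neuron j toward x
    with strength eta * h(j, j*, t)."""
    bmu = np.argmin(np.sum((weights - x) ** 2, axis=1))
    h = neighborhood(grid_pos, grid_pos[bmu], sigma)   # shape (M,)
    return weights + eta * h[:, None] * (x - weights)
```

With $\sigma$ large, nearly every neuron moves; as $\sigma$ shrinks, the update localizes to the BMU's immediate grid neighborhood.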

The SOM Algorithm

  1. Initialize weight vectors $w_j$ randomly (or from a PCA-based initialization)
  2. For each epoch:
    • For each input $x_i$ (presented in random order):
      • Find the BMU: $j^* = \arg\min_j \|x_i - w_j\|$
      • Update all neurons: $w_j \leftarrow w_j + \eta(t)\, h(j, j^*, t)(x_i - w_j)$
    • Decrease $\eta(t)$ and $\sigma(t)$
  3. Stop when the learning rate is sufficiently small

The neighborhood radius $\sigma(t)$ starts large (covering most of the grid) and shrinks to cover only the BMU itself. Early iterations organize the global structure. Late iterations refine local details.
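Putting the steps together, here is a minimal training loop in NumPy. The linear decay schedules, grid size, and initialization scale are illustrative choices, not canonical settings:

```python
import numpy as np

def train_som(data, rows=10, cols=10, epochs=20,
              eta0=0.5, eta1=0.05, sigma0=5.0, sigma1=0.5, seed=0):
    """Train a rows x cols SOM on data of shape (N, d). Returns (weights, grid)."""
    rng = np.random.default_rng(seed)
    M, d = rows * cols, data.shape[1]
    grid = np.array([[r, c] for r in range(rows) for c in range(cols)], dtype=float)
    # Random initialization roughly matched to the data's scale
    weights = rng.normal(size=(M, d)) * data.std() + data.mean()

    T = epochs * len(data)
    t = 0
    for epoch in range(epochs):
        for x in rng.permutation(data):                # random presentation order
            frac = t / T
            eta = eta0 + frac * (eta1 - eta0)          # decay learning rate
            sigma = sigma0 + frac * (sigma1 - sigma0)  # shrink neighborhood radius
            bmu = np.argmin(np.sum((weights - x) ** 2, axis=1))
            h = np.exp(-np.sum((grid - grid[bmu]) ** 2, axis=1) / (2 * sigma ** 2))
            weights += eta * h[:, None] * (x - weights)
            t += 1
    return weights, grid
```

After training, the mean distance from each point to its BMU (the quantization error) is a quick sanity check that the map has conformed to the data.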

Main Theorems

Proposition

Topology Preservation of SOMs

Statement

For a one-dimensional SOM (neurons on a line) with a continuous, connected input distribution, the weight vectors converge to a monotone ordering that preserves the topology of the input space. Formally, if $x < y$ in the input space, then the grid positions of $\text{BMU}(x)$ and $\text{BMU}(y)$ are ordered correspondingly. In the 1D case with $\sigma(t) \to 0$ and $\eta(t) \to 0$ annealed appropriately, the SOM converges to a topology-preserving mapping with probability 1.

Intuition

The neighborhood function is the key mechanism. When the BMU updates, its neighbors on the grid are pulled in the same direction. This creates a "chain" effect: adjacent grid neurons are dragged toward nearby regions of input space. If neuron 5 captures data near $x$ and neuron 6 captures data near $y$, the neighborhood coupling ensures that $w_5$ and $w_6$ remain close in data space, preserving the grid adjacency as spatial proximity.
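The ordering effect is easy to observe empirically. A small experiment (all parameters are illustrative): train a 1D SOM on uniform scalar data and check whether the weights end up monotone along the neuron chain:

```python
import numpy as np

rng = np.random.default_rng(0)
M = 20                                    # neurons on a line
w = rng.uniform(0, 1, M)                  # random scalar weights, initially unordered
steps = 4000
for t in range(steps):
    x = rng.uniform(0, 1)
    frac = t / steps
    eta = 0.5 * (1 - frac) + 0.02 * frac      # decaying learning rate
    sigma = 10.0 * (1 - frac) + 0.5 * frac    # shrinking neighborhood radius
    bmu = np.argmin((w - x) ** 2)
    h = np.exp(-(np.arange(M) - bmu) ** 2 / (2 * sigma ** 2))
    w += eta * h * (x - w)

d = np.diff(w)
# With a broad initial neighborhood and slow annealing, the weights typically
# become monotone (increasing or decreasing) along the chain:
print(np.all(d > 0) or np.all(d < 0))
```

The direction of the ordering (increasing vs. decreasing) is arbitrary; only the monotonicity is predicted by the proposition.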

Why It Matters

Topology preservation is what makes SOMs useful for visualization. When you color or label the grid neurons, spatially adjacent neurons represent similar data. Clusters appear as contiguous regions on the grid. Dissimilar data maps to distant grid positions. Without topology preservation, the grid would be a random assignment of data to neurons with no interpretable structure.

Failure Mode

Topology preservation is proven only for the 1D case. In 2D and higher, the SOM can develop "twists" or "folds" where the grid topology does not match the data topology. This happens when the neighborhood radius shrinks too quickly, freezing the map before global ordering is achieved. Large initial neighborhood radius and slow annealing help but do not guarantee perfect topology preservation in 2D.

Connection to Other Methods

K-means clustering: if the neighborhood radius $\sigma$ is set to zero (no neighborhood influence), the SOM reduces to online k-means: each input updates only the BMU. The topology-preserving property vanishes. SOMs are k-means with the additional constraint that cluster centers must respect a grid topology.

t-SNE and UMAP: like SOMs, these methods preserve neighborhood structure in a low-dimensional embedding. The mathematical approaches differ. SOMs use competitive learning with neighborhood functions. t-SNE minimizes KL divergence between high-dimensional and low-dimensional pairwise similarity distributions. UMAP uses cross-entropy on fuzzy simplicial sets. SOMs produce a discrete grid; t-SNE and UMAP produce continuous embeddings.

PCA: PCA finds linear projections that maximize variance. SOMs find nonlinear mappings that preserve topology. PCA is a spectral method; SOMs are a learning algorithm. PCA is optimal for Gaussian data; SOMs handle arbitrary manifold structure.

Practical Considerations

Grid size: too small and distinct clusters merge; too large and the map is sparse with many empty neurons. Common choices: $10 \times 10$ to $50 \times 50$ for moderate datasets.

Initialization: random initialization works but may require many iterations for global ordering. PCA-based initialization (aligning the grid with the first two principal components) converges faster and reduces the risk of topological defects.
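One way to realize PCA-based initialization is to lay the grid out on the plane spanned by the first two principal components, scaled by the data's spread along each. A sketch using SVD (the $\pm 2$ standard-deviation span is an illustrative choice):

```python
import numpy as np

def pca_init(data, rows, cols):
    """Initialize SOM weights on the plane of the first two principal components."""
    mean = data.mean(axis=0)
    centered = data - mean
    # Rows of Vt are principal directions; singular values measure spread.
    U, S, Vt = np.linalg.svd(centered, full_matrices=False)
    scale = S[:2] / np.sqrt(len(data))            # approx. std dev along each PC
    a = np.linspace(-2, 2, rows)                  # grid coordinates in PC units
    b = np.linspace(-2, 2, cols)
    A, B = np.meshgrid(a, b, indexing="ij")
    weights = (mean
               + A.reshape(-1, 1) * scale[0] * Vt[0]
               + B.reshape(-1, 1) * scale[1] * Vt[1])
    return weights                                # shape (rows * cols, d)
```

Because the grid starts already aligned with the data's dominant directions, the global ordering phase is largely done before training begins.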

U-matrix visualization: the U-matrix shows the distance between adjacent neuron weight vectors. Large distances indicate cluster boundaries. Small distances indicate within-cluster regions. This is the primary visualization tool for trained SOMs.
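A simple U-matrix can be computed by averaging, for each neuron, the weight-space distance to its 4-connected grid neighbors. A minimal sketch (the 4-neighborhood is one common convention; some implementations also include diagonals):

```python
import numpy as np

def u_matrix(weights, rows, cols):
    """Average weight-space distance from each neuron to its grid neighbors."""
    W = weights.reshape(rows, cols, -1)
    U = np.zeros((rows, cols))
    for r in range(rows):
        for c in range(cols):
            dists = []
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < rows and 0 <= cc < cols:  # skip off-grid neighbors
                    dists.append(np.linalg.norm(W[r, c] - W[rr, cc]))
            U[r, c] = np.mean(dists)
    return U
```

Plotting `U` as a heatmap shows cluster boundaries as bright ridges (large inter-neuron distances) separating dark within-cluster basins.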

Watch Out

SOMs are not dimensionality reduction in the same sense as PCA

PCA computes a linear projection applicable to any new data point. SOMs learn a discrete mapping specific to the training data. To project a new point, you find its BMU, which is a nearest-neighbor lookup, not a linear transformation. SOMs do not provide a continuous mapping from input space to the grid.

Watch Out

The neighborhood function is not optional

Without the neighborhood function, the SOM degenerates to k-means. The neighborhood function is what creates the topological structure of the map. Setting the neighborhood radius too small too early produces a map with no spatial coherence.

Summary

  • SOMs map high-dimensional data to a low-dimensional grid via competitive learning with neighborhood updates
  • BMU selection is nearest-neighbor; the neighborhood function pulls nearby grid neurons toward similar data
  • Topology preservation is proven for 1D; in 2D it depends on annealing schedules and initialization
  • With zero neighborhood radius, SOMs reduce to online k-means
  • The U-matrix visualizes cluster boundaries on the trained grid
  • SOMs are a precursor to t-SNE and UMAP; all preserve neighborhood structure using different mathematics

Exercises

ExerciseCore

Problem

A SOM has a $10 \times 10$ grid (100 neurons) with weight vectors in $\mathbb{R}^{50}$. For an input $x$, the BMU is neuron $(3, 4)$. The neighborhood function is Gaussian with $\sigma = 2$. Compute the neighborhood weight $h$ for neuron $(5, 6)$.

ExerciseAdvanced

Problem

Explain why a SOM trained with a constant (non-decreasing) neighborhood radius $\sigma$ fails to converge. What goes wrong, and what does the trained map look like?

References

Canonical:

  • Kohonen, Self-Organizing Maps (Springer, 3rd edition, 2001), Chapters 3-5
  • Kohonen, "Self-organized formation of topologically correct feature maps" (1982)

Current:

  • Vesanto & Alhoniemi, "Clustering of the self-organizing map" (IEEE Transactions on Neural Networks, 2000)

  • Bishop, Pattern Recognition and Machine Learning (Springer, 2006)

  • Murphy, Machine Learning: A Probabilistic Perspective (MIT Press, 2012)

Next Topics

The natural next steps from self-organizing maps:

  • t-SNE and UMAP: modern nonlinear dimensionality reduction methods that generalize the neighborhood-preservation idea
  • K-means clustering: the degenerate case of SOMs without neighborhood coupling

Last reviewed: April 2026
