
ML Methods

Self-Organizing Maps

Kohonen networks: competitive learning on a grid that produces topology-preserving mappings from high-dimensional input to low-dimensional discrete maps.


Why This Matters

A self-organizing map (SOM) is an unsupervised learning algorithm that maps high-dimensional data onto a low-dimensional grid (usually 2D) while preserving the topology of the input space. Points that are close in the input space map to nearby neurons on the grid. Points that are far apart map to distant neurons.

SOMs were introduced by Kohonen in 1982 and remain useful for visualization and exploratory data analysis. They are a precursor to modern dimensionality reduction methods like t-SNE and UMAP. All three methods aim to preserve neighborhood structure, but they use different mathematical machinery.

Mental Model

Imagine a flexible elastic sheet with a grid of neurons on it. You throw the sheet into a high-dimensional data cloud. The SOM algorithm stretches and folds the sheet so that each neuron moves toward the data points it "captures," and nearby neurons on the sheet stay near each other in data space. After training, the sheet conforms to the shape of the data distribution while maintaining its grid topology.

Formal Setup

Let $\{x_1, \ldots, x_N\}$ be input data with $x_i \in \mathbb{R}^d$. The SOM consists of $M$ neurons arranged on a grid (typically 2D), each with a weight vector $w_j \in \mathbb{R}^d$ for $j = 1, \ldots, M$.

Definition

Best Matching Unit (BMU)

For an input $x$, the best matching unit (BMU) is the neuron whose weight vector is closest to $x$:

$$\text{BMU}(x) = \arg\min_{j} \|x - w_j\|^2$$

This is the competitive step: neurons compete to be the "winner" for each input. Only the winner and its neighbors are updated.
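In code, BMU selection is a single nearest-neighbor query over the weight matrix. A minimal NumPy sketch (array names and the toy weights are illustrative):

```python
import numpy as np

def find_bmu(x, weights):
    """Return the index of the best matching unit for input x.

    weights has shape (M, d): one d-dimensional weight vector per neuron.
    """
    distances = np.sum((weights - x) ** 2, axis=1)  # squared Euclidean distance
    return int(np.argmin(distances))

# Toy example: 4 neurons with weights at the corners of the unit square
weights = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
x = np.array([0.9, 0.1])
print(find_bmu(x, weights))  # prints 1: (1, 0) is closest to (0.9, 0.1)
```

Note that squared distance suffices for the argmin; taking the square root would not change the winner.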

Definition

SOM Update Rule

After finding the BMU $j^*$ for input $x$, update all weight vectors:

$$w_j \leftarrow w_j + \eta(t) \cdot h(j, j^*, t) \cdot (x - w_j)$$

where $\eta(t)$ is the learning rate (decreasing with iteration $t$), and $h(j, j^*, t)$ is the neighborhood function that decays with grid distance between neuron $j$ and the BMU $j^*$. A common choice is a Gaussian:

$$h(j, j^*, t) = \exp\!\left(-\frac{\|r_j - r_{j^*}\|^2}{2\sigma(t)^2}\right)$$

where $r_j$ denotes the grid position of neuron $j$ and $\sigma(t)$ is the neighborhood radius, which shrinks over time.
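The update rule and Gaussian neighborhood translate directly into a few lines of NumPy. A hedged sketch, with grid positions stored as an $(M, 2)$ array of coordinates (names are illustrative):

```python
import numpy as np

def neighborhood(grid_pos, bmu_pos, sigma):
    """Gaussian neighborhood weight h(j, j*, t), computed from grid distance."""
    d2 = np.sum((grid_pos - bmu_pos) ** 2, axis=-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def som_update(weights, grid_pos, x, eta, sigma):
    """One SOM update: find the BMU for x, then pull every neuron j toward x
    with strength eta * h(j, j*, t)."""
    bmu = np.argmin(np.sum((weights - x) ** 2, axis=1))
    h = neighborhood(grid_pos, grid_pos[bmu], sigma)   # shape (M,)
    return weights + eta * h[:, None] * (x - weights)
```

With $\sigma$ large, nearly every neuron moves; as $\sigma$ shrinks, the update localizes to the BMU's immediate grid neighborhood.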

The SOM Algorithm

  1. Initialize weight vectors $w_j$ randomly (or from a PCA-based initialization)
  2. For each epoch:
    • For each input $x_i$ (presented in random order):
      • Find the BMU: $j^* = \arg\min_j \|x_i - w_j\|$
      • Update all neurons: $w_j \leftarrow w_j + \eta(t)\, h(j, j^*, t)(x_i - w_j)$
    • Decrease $\eta(t)$ and $\sigma(t)$
  3. Stop when the learning rate is sufficiently small

The neighborhood radius $\sigma(t)$ starts large (covering most of the grid) and shrinks to cover only the BMU itself. Early iterations organize the global structure. Late iterations refine local details.
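Putting the steps together, here is a minimal training loop in NumPy. The linear decay schedules, grid size, and initialization scale are illustrative choices, not canonical settings:

```python
import numpy as np

def train_som(data, rows=10, cols=10, epochs=20,
              eta0=0.5, eta1=0.05, sigma0=5.0, sigma1=0.5, seed=0):
    """Train a rows x cols SOM on data of shape (N, d). Returns (weights, grid)."""
    rng = np.random.default_rng(seed)
    M, d = rows * cols, data.shape[1]
    grid = np.array([[r, c] for r in range(rows) for c in range(cols)], dtype=float)
    # Random initialization roughly matched to the data's scale
    weights = rng.normal(size=(M, d)) * data.std() + data.mean()

    T = epochs * len(data)
    t = 0
    for epoch in range(epochs):
        for x in rng.permutation(data):                # random presentation order
            frac = t / T
            eta = eta0 + frac * (eta1 - eta0)          # decay learning rate
            sigma = sigma0 + frac * (sigma1 - sigma0)  # shrink neighborhood radius
            bmu = np.argmin(np.sum((weights - x) ** 2, axis=1))
            h = np.exp(-np.sum((grid - grid[bmu]) ** 2, axis=1) / (2 * sigma ** 2))
            weights += eta * h[:, None] * (x - weights)
            t += 1
    return weights, grid
```

After training, the mean distance from each point to its BMU (the quantization error) is a quick sanity check that the map has conformed to the data.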

Main Theorems

Proposition

Topology Preservation of SOMs

Statement

For a one-dimensional SOM (neurons on a line) with a continuous, connected input distribution, the weight vectors converge to a monotone ordering that preserves the topology of the input space. Formally, if $x < y$ in the input space, then the grid positions of $\text{BMU}(x)$ and $\text{BMU}(y)$ are ordered correspondingly. In the 1D case with $\sigma(t) \to 0$ and $\eta(t) \to 0$ annealed appropriately, the SOM converges to a topology-preserving mapping with probability 1.

Intuition

The neighborhood function is the key mechanism. When the BMU updates, its neighbors on the grid are pulled in the same direction. This creates a "chain" effect: adjacent grid neurons are dragged toward nearby regions of input space. If neuron 5 captures data near $x$ and neuron 6 captures data near $y$, the neighborhood coupling ensures that $w_5$ and $w_6$ remain close in data space, preserving the grid adjacency as spatial proximity.
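The ordering effect is easy to observe empirically. A small experiment (all parameters are illustrative): train a 1D SOM on uniform scalar data and check whether the weights end up monotone along the neuron chain:

```python
import numpy as np

rng = np.random.default_rng(0)
M = 20                                    # neurons on a line
w = rng.uniform(0, 1, M)                  # random scalar weights, initially unordered
steps = 4000
for t in range(steps):
    x = rng.uniform(0, 1)
    frac = t / steps
    eta = 0.5 * (1 - frac) + 0.02 * frac      # decaying learning rate
    sigma = 10.0 * (1 - frac) + 0.5 * frac    # shrinking neighborhood radius
    bmu = np.argmin((w - x) ** 2)
    h = np.exp(-(np.arange(M) - bmu) ** 2 / (2 * sigma ** 2))
    w += eta * h * (x - w)

d = np.diff(w)
# With a broad initial neighborhood and slow annealing, the weights typically
# become monotone (increasing or decreasing) along the chain:
print(np.all(d > 0) or np.all(d < 0))
```

The direction of the ordering (increasing vs. decreasing) is arbitrary; only the monotonicity is predicted by the proposition.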

Why It Matters

Topology preservation is what makes SOMs useful for visualization. When you color or label the grid neurons, spatially adjacent neurons represent similar data. Clusters appear as contiguous regions on the grid. Dissimilar data maps to distant grid positions. Without topology preservation, the grid would be a random assignment of data to neurons with no interpretable structure.

Failure Mode

Topology preservation is proven only for the 1D case. In 2D and higher, the SOM can develop "twists" or "folds" where the grid topology does not match the data topology. This happens when the neighborhood radius shrinks too quickly, freezing the map before global ordering is achieved. Large initial neighborhood radius and slow annealing help but do not guarantee perfect topology preservation in 2D.

Connection to Other Methods

K-means clustering: if the neighborhood radius $\sigma$ is set to zero (no neighborhood influence), the SOM reduces to online k-means: each input updates only the BMU. The topology-preserving property vanishes. SOMs are k-means with the additional constraint that cluster centers must respect a grid topology.

t-SNE and UMAP: like SOMs, these methods preserve neighborhood structure in a low-dimensional embedding. The mathematical approaches differ. SOMs use competitive learning with neighborhood functions. t-SNE minimizes KL divergence between high-dimensional and low-dimensional pairwise similarity distributions. UMAP uses cross-entropy on fuzzy simplicial sets. SOMs produce a discrete grid; t-SNE and UMAP produce continuous embeddings.

PCA: PCA finds linear projections that maximize variance. SOMs find nonlinear mappings that preserve topology. PCA is a spectral method; SOMs are a learning algorithm. PCA is optimal for Gaussian data; SOMs handle arbitrary manifold structure.

Practical Considerations

Grid size: too small and distinct clusters merge; too large and the map is sparse with many empty neurons. Common choices: $10 \times 10$ to $50 \times 50$ for moderate datasets.

Initialization: random initialization works but may require many iterations for global ordering. PCA-based initialization (aligning the grid with the first two principal components) converges faster and reduces the risk of topological defects.
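One way to realize PCA-based initialization is to lay the grid out on the plane spanned by the first two principal components, scaled by the data's spread along each. A sketch using SVD (the $\pm 2$ standard-deviation span is an illustrative choice):

```python
import numpy as np

def pca_init(data, rows, cols):
    """Initialize SOM weights on the plane of the first two principal components."""
    mean = data.mean(axis=0)
    centered = data - mean
    # Rows of Vt are principal directions; singular values measure spread.
    U, S, Vt = np.linalg.svd(centered, full_matrices=False)
    scale = S[:2] / np.sqrt(len(data))            # approx. std dev along each PC
    a = np.linspace(-2, 2, rows)                  # grid coordinates in PC units
    b = np.linspace(-2, 2, cols)
    A, B = np.meshgrid(a, b, indexing="ij")
    weights = (mean
               + A.reshape(-1, 1) * scale[0] * Vt[0]
               + B.reshape(-1, 1) * scale[1] * Vt[1])
    return weights                                # shape (rows * cols, d)
```

Because the grid starts already aligned with the data's dominant directions, the global ordering phase is largely done before training begins.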

U-matrix visualization: the U-matrix shows the distance between adjacent neuron weight vectors. Large distances indicate cluster boundaries. Small distances indicate within-cluster regions. This is the primary visualization tool for trained SOMs.
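A simple U-matrix can be computed by averaging, for each neuron, the weight-space distance to its 4-connected grid neighbors. A minimal sketch (the 4-neighborhood is one common convention; some implementations also include diagonals):

```python
import numpy as np

def u_matrix(weights, rows, cols):
    """Average weight-space distance from each neuron to its grid neighbors."""
    W = weights.reshape(rows, cols, -1)
    U = np.zeros((rows, cols))
    for r in range(rows):
        for c in range(cols):
            dists = []
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < rows and 0 <= cc < cols:  # skip off-grid neighbors
                    dists.append(np.linalg.norm(W[r, c] - W[rr, cc]))
            U[r, c] = np.mean(dists)
    return U
```

Plotting `U` as a heatmap shows cluster boundaries as bright ridges (large inter-neuron distances) separating dark within-cluster basins.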

Watch Out

SOMs are not dimensionality reduction in the same sense as PCA

PCA computes a linear projection applicable to any new data point. SOMs learn a discrete mapping specific to the training data. To project a new point, you find its BMU, which is a nearest-neighbor lookup, not a linear transformation. SOMs do not provide a continuous mapping from input space to the grid.

Watch Out

The neighborhood function is not optional

Without the neighborhood function, the SOM degenerates to k-means. The neighborhood function is what creates the topological structure of the map. Setting the neighborhood radius too small too early produces a map with no spatial coherence.

Summary

  • SOMs map high-dimensional data to a low-dimensional grid via competitive learning with neighborhood updates
  • BMU selection is nearest-neighbor; the neighborhood function pulls nearby grid neurons toward similar data
  • Topology preservation is proven for 1D; in 2D it depends on annealing schedules and initialization
  • With zero neighborhood radius, SOMs reduce to online k-means
  • The U-matrix visualizes cluster boundaries on the trained grid
  • SOMs are a precursor to t-SNE and UMAP; all preserve neighborhood structure using different mathematics

Exercises

ExerciseCore

Problem

A SOM has a $10 \times 10$ grid (100 neurons) with weight vectors in $\mathbb{R}^{50}$. For an input $x$, the BMU is neuron $(3, 4)$. The neighborhood function is Gaussian with $\sigma = 2$. Compute the neighborhood weight $h$ for neuron $(5, 6)$.

ExerciseAdvanced

Problem

Explain why a SOM trained with a constant (non-decreasing) neighborhood radius $\sigma$ fails to converge. What goes wrong, and what does the trained map look like?

References

Canonical:

  • Kohonen, Self-Organizing Maps (Springer, 3rd edition, 2001), Chapters 3-5
  • Kohonen, "Self-organized formation of topologically correct feature maps" (1982)

Current:

  • Vesanto & Alhoniemi, "Clustering of the self-organizing map" (IEEE Transactions on Neural Networks, 2000)

  • Bishop, Pattern Recognition and Machine Learning (Springer, 2006)

  • Murphy, Machine Learning: A Probabilistic Perspective (MIT Press, 2012)

Next Topics

The natural next steps from self-organizing maps:

  • t-SNE and UMAP: modern nonlinear dimensionality reduction methods that generalize the neighborhood-preservation idea
  • K-means clustering: the degenerate case of SOMs without neighborhood coupling

Last reviewed: April 2026
