ML Methods
Self-Organizing Maps
Kohonen networks: competitive learning on a grid that produces topology-preserving mappings from high-dimensional input to low-dimensional discrete maps.
Why This Matters
A self-organizing map (SOM) is an unsupervised learning algorithm that maps high-dimensional data onto a low-dimensional grid (usually 2D) while preserving the topology of the input space. Points that are close in the input space map to nearby neurons on the grid. Points that are far apart map to distant neurons.
SOMs were introduced by Kohonen in 1982 and remain useful for visualization and exploratory data analysis. They are a precursor to modern dimensionality reduction methods like t-SNE and UMAP. All three methods aim to preserve neighborhood structure, but they use different mathematical machinery.
Mental Model
Imagine a flexible elastic sheet with a grid of neurons on it. You throw the sheet into a high-dimensional data cloud. The SOM algorithm stretches and folds the sheet so that each neuron moves toward the data points it "captures," and nearby neurons on the sheet stay near each other in data space. After training, the sheet conforms to the shape of the data distribution while maintaining its grid topology.
Formal Setup
Let $x_1, \ldots, x_n$ be input data with $x_i \in \mathbb{R}^d$. The SOM consists of $K$ neurons arranged on a grid (typically 2D), each with a weight vector $w_k \in \mathbb{R}^d$ for $k = 1, \ldots, K$.
Best Matching Unit (BMU)
For an input $x$, the best matching unit (BMU) is the neuron $c$ whose weight vector is closest to $x$:

$$c = \arg\min_{k} \, \| x - w_k \|$$
This is the competitive step: neurons compete to be the "winner" for each input. Only the winner and its neighbors are updated.
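The competitive step is a nearest-neighbor lookup over the weight vectors. A minimal NumPy sketch (the function name `find_bmu` and the flat `(K, d)` weight layout are conventions of this illustration, not a standard API):

```python
import numpy as np

def find_bmu(x, weights):
    """Return the index of the best matching unit (BMU): the neuron
    whose weight vector is closest to input x in Euclidean distance.

    x       : input vector, shape (d,)
    weights : weight vectors of all K neurons, shape (K, d)
    """
    distances = np.linalg.norm(weights - x, axis=1)  # (K,) distances to x
    return int(np.argmin(distances))                 # index of the winner
```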
SOM Update Rule
After finding the BMU $c$ for input $x$, update all weight vectors:

$$w_k(t+1) = w_k(t) + \alpha(t) \, h_{ck}(t) \, \big( x - w_k(t) \big)$$

where $\alpha(t)$ is the learning rate (decreasing with iteration $t$), and $h_{ck}(t)$ is the neighborhood function that decays with grid distance between neuron $k$ and the BMU $c$. A common choice is a Gaussian:

$$h_{ck}(t) = \exp\left( -\frac{\| r_c - r_k \|^2}{2 \sigma(t)^2} \right)$$

where $r_k$ denotes the grid position of neuron $k$ and $\sigma(t)$ is the neighborhood radius, which shrinks over time.
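The Gaussian neighborhood can be evaluated for every neuron at once. A short sketch (the helper name `gaussian_neighborhood` is an assumption of this illustration):

```python
import numpy as np

def gaussian_neighborhood(grid_pos, bmu_pos, sigma):
    """Gaussian neighborhood weight h_ck for every neuron.

    grid_pos : grid position r_k of each neuron, shape (K, 2)
    bmu_pos  : grid position r_c of the BMU, shape (2,)
    sigma    : neighborhood radius sigma(t)
    Returns an array of shape (K,); equals 1 at the BMU and decays
    with squared grid distance from it.
    """
    d2 = np.sum((grid_pos - bmu_pos) ** 2, axis=1)  # squared grid distances
    return np.exp(-d2 / (2.0 * sigma ** 2))
```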
The SOM Algorithm
- Initialize weight vectors $w_k$ randomly (or from a PCA-based initialization)
- For each epoch:
  - For each input $x$ (presented in random order):
    - Find the BMU: $c = \arg\min_k \| x - w_k \|$
    - Update all neurons: $w_k \leftarrow w_k + \alpha(t) \, h_{ck}(t) \, (x - w_k)$
  - Decrease $\alpha(t)$ and $\sigma(t)$
- Stop when the learning rate $\alpha(t)$ is sufficiently small
The neighborhood radius starts large (covering most of the grid) and shrinks to cover only the BMU itself. Early iterations organize the global structure. Late iterations refine local details.
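The full loop can be sketched end to end. This is a minimal illustration, assuming an exponential annealing schedule for $\alpha(t)$ and $\sigma(t)$; the schedule, constants, and function name `train_som` are illustrative choices, not prescribed by the algorithm:

```python
import numpy as np

def train_som(data, grid_shape=(10, 10), epochs=20,
              alpha0=0.5, sigma0=5.0, seed=0):
    """Minimal SOM training loop: anneal the learning rate alpha(t) and
    neighborhood radius sigma(t), and pull every neuron toward each
    input, weighted by the Gaussian neighborhood around the BMU."""
    rng = np.random.default_rng(seed)
    rows, cols = grid_shape
    d = data.shape[1]
    # Grid position r_k of each neuron, shape (K, 2)
    grid = np.array([(i, j) for i in range(rows) for j in range(cols)],
                    dtype=float)
    weights = rng.random((rows * cols, d))  # random initialization
    for t in range(epochs):
        frac = t / epochs
        alpha = alpha0 * np.exp(-3.0 * frac)  # decreasing learning rate
        sigma = sigma0 * np.exp(-3.0 * frac)  # shrinking neighborhood radius
        for x in rng.permutation(data):       # inputs in random order
            c = np.argmin(np.linalg.norm(weights - x, axis=1))  # BMU
            d2 = np.sum((grid - grid[c]) ** 2, axis=1)          # grid distances
            h = np.exp(-d2 / (2.0 * sigma ** 2))                # neighborhood
            weights += alpha * h[:, None] * (x - weights)       # update all
    return weights, grid
```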
Main Theorems
Topology Preservation of SOMs
Statement
For a one-dimensional SOM (neurons on a line) with a continuous, connected input distribution, the weight vectors converge to a monotone ordering that preserves the topology of the input space. Formally, if $x_i < x_j$ in the input space, then the BMUs of $x_i$ and $x_j$ are ordered correspondingly on the grid. In the 1D case, with input dimension $d = 1$ and $\alpha(t)$, $\sigma(t)$ annealed appropriately, the SOM converges to a topology-preserving mapping with probability 1.
Intuition
The neighborhood function is the key mechanism. When the BMU updates, its neighbors on the grid are pulled in the same direction. This creates a "chain" effect: adjacent grid neurons are dragged toward nearby regions of input space. If neuron 5 captures data near $w_5$ and neuron 6 captures data near $w_6$, the neighborhood coupling ensures that $w_5$ and $w_6$ remain close in data space, preserving the grid adjacency as spatial proximity.
Why It Matters
Topology preservation is what makes SOMs useful for visualization. When you color or label the grid neurons, spatially adjacent neurons represent similar data. Clusters appear as contiguous regions on the grid. Dissimilar data maps to distant grid positions. Without topology preservation, the grid would be a random assignment of data to neurons with no interpretable structure.
Failure Mode
Topology preservation is proven only for the 1D case. In 2D and higher, the SOM can develop "twists" or "folds" where the grid topology does not match the data topology. This happens when the neighborhood radius shrinks too quickly, freezing the map before global ordering is achieved. Large initial neighborhood radius and slow annealing help but do not guarantee perfect topology preservation in 2D.
Connection to Other Methods
K-means clustering: if the neighborhood radius is set to zero (no neighborhood influence), the SOM reduces to online k-means: each input updates only the BMU. The topology-preserving property vanishes. SOMs are k-means with the additional constraint that cluster centers must respect a grid topology.
t-SNE and UMAP: like SOMs, these methods preserve neighborhood structure in a low-dimensional embedding. The mathematical approaches differ. SOMs use competitive learning with neighborhood functions. t-SNE minimizes KL divergence between high-dimensional and low-dimensional pairwise similarity distributions. UMAP uses cross-entropy on fuzzy simplicial sets. SOMs produce a discrete grid; t-SNE and UMAP produce continuous embeddings.
PCA: PCA finds linear projections that maximize variance. SOMs find nonlinear mappings that preserve topology. PCA is a spectral method; SOMs are a learning algorithm. PCA is optimal for Gaussian data; SOMs handle arbitrary manifold structure.
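The k-means reduction above can be made concrete: with the neighborhood collapsed to an indicator on the BMU ($h_{ck} = 1$ if $k = c$, else $0$), each SOM update moves only the winning neuron, which is exactly one step of online (sequential) k-means. A sketch (the helper name `online_kmeans_step` is an assumption of this illustration):

```python
import numpy as np

def online_kmeans_step(x, weights, alpha):
    """SOM update with zero neighborhood radius: h_ck is an indicator
    on the BMU, so only the winning neuron moves toward x. This is one
    step of online k-means. Modifies `weights` in place; returns the
    index of the winner."""
    c = int(np.argmin(np.linalg.norm(weights - x, axis=1)))  # BMU
    weights[c] += alpha * (x - weights[c])                   # move winner only
    return c
```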
Practical Considerations
Grid size: too small and distinct clusters merge; too large and the map is sparse with many empty neurons. Common choices for moderate datasets are grids on the order of $10 \times 10$ to $20 \times 20$.
Initialization: random initialization works but may require many iterations for global ordering. PCA-based initialization (aligning the grid with the first two principal components) converges faster and reduces the risk of topological defects.
U-matrix visualization: the U-matrix shows the distance between adjacent neuron weight vectors. Large distances indicate cluster boundaries. Small distances indicate within-cluster regions. This is the primary visualization tool for trained SOMs.
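The U-matrix computation is a simple pass over grid neighbors. A sketch for a rectangular grid with 4-connectivity (the function name `u_matrix` and the `(rows, cols)` layout are assumptions of this illustration):

```python
import numpy as np

def u_matrix(weights, grid_shape):
    """U-matrix: for each neuron, the mean distance between its weight
    vector and those of its (up to 4) grid neighbors. Large values mark
    cluster boundaries; small values mark within-cluster regions.

    weights    : trained weight vectors, shape (rows * cols, d)
    grid_shape : (rows, cols) of the SOM grid
    """
    rows, cols = grid_shape
    W = weights.reshape(rows, cols, -1)  # grid-indexed weights
    U = np.zeros((rows, cols))
    for i in range(rows):
        for j in range(cols):
            dists = []
            for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                ni, nj = i + di, j + dj
                if 0 <= ni < rows and 0 <= nj < cols:  # stay on the grid
                    dists.append(np.linalg.norm(W[i, j] - W[ni, nj]))
            U[i, j] = np.mean(dists)
    return U
```

Plotting `U` as a heatmap gives the standard U-matrix view: dark ridges separate light within-cluster basins.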
SOMs are not dimensionality reduction in the same sense as PCA
PCA computes a linear projection applicable to any new data point. SOMs learn a discrete mapping specific to the training data. To project a new point, you find its BMU, which is a nearest-neighbor lookup, not a linear transformation. SOMs do not provide a continuous mapping from input space to the grid.
The neighborhood function is not optional
Without the neighborhood function, the SOM degenerates to k-means. The neighborhood function is what creates the topological structure of the map. Setting the neighborhood radius too small too early produces a map with no spatial coherence.
Summary
- SOMs map high-dimensional data to a low-dimensional grid via competitive learning with neighborhood updates
- BMU selection is nearest-neighbor; the neighborhood function pulls nearby grid neurons toward similar data
- Topology preservation is proven for 1D; in 2D it depends on annealing schedules and initialization
- With zero neighborhood radius, SOMs reduce to online k-means
- The U-matrix visualizes cluster boundaries on the trained grid
- SOMs are a precursor to t-SNE and UMAP; all preserve neighborhood structure using different mathematics
Exercises
Problem
A SOM has a $10 \times 10$ grid (100 neurons) with weight vectors in $\mathbb{R}^3$. For an input $x$, the BMU is the neuron $c$ at grid position $r_c = (4, 4)$. The neighborhood function is Gaussian with $\sigma = 2$. Compute the neighborhood weight $h_{ck}$ for the neuron $k$ at grid position $r_k = (6, 4)$.
Problem
Explain why a SOM trained with a constant (non-decreasing) neighborhood radius fails to converge. What goes wrong, and what does the trained map look like?
References
Canonical:
- Kohonen, Self-Organizing Maps (Springer, 3rd edition, 2001), Chapters 3-5
- Kohonen, "Self-organized formation of topologically correct feature maps" (1982)
Current:
- Vesanto & Alhoniemi, "Clustering of the self-organizing map" (IEEE Transactions on Neural Networks, 2000)
- Bishop, Pattern Recognition and Machine Learning (2006)
- Murphy, Machine Learning: A Probabilistic Perspective (2012)
Next Topics
The natural next steps from self-organizing maps:
- t-SNE and UMAP: modern nonlinear dimensionality reduction methods that generalize the neighborhood-preservation idea
- K-means clustering: the degenerate case of SOMs without neighborhood coupling
Last reviewed: April 2026