ML Methods
Graph Neural Networks
Message passing on graphs: GCN, GAT, GraphSAGE, the WL isomorphism test as an expressivity ceiling, over-smoothing in deep GNNs, and applications to molecules, social networks, and knowledge graphs.
Why This Matters
Many important data structures are graphs: molecules (atoms as nodes, bonds as edges), social networks, citation networks, knowledge graphs, protein interaction networks. Standard neural networks operate on fixed-size vectors or grids. Graph Neural Networks (GNNs) operate directly on graph-structured data, respecting the topology.
The central idea is message passing: each node updates its representation by aggregating information from its neighbors. This is analogous to how CNNs aggregate information from spatial neighborhoods, but generalized to irregular graph structures.
Formal Setup
Let $G = (V, E)$ be a graph with node set $V$ (with $|V| = n$) and edge set $E$. Each node $v \in V$ has a feature vector $x_v \in \mathbb{R}^d$. Let $\mathcal{N}(v)$ denote the neighbors of $v$.
Message Passing Neural Network
A message passing layer updates node representations as:

$$m_v^{(k)} = \mathrm{AGGREGATE}^{(k)}\left(\{ h_u^{(k-1)} : u \in \mathcal{N}(v) \}\right), \qquad h_v^{(k)} = \mathrm{UPDATE}^{(k)}\left(h_v^{(k-1)},\, m_v^{(k)}\right)$$

with $h_v^{(0)} = x_v$. The AGGREGATE function collects neighbor information (it must be permutation-invariant: the output does not depend on the ordering of neighbors). The UPDATE function combines the node's current representation with the aggregated message. After $K$ layers, each node's representation captures information from its $K$-hop neighborhood.
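The two-step update above can be sketched in NumPy. This is a minimal illustration using mean aggregation and a linear-plus-ReLU update; the function and weight names are ours, not from any library:

```python
import numpy as np

def message_passing_layer(h, neighbors, W_self, W_neigh):
    """One generic message passing layer: mean-AGGREGATE, then linear UPDATE.

    h         : (n, d) array of node features
    neighbors : dict mapping node index -> list of neighbor indices
    W_self, W_neigh : (d, d_out) weight matrices (illustrative names)
    """
    n, d = h.shape
    out = np.zeros((n, W_self.shape[1]))
    for v in range(n):
        nbrs = neighbors[v]
        # Permutation-invariant AGGREGATE: mean over the neighbor multiset
        m_v = h[nbrs].mean(axis=0) if nbrs else np.zeros(d)
        # UPDATE: combine own representation with the aggregated message (ReLU)
        out[v] = np.maximum(0.0, h[v] @ W_self + m_v @ W_neigh)
    return out
```

Because the mean does not depend on neighbor order, permuting each neighbor list leaves the output unchanged, which is exactly the permutation-invariance requirement.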
Adjacency and Degree Matrices
The adjacency matrix $A \in \{0,1\}^{n \times n}$ has $A_{ij} = 1$ if $(i, j) \in E$. The degree matrix $D$ is diagonal with $D_{ii} = \sum_j A_{ij}$. The normalized adjacency is $\hat{A} = \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2}$, where $\tilde{A} = A + I$ (self-loops added) and $\tilde{D}_{ii} = \sum_j \tilde{A}_{ij}$.
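The normalization is a few lines of NumPy (a dense sketch; the function name is ours):

```python
import numpy as np

def normalized_adjacency(A):
    """Compute the GCN normalization D̃^{-1/2} Ã D̃^{-1/2}, with self-loops added."""
    A_tilde = A + np.eye(A.shape[0])          # Ã = A + I
    d = A_tilde.sum(axis=1)                   # degrees of Ã
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))    # D̃^{-1/2}
    return D_inv_sqrt @ A_tilde @ D_inv_sqrt
```

For a triangle graph every node has degree 3 after self-loops, so every entry of $\hat{A}$ is $1/3$, and the leading eigenvalue of $\hat{A}$ is 1.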
Graph Convolutional Network (GCN)
GCN as First-Order Spectral Approximation
Statement
The GCN layer (Kipf and Welling, 2017) computes:

$$H^{(l+1)} = \sigma\left(\hat{A}\, H^{(l)}\, W^{(l)}\right)$$

where $H^{(l)} \in \mathbb{R}^{n \times d_l}$ is the matrix of node features at layer $l$, $W^{(l)}$ is a learnable weight matrix, and $\sigma$ is a nonlinearity (typically ReLU).
This is derived from spectral graph convolution $g_\theta \star x = U\, g_\theta(\Lambda)\, U^\top x$ (where $L = U \Lambda U^\top$ is the eigendecomposition of the graph Laplacian) by approximating $g_\theta(\Lambda)$ with a first-order Chebyshev polynomial and simplifying.
Intuition
The multiplication by $\hat{A}$ averages each node's features with its neighbors' features (with symmetric normalization to prevent scale issues). The weight matrix $W^{(l)}$ then transforms this averaged representation. One GCN layer is equivalent to: (1) smooth features over the graph, (2) apply a linear transform, (3) apply a nonlinearity. Stacking $L$ layers propagates information $L$ hops.
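In matrix form, one layer is a single line of NumPy (a sketch that assumes $\hat{A}$ has already been computed as above):

```python
import numpy as np

def gcn_layer(A_hat, H, W):
    """One GCN layer: H' = ReLU(Â H W), where Â is the normalized adjacency."""
    return np.maximum(0.0, A_hat @ H @ W)
```

On a triangle graph, $\hat{A}$ is the uniform averaging matrix (every entry $1/3$), so with $W = I$ every output row equals the mean of the input rows: the smoothing step made explicit.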
Proof Sketch
Start with spectral convolution: $g_\theta \star x = U\, g_\theta(\Lambda)\, U^\top x$. Approximate $g_\theta \star x \approx \theta_0 x + \theta_1 (L - I)x = \theta_0 x - \theta_1 D^{-1/2} A D^{-1/2} x$ (first-order Chebyshev with the normalized Laplacian $L = I - D^{-1/2} A D^{-1/2}$). Set $\theta = \theta_0 = -\theta_1$ (parameter sharing) to get $g_\theta \star x \approx \theta\left(I + D^{-1/2} A D^{-1/2}\right) x$. The renormalization trick replaces $I + D^{-1/2} A D^{-1/2}$ with $\tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2}$ to stabilize eigenvalues.
Why It Matters
GCN provides a simple, efficient message passing operation grounded in spectral graph theory. The spectral derivation clarifies what GCN does: it is a low-pass filter on the graph (smoothing node features along edges). This connection to spectral theory also explains GCN's limitations, particularly over-smoothing.
Failure Mode
GCN treats all neighbors equally (after degree normalization). It cannot learn to weight some neighbors more than others. For heterogeneous graphs where some edges are more informative, this uniform aggregation is suboptimal. GAT addresses this by learning attention weights per edge.
Graph Attention Network (GAT)
GAT (Velickovic et al., 2018) replaces the fixed normalization in GCN with learned attention weights:

$$\alpha_{ij} = \frac{\exp\left(\mathrm{LeakyReLU}\left(a^\top [W h_i \,\|\, W h_j]\right)\right)}{\sum_{k \in \mathcal{N}(i)} \exp\left(\mathrm{LeakyReLU}\left(a^\top [W h_i \,\|\, W h_k]\right)\right)}, \qquad h_i' = \sigma\left(\sum_{j \in \mathcal{N}(i)} \alpha_{ij}\, W h_j\right)$$

where $a$ is a learned attention vector and $\|$ denotes concatenation. GAT typically uses multi-head attention: $K$ independent attention heads whose outputs are concatenated.
The advantage: GAT can learn which neighbors are more relevant for each node. The cost: attention computation adds overhead per layer, and the attention mechanism introduces more parameters.
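A single attention head can be sketched as follows (final nonlinearity and multi-head concatenation omitted for brevity; names are illustrative). Neighborhoods here are taken to include the node itself, as in the GAT paper:

```python
import numpy as np

def gat_attention(h, neighbors, W, a, slope=0.2):
    """Single-head GAT aggregation: softmax-normalized attention per neighborhood.

    h : (n, d) node features; W : (d, d') weights; a : (2*d',) attention vector.
    """
    z = h @ W                                   # transformed features W h
    out = np.zeros_like(z)
    for i in range(len(z)):
        nbrs = neighbors[i]
        # Unnormalized scores e_ij = LeakyReLU(a^T [z_i || z_j])
        e = np.array([np.concatenate([z[i], z[j]]) @ a for j in nbrs])
        e = np.where(e > 0, e, slope * e)       # LeakyReLU
        att = np.exp(e - e.max())
        att /= att.sum()                        # softmax over the neighborhood
        out[i] = att @ z[nbrs]                  # attention-weighted sum
    return out
```

A useful sanity check: with $a = 0$ the scores are all equal, the softmax is uniform, and GAT degenerates to mean aggregation over the neighborhood.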
GraphSAGE
GraphSAGE (Hamilton et al., 2017) uses sampling and aggregation for scalability:
- For each node, sample a fixed-size subset of neighbors (e.g., 25 from 1-hop, 10 from 2-hop)
- Aggregate sampled neighbors using mean, LSTM, or max-pool
- Concatenate the node's own features with the aggregated neighbor features
- Apply a linear transform and nonlinearity
The sampling makes GraphSAGE scalable to large graphs (millions of nodes) because the computation per node is bounded regardless of the node's degree. GCN and GAT aggregate over all neighbors, which is expensive for high-degree nodes.
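The sample-then-aggregate step can be sketched as follows (mean aggregator only; function and weight names are ours). The key point is visible in the code: each node touches at most `sample_size` neighbors, so per-node cost is independent of degree:

```python
import numpy as np

def graphsage_layer(h, neighbors, W, sample_size, rng):
    """One GraphSAGE layer: sample neighbors, mean-aggregate, concat with self.

    h : (n, d) features; W : (2*d, d_out) weights applied to [self || aggregated].
    Samples with replacement when a node's degree is below sample_size.
    """
    n, d = h.shape
    out = np.zeros((n, W.shape[1]))
    for v in range(n):
        sampled = rng.choice(neighbors[v], size=sample_size,
                             replace=len(neighbors[v]) < sample_size)
        agg = h[sampled].mean(axis=0)                       # AGGREGATE (mean)
        out[v] = np.maximum(0.0, np.concatenate([h[v], agg]) @ W)
    return out
```

In the original paper the output is also L2-normalized per node; that step is omitted here.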
WL Isomorphism Test and GNN Expressivity
GNN Expressivity Upper Bound via WL Test
Statement
The Weisfeiler-Leman (WL) graph isomorphism test iteratively refines node labels: at each step, a node's label is updated based on its current label and the multiset of its neighbors' labels. Two graphs are "WL-equivalent" if the WL test cannot distinguish them.
Upper bound: No message passing GNN can distinguish two graphs that the 1-WL test cannot distinguish.
Achievability: A GNN with injective AGGREGATE and UPDATE functions (specifically, the Graph Isomorphism Network (GIN)) is as powerful as the 1-WL test.
Intuition
Each layer of a message passing GNN refines node representations using neighbor information, exactly mirroring one iteration of the WL test. If the WL test produces identical label sequences for two nodes (or two graphs), no amount of message passing can tell them apart. The GNN's aggregation function determines whether it achieves this upper bound: sum aggregation (GIN) is maximally expressive; mean and max aggregation lose information about neighbor multisets.
Proof Sketch
The proof by Xu et al. (2019) proceeds in two steps. (1) Show that if a GNN maps two different multisets of neighbor features to the same output, it is strictly less powerful than 1-WL. Sum aggregation over injectively encoded elements preserves the multiset: distinct multisets (over a countable feature domain) produce distinct sums. Mean and max lose information (e.g., mean cannot distinguish $\{a, b\}$ from $\{a, a, b, b\}$). (2) Show that with injective AGGREGATE and UPDATE functions (modeled by MLPs via the universal approximation theorem), the GNN matches 1-WL step for step.
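The mean-versus-sum distinction in step (1) can be checked numerically on the two multisets above:

```python
import numpy as np

# Two different neighbor multisets that mean aggregation confuses
a, b = np.array([1.0, 0.0]), np.array([0.0, 1.0])
m1 = [a, b]            # multiset {a, b}
m2 = [a, a, b, b]      # multiset {a, a, b, b}

mean1, mean2 = np.mean(m1, axis=0), np.mean(m2, axis=0)
sum1,  sum2  = np.sum(m1, axis=0),  np.sum(m2, axis=0)

print(np.allclose(mean1, mean2))   # True: mean collapses the two multisets
print(np.allclose(sum1, sum2))     # False: sum separates them
```

Both multisets have mean $(0.5, 0.5)$, but their sums are $(1, 1)$ and $(2, 2)$: the sum retains the multiplicity information that the mean discards.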
Why It Matters
This theorem sets a hard ceiling on what standard message passing GNNs can compute. It explains why GNNs fail on certain graph tasks (e.g., counting cycles of length > 6, detecting certain subgraph patterns) and motivates higher-order GNN architectures that go beyond 1-WL (e.g., -WL GNNs, which use -tuples of nodes).
Failure Mode
The 1-WL test cannot distinguish certain non-isomorphic regular graphs (e.g., some pairs of 3-regular graphs on 8 nodes). Standard GNNs inherit this limitation. For tasks that require distinguishing such graphs (e.g., certain molecular properties), higher-order methods or augmentations (random features, positional encodings) are needed.
Over-Smoothing
As GNN depth increases, node representations converge to the same vector. After $k$ message passing layers, each node's representation depends on its $k$-hop neighborhood. For a connected graph with diameter $\delta$, once $k \geq \delta$, every node sees every other node. The repeated averaging (in GCN) drives all representations toward the graph's leading eigenvector.
Formally, for GCN without nonlinearity: $H^{(k)} = \hat{A}^k H^{(0)} W$. Since $\hat{A}$ is symmetric with leading eigenvalue $1$ and leading eigenvector $u \propto \tilde{D}^{1/2}\mathbf{1}$, the powers $\hat{A}^k$ converge to the rank-1 matrix $u u^\top$ (for the random-walk normalization $\tilde{D}^{-1}\tilde{A}$, the limit is $\mathbf{1}\pi^\top$ with $\pi$ the stationary distribution), causing all node representations to collapse onto a single direction.
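This collapse is easy to observe numerically: powers of $\hat{A}$ drive the pairwise cosine similarity of node representations toward 1. A small sketch on a 5-node path graph with one-hot initial features:

```python
import numpy as np

def normalized_adj(A):
    """GCN normalization D̃^{-1/2} Ã D̃^{-1/2} with self-loops."""
    A_t = A + np.eye(len(A))
    D_inv_sqrt = np.diag(1.0 / np.sqrt(A_t.sum(axis=1)))
    return D_inv_sqrt @ A_t @ D_inv_sqrt

def min_pairwise_cosine(H):
    """Smallest cosine similarity between any two node representations."""
    Hn = H / np.linalg.norm(H, axis=1, keepdims=True)
    return (Hn @ Hn.T).min()

# Path graph on 5 nodes
A = np.zeros((5, 5))
for i in range(4):
    A[i, i + 1] = A[i + 1, i] = 1.0
A_hat = normalized_adj(A)

H0 = np.eye(5)   # one-hot features: every node starts maximally distinct
for k in (1, 5, 50):
    Hk = np.linalg.matrix_power(A_hat, k) @ H0
    print(k, round(float(min_pairwise_cosine(Hk)), 4))
```

At depth 1 the endpoint nodes have disjoint receptive fields (cosine 0); by depth 50 all rows are nearly parallel, i.e. the representations have over-smoothed.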
Practical consequence: GNNs with more than 4-8 layers often perform worse than shallower ones. This limits the effective receptive field and distinguishes GNNs from CNNs, where additional depth keeps helping up to much greater layer counts (aided by techniques such as residual connections).
Mitigations: residual connections, jumping knowledge (aggregate representations across all layers), normalization techniques (e.g., PairNorm), hierarchical pooling (e.g., DiffPool), and graph transformers that bypass local message passing.
Applications
Molecular property prediction. Atoms are nodes, bonds are edges. GNNs predict molecular properties (toxicity, solubility, binding affinity) from the molecular graph. The Open Catalyst Project uses GNNs for catalyst discovery.
Social network analysis. Users are nodes, connections are edges. GNNs predict user behavior, detect communities, and identify influential nodes.
Recommendation systems. Users and items form a bipartite graph. GNNs propagate preferences through the graph to predict ratings. PinSage (Ying et al., 2018) deployed a GNN for Pinterest recommendations at scale.
Knowledge graph completion. Entities are nodes, relations are edges. GNNs predict missing links (e.g., given "X is a protein" and "X interacts with Y," predict Y's function).
Common Confusions
Message passing GNNs cannot distinguish all non-isomorphic graphs
Every message passing GNN is bounded by the 1-WL test, and a GNN with mean or max aggregation falls short of even that bound. Only GNNs with injective aggregation (GIN) achieve maximal distinguishing power within the 1-WL limit. Claims that "GNNs respect graph structure" must be qualified by this expressivity bound.
Graph convolution is not the same as image convolution
In images, the grid structure is fixed and uniform (every pixel has the same number of neighbors in the same relative positions). In graphs, nodes have varying degrees and no canonical ordering of neighbors. Graph convolution must be permutation-invariant with respect to neighbor ordering, while image convolution exploits the fixed grid.
Over-smoothing is not the same as oversmoothing in statistics
Over-smoothing in GNNs refers to node representations converging to the same vector with increasing depth. This is a structural property of iterated message passing on graphs, not a bias-variance issue. It has no analogue in standard feedforward networks, where depth generally increases representational power.
Canonical Examples
GCN on a citation network
In the Cora citation network (2,708 papers, 5,429 citation edges, 7 classes), a 2-layer GCN with 16 hidden units achieves approximately 81% node classification accuracy using only 140 labeled nodes (20 per class). The GCN propagates label information through the citation graph: if paper A cites papers B and C which are both about "neural networks," the GCN infers A is likely about neural networks too. A 3-layer GCN performs similarly; a 10-layer GCN drops to about 60% due to over-smoothing.
Exercises
Problem
Write the GCN update rule for a single node $v$ with neighbors $u_1, u_2, u_3$ in a graph with self-loops, using symmetric normalization. If all four nodes (including $v$) have degree 3 (after adding self-loops: degree 4), what is the coefficient on each neighbor's features?
Problem
Prove that mean aggregation cannot distinguish the multisets $\{a, b\}$ and $\{a, a, b, b\}$ but sum aggregation can. Give an example of two non-isomorphic graphs where this distinction matters for node classification.
Problem
Why does over-smoothing occur in GCNs but not in standard deep feedforward networks? Relate your answer to the spectral properties of the normalized adjacency matrix $\hat{A}$.
References
Canonical:
- Kipf & Welling, "Semi-Supervised Classification with Graph Convolutional Networks" (ICLR 2017)
- Velickovic et al., "Graph Attention Networks" (ICLR 2018)
- Xu et al., "How Powerful are Graph Neural Networks?" (ICLR 2019), Sections 3-5
Current:
- Wu et al., "A Comprehensive Survey on Graph Neural Networks" (IEEE TNNLS 2021), Sections 2-5
- Hamilton, Graph Representation Learning (2020), Chapters 5-7
Next Topics
GNNs connect to spectral graph theory, geometric deep learning, and the study of equivariant neural networks.
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Convolutional Neural Networks (Layer 3)
- Feedforward Networks and Backpropagation (Layer 2)
- Differentiation in Rn (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
- Matrix Calculus (Layer 1)
- The Jacobian Matrix (Layer 0A)
- The Hessian Matrix (Layer 0A)
- Activation Functions (Layer 1)
- Convex Optimization Basics (Layer 1)
- Matrix Operations and Properties (Layer 0A)
- Vectors, Matrices, and Linear Maps (Layer 0A)
- Eigenvalues and Eigenvectors (Layer 0A)
Builds on This
- Equivariant Deep Learning (Layer 4)