
ML Methods

Graph Neural Networks

Message passing on graphs: GCN, GAT, GraphSAGE, the WL isomorphism test as an expressivity ceiling, over-smoothing in deep GNNs, and applications to molecules, social networks, and knowledge graphs.

Advanced · Tier 2 · Current · ~55 min

Why This Matters

Many important data structures are graphs: molecules (atoms as nodes, bonds as edges), social networks, citation networks, knowledge graphs, protein interaction networks. Standard neural networks operate on fixed-size vectors or grids. Graph Neural Networks (GNNs) operate directly on graph-structured data, respecting the topology.

The central idea is message passing: each node updates its representation by aggregating information from its neighbors. This is analogous to how CNNs aggregate information from spatial neighborhoods, but generalized to irregular graph structures.

Formal Setup

Let $G = (V, E)$ be a graph with node set $V$ ($|V| = n$) and edge set $E$. Each node $v$ has a feature vector $h_v^{(0)} \in \mathbb{R}^d$. Let $\mathcal{N}(v)$ denote the neighbors of $v$.

Definition

Message Passing Neural Network

A message passing layer updates node representations as:

$$h_v^{(\ell+1)} = \text{UPDATE}^{(\ell)}\left(h_v^{(\ell)},\ \text{AGGREGATE}^{(\ell)}\left(\{h_u^{(\ell)} : u \in \mathcal{N}(v)\}\right)\right)$$

The AGGREGATE function collects neighbor information and must be permutation-invariant: its output does not depend on the ordering of neighbors. The UPDATE function combines the node's current representation with the aggregated message. After $L$ layers, each node's representation captures information from its $L$-hop neighborhood.
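The update rule above can be sketched directly in NumPy with sum aggregation. This is an illustrative sketch, not a library API; the function and weight names (`message_passing_layer`, `W_self`, `W_neigh`) are hypothetical:

```python
import numpy as np

def message_passing_layer(H, adj_list, W_self, W_neigh):
    """One message passing step: sum-aggregate neighbor features,
    then combine with the node's own representation.

    H:        (n, d) node features at layer l
    adj_list: list of neighbor index lists, adj_list[v] = N(v)
    W_self, W_neigh: (d, d_out) weight matrices (illustrative names)
    """
    n, d = H.shape
    out = np.zeros((n, W_self.shape[1]))
    for v in range(n):
        # AGGREGATE: permutation-invariant sum over neighbors
        msg = sum((H[u] for u in adj_list[v]), np.zeros(d))
        # UPDATE: combine self representation with aggregated message (ReLU)
        out[v] = np.maximum(0.0, H[v] @ W_self + msg @ W_neigh)
    return out
```

Permutation invariance comes from the sum: reordering any `adj_list[v]` leaves the output unchanged.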

Definition

Adjacency and Degree Matrices

The adjacency matrix $A \in \{0,1\}^{n \times n}$ has $A_{ij} = 1$ if $(i,j) \in E$. The degree matrix $D$ is diagonal with $D_{ii} = \sum_j A_{ij}$. The normalized adjacency is $\hat{A} = \tilde{D}^{-1/2}\tilde{A}\tilde{D}^{-1/2}$, where $\tilde{A} = A + I$ (self-loops added) and $\tilde{D}_{ii} = \sum_j \tilde{A}_{ij}$.
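A minimal dense-NumPy sketch of this construction (the function name is illustrative):

```python
import numpy as np

def normalized_adjacency(A):
    """A_hat = D~^{-1/2} (A + I) D~^{-1/2} from a 0/1 adjacency matrix."""
    A_tilde = A + np.eye(A.shape[0])        # add self-loops
    d = A_tilde.sum(axis=1)                 # diagonal of D~
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    return D_inv_sqrt @ A_tilde @ D_inv_sqrt

A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)     # path graph on 3 nodes
A_hat = normalized_adjacency(A)
```

For a connected graph, $\hat{A}$ is symmetric with largest eigenvalue exactly 1, which is what makes repeated application stable.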

Graph Convolutional Network (GCN)

Proposition

GCN as First-Order Spectral Approximation

Statement

The GCN layer (Kipf and Welling, 2017) computes:

$$H^{(\ell+1)} = \sigma\left(\hat{A} H^{(\ell)} W^{(\ell)}\right)$$

where $H^{(\ell)} \in \mathbb{R}^{n \times d_\ell}$ is the matrix of node features at layer $\ell$, $W^{(\ell)} \in \mathbb{R}^{d_\ell \times d_{\ell+1}}$ is a learnable weight matrix, and $\sigma$ is a nonlinearity (typically ReLU).

This is derived from spectral graph convolution $g_\theta \star x = U g_\theta(\Lambda) U^\top x$ (where $L = U\Lambda U^\top$ is the eigendecomposition of the graph Laplacian) by approximating $g_\theta(\Lambda)$ with a first-order Chebyshev polynomial and simplifying.

Intuition

Multiplication by $\hat{A}$ averages each node's features with its neighbors' features, with symmetric normalization to prevent scale issues. The weight matrix $W$ then transforms this averaged representation. One GCN layer is equivalent to: (1) smooth features over the graph, (2) apply a linear transform, (3) apply a nonlinearity. Stacking $L$ layers propagates information $L$ hops.
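The three steps read off directly as code. A toy dense sketch (the function name `gcn_layer` is illustrative; `A_hat` is the normalized adjacency defined above):

```python
import numpy as np

def gcn_layer(A_hat, H, W):
    """One GCN layer: (1) smooth features over the graph,
    (2) apply a linear transform, (3) apply a ReLU nonlinearity."""
    return np.maximum(0.0, A_hat @ H @ W)

# Toy graph: a single edge between nodes 0 and 1.
A = np.array([[0.0, 1.0], [1.0, 0.0]])
A_tilde = A + np.eye(2)
d = A_tilde.sum(axis=1)
A_hat = A_tilde / np.sqrt(np.outer(d, d))   # D~^{-1/2} A~ D~^{-1/2}

rng = np.random.default_rng(0)
H0 = rng.normal(size=(2, 4))
W0, W1 = rng.normal(size=(4, 4)), rng.normal(size=(4, 2))
H2 = gcn_layer(A_hat, gcn_layer(A_hat, H0, W0), W1)  # two hops of propagation
```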

Proof Sketch

Start with spectral convolution $g_\theta \star x = U g_\theta(\Lambda) U^\top x$. A first-order Chebyshev approximation on the normalized Laplacian $L = I - D^{-1/2} A D^{-1/2}$ gives $g_\theta \star x \approx \theta_0 x - \theta_1 D^{-1/2} A D^{-1/2} x$. Setting $\theta = \theta_0 = -\theta_1$ (parameter sharing) yields $g_\theta \star x = \theta\left(I + D^{-1/2} A D^{-1/2}\right) x$. The renormalization trick replaces $I + D^{-1/2} A D^{-1/2}$ with $\tilde{D}^{-1/2}\tilde{A}\tilde{D}^{-1/2} = \hat{A}$, whose eigenvalues lie in $[-1, 1]$, stabilizing repeated application.

Why It Matters

GCN provides a simple, efficient message passing operation grounded in spectral graph theory. The spectral derivation clarifies what GCN does: it is a low-pass filter on the graph (smoothing node features along edges). This connection to spectral theory also explains GCN's limitations, particularly over-smoothing.

Failure Mode

GCN treats all neighbors equally (after degree normalization). It cannot learn to weight some neighbors more than others. For heterogeneous graphs where some edges are more informative, this uniform aggregation is suboptimal. GAT addresses this by learning attention weights per edge.

Graph Attention Network (GAT)

GAT (Velickovic et al., 2018) replaces the fixed normalization in GCN with learned attention weights:

$$\alpha_{ij} = \frac{\exp\left(\text{LeakyReLU}\left(a^\top [W h_i \,\|\, W h_j]\right)\right)}{\sum_{k \in \mathcal{N}(i)} \exp\left(\text{LeakyReLU}\left(a^\top [W h_i \,\|\, W h_k]\right)\right)}$$

$$h_i^{(\ell+1)} = \sigma\left(\sum_{j \in \mathcal{N}(i)} \alpha_{ij} W h_j^{(\ell)}\right)$$

where $a \in \mathbb{R}^{2d'}$ is a learned attention vector and $\|$ denotes concatenation. GAT typically uses multi-head attention: $K$ independent attention heads whose outputs are concatenated.

The advantage: GAT can learn which neighbors are more relevant for each node. The cost: attention computation adds $O(|E| \cdot d)$ overhead per layer, and the attention mechanism introduces more parameters.
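The attention computation for one node can be sketched as follows (single head, dense NumPy, illustrative names; real implementations batch this over all edges):

```python
import numpy as np

def gat_attention(h_i, neighbors, W, a):
    """Attention weights alpha_ij of node i over its neighbors (one head).

    h_i: (d,) features of node i; neighbors: (k, d) features of N(i)
    W: (d, d') shared linear map; a: (2*d',) attention vector
    """
    Whi = h_i @ W
    scores = []
    for h_j in neighbors:
        e = a @ np.concatenate([Whi, h_j @ W])   # a^T [W h_i || W h_j]
        scores.append(e if e > 0 else 0.2 * e)   # LeakyReLU, slope 0.2
    scores = np.array(scores)
    scores -= scores.max()                       # stabilize the softmax
    weights = np.exp(scores)
    return weights / weights.sum()               # softmax over N(i)
```

The weights are strictly positive and sum to 1, so the aggregation is a learned convex combination of neighbor features.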

GraphSAGE

GraphSAGE (Hamilton et al., 2017) uses sampling and aggregation for scalability:

  1. For each node, sample a fixed-size subset of neighbors (e.g., 25 from 1-hop, 10 from 2-hop)
  2. Aggregate sampled neighbors using mean, LSTM, or max-pool
  3. Concatenate the node's own features with the aggregated neighbor features
  4. Apply a linear transform and nonlinearity

The sampling makes GraphSAGE scalable to large graphs (millions of nodes) because the computation per node is bounded regardless of the node's degree. GCN and GAT aggregate over all neighbors, which is expensive for high-degree nodes.
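The four steps above can be sketched as one layer with a mean aggregator (illustrative names; the original uses fixed per-hop sample sizes and offers LSTM and max-pool aggregators as well):

```python
import numpy as np

def sage_layer(H, adj_list, W, rng, num_samples=2):
    """One GraphSAGE layer with neighbor sampling and mean aggregation.

    H: (n, d) node features; adj_list[v] lists v's neighbors;
    W: (2d, d_out) weights applied to [self || aggregated neighbors].
    """
    n, d = H.shape
    out = np.zeros((n, W.shape[1]))
    for v in range(n):
        nbrs = np.asarray(adj_list[v])
        k = min(num_samples, len(nbrs))
        sampled = rng.choice(nbrs, size=k, replace=False)  # bounded work per node
        agg = H[sampled].mean(axis=0)                      # mean aggregation
        out[v] = np.maximum(0.0, np.concatenate([H[v], agg]) @ W)
    return out
```

Because at most `num_samples` neighbors are touched per node, the cost per node is constant regardless of degree.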

WL Isomorphism Test and GNN Expressivity

Theorem

GNN Expressivity Upper Bound via WL Test

Statement

The Weisfeiler-Leman (WL) graph isomorphism test iteratively refines node labels: at each step, a node's label is updated based on its current label and the multiset of its neighbors' labels. Two graphs are "WL-equivalent" if the WL test cannot distinguish them.

Upper bound: No message passing GNN can distinguish two graphs that the 1-WL test cannot distinguish.

Achievability: A GNN with injective AGGREGATE and UPDATE functions (specifically, the Graph Isomorphism Network (GIN)) is as powerful as the 1-WL test.

Intuition

Each layer of a message passing GNN refines node representations using neighbor information, exactly mirroring one iteration of the WL test. If the WL test produces identical label sequences for two nodes (or two graphs), no amount of message passing can tell them apart. The GNN's aggregation function determines whether it achieves this upper bound: sum aggregation (GIN) is maximally expressive; mean and max aggregation lose information about neighbor multisets.

Proof Sketch

The proof by Xu et al. (2019) proceeds in two steps. (1) If the aggregation function maps two different multisets of neighbor features to the same output, the GNN is strictly less powerful than 1-WL. Sum aggregation composed with an injective feature encoding is injective on multisets; mean and max lose information (e.g., mean cannot distinguish $\{1, 1, 1\}$ from $\{1\}$). (2) With injective AGGREGATE and UPDATE functions (realizable by MLPs, per the universal approximation theorem), the GNN refines node representations exactly as 1-WL refines labels, step for step.
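The mean-versus-sum point in step (1) is easy to check numerically:

```python
import numpy as np

m1 = np.array([1.0, 1.0, 1.0])   # neighbor multiset {1, 1, 1}
m2 = np.array([1.0])             # neighbor multiset {1}

mean_same = np.isclose(m1.mean(), m2.mean())  # True: mean collapses them
sum_same = np.isclose(m1.sum(), m2.sum())     # False: sum separates them (3 vs 1)
```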

Why It Matters

This theorem sets a hard ceiling on what standard message passing GNNs can compute. It explains why GNNs fail on certain graph tasks (e.g., counting cycles of length > 6, detecting certain subgraph patterns) and motivates higher-order architectures that go beyond 1-WL (e.g., $k$-WL GNNs, which operate on $k$-tuples of nodes).

Failure Mode

The 1-WL test cannot distinguish certain non-isomorphic regular graphs (e.g., some pairs of 3-regular graphs on 8 nodes). Standard GNNs inherit this limitation. For tasks that require distinguishing such graphs (e.g., certain molecular properties), higher-order methods or augmentations (random features, positional encodings) are needed.
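A small concrete check of this limitation: $K_{3,3}$ and the triangular prism are both 3-regular on 6 nodes and not isomorphic (one is bipartite, the other contains triangles), yet 1-WL assigns both the same color histogram. A compact refinement sketch (function name illustrative):

```python
from collections import Counter

def wl_colors(adj, rounds=3):
    """1-WL color refinement: a node's new color encodes its current color
    plus the multiset of its neighbors' colors."""
    colors = {v: 0 for v in adj}
    for _ in range(rounds):
        sig = {v: (colors[v], tuple(sorted(colors[u] for u in adj[v])))
               for v in adj}
        relabel = {s: i for i, s in enumerate(sorted(set(sig.values())))}
        colors = {v: relabel[sig[v]] for v in adj}
    return Counter(colors.values())   # color histogram of the graph

# K_{3,3} and the triangular prism: both 3-regular on 6 nodes, not isomorphic.
k33 = {0: [3, 4, 5], 1: [3, 4, 5], 2: [3, 4, 5],
       3: [0, 1, 2], 4: [0, 1, 2], 5: [0, 1, 2]}
prism = {0: [1, 2, 3], 1: [0, 2, 4], 2: [0, 1, 5],
         3: [0, 4, 5], 4: [1, 3, 5], 5: [2, 3, 4]}
```

Every node in both graphs keeps the same color at every round (all have three identically colored neighbors), so the histograms never diverge.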

Over-Smoothing

As GNN depth increases, node representations converge to the same vector. After $L$ message passing layers, each node's representation depends on its $L$-hop neighborhood. For a connected graph with diameter $d$, once $L \geq d$, every node sees every other node. The repeated averaging (in GCN) drives all representations toward the graph's leading eigenvector.

Formally, for GCN without nonlinearity: $H^{(L)} = \hat{A}^L H^{(0)} W^{(0)} \cdots W^{(L-1)}$. Since $\hat{A}$ is symmetric with leading eigenvalue 1 and (for a connected graph with self-loops) all other eigenvalues strictly smaller in magnitude, $\hat{A}^L$ converges to the rank-1 matrix $vv^\top$, where $v$ is the unit leading eigenvector with entries proportional to $\sqrt{\tilde{D}_{ii}}$. All node features collapse onto this single direction.
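This collapse is visible numerically. Repeatedly applying $\hat{A}$ to random features on a small path graph drives the row-normalized node features together (a sketch under the linear, no-nonlinearity simplification above; the helper name is illustrative):

```python
import numpy as np

# Path graph on 4 nodes with symmetric normalization.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
A_tilde = A + np.eye(4)
d = A_tilde.sum(axis=1)
A_hat = A_tilde / np.sqrt(np.outer(d, d))   # D~^{-1/2} A~ D~^{-1/2}

def direction_spread(X):
    """Max per-feature range after normalizing each node's feature row:
    near zero when all rows point in the same direction."""
    R = X / np.linalg.norm(X, axis=1, keepdims=True)
    return np.ptp(R, axis=0).max()

H = np.random.default_rng(0).normal(size=(4, 3))
before = direction_spread(H)
for _ in range(50):
    H = A_hat @ H            # 50 linear GCN layers, no nonlinearity
after = direction_spread(H)  # shrinks toward 0: over-smoothing
```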

Practical consequence: GNNs with more than 4-8 layers often perform worse than shallower ones. This limits the effective receptive field and distinguishes GNNs from CNNs, where added depth continues to help up to much greater depths.

Mitigations: residual connections, jumping knowledge (aggregate representations from all layers), normalization techniques such as PairNorm, and graph transformers that bypass purely local message passing.

Applications

Molecular property prediction. Atoms are nodes, bonds are edges. GNNs predict molecular properties (toxicity, solubility, binding affinity) from the molecular graph. The Open Catalyst Project uses GNNs for catalyst discovery.

Social network analysis. Users are nodes, connections are edges. GNNs predict user behavior, detect communities, and identify influential nodes.

Recommendation systems. Users and items form a bipartite graph. GNNs propagate preferences through the graph to predict ratings. PinSage (Ying et al., 2018) deployed a GNN for Pinterest recommendations at scale.

Knowledge graph completion. Entities are nodes, relations are edges. GNNs predict missing links: relation edges that should exist but are unobserved (e.g., inferring an unrecorded protein-protein interaction from the two proteins' shared neighbors in the graph).

Common Confusions

Watch Out

Permutation invariance is not the same as distinguishing power

Message passing GNNs are invariant to node relabeling, but that invariance does not mean they can separate all non-isomorphic graphs: a GNN with mean aggregation cannot distinguish certain non-isomorphic graphs (it is limited by 1-WL). Only GNNs with injective aggregation (GIN) achieve maximal distinguishing power within the 1-WL limit. Claims that "GNNs respect graph structure" must be qualified by this expressivity bound.

Watch Out

Graph convolution is not the same as image convolution

In images, the grid structure is fixed and uniform (every pixel has the same number of neighbors in the same relative positions). In graphs, nodes have varying degrees and no canonical ordering of neighbors. Graph convolution must be permutation-invariant with respect to neighbor ordering, while image convolution exploits the fixed grid.

Watch Out

Over-smoothing is not the same as oversmoothing in statistics

Over-smoothing in GNNs refers to node representations converging to the same vector with increasing depth. This is a structural property of iterated message passing on graphs, not a bias-variance issue. It has no analogue in standard feedforward networks, where depth generally increases representational power.

Canonical Examples

Example

GCN on a citation network

In the Cora citation network (2,708 papers, 5,429 citation edges, 7 classes), a 2-layer GCN with 16 hidden units achieves approximately 81% node classification accuracy using only 140 labeled nodes (20 per class). The GCN propagates label information through the citation graph: if paper A cites papers B and C which are both about "neural networks," the GCN infers A is likely about neural networks too. A 3-layer GCN performs similarly; a 10-layer GCN drops to about 60% due to over-smoothing.

Exercises

ExerciseCore

Problem

Write the GCN update rule for a single node $v$ with neighbors $\mathcal{N}(v) = \{u_1, u_2, u_3\}$ in a graph with self-loops, using symmetric normalization. If all four nodes (including $v$) have degree 3 before self-loops are added (degree 4 after), what is the coefficient on each neighbor's features?

ExerciseAdvanced

Problem

Prove that mean aggregation cannot distinguish the multisets $\{1, 1, 1, 1\}$ and $\{1, 1\}$ but sum aggregation can. Give an example of two non-isomorphic graphs where this distinction matters for node classification.

ExerciseResearch

Problem

Why does over-smoothing occur in GCNs but not in standard deep feedforward networks? Relate your answer to the spectral properties of the normalized adjacency matrix $\hat{A}$.

References

Canonical:

  • Kipf & Welling, "Semi-Supervised Classification with Graph Convolutional Networks" (ICLR 2017)
  • Velickovic et al., "Graph Attention Networks" (ICLR 2018)
  • Xu et al., "How Powerful are Graph Neural Networks?" (ICLR 2019), Sections 3-5

Current:

  • Wu et al., "A Comprehensive Survey on Graph Neural Networks" (IEEE TNNLS 2021), Sections 2-5
  • Hamilton, Graph Representation Learning (2020), Chapters 5-7

Next Topics

GNNs connect to spectral graph theory, geometric deep learning, and the study of equivariant neural networks.

Last reviewed: April 2026
