
Algorithms Foundations

Graph Algorithms Essentials

The graph algorithms every ML practitioner needs: BFS, DFS, Dijkstra, MST, and topological sort. Why they matter for computational graphs, knowledge graphs, dependency resolution, and GNNs.


Why This Matters

Graphs are everywhere in ML, even when you do not see them. A neural network's computation is a directed acyclic graph. Backpropagation is reverse topological order traversal. Package dependencies form a DAG. Knowledge graphs encode relational data. Graph neural networks operate directly on graph-structured data.

You cannot understand automatic differentiation without topological sort. You cannot understand message passing in GNNs without BFS. You cannot understand shortest-path problems in reasoning without Dijkstra. These algorithms are the vocabulary of graph computation.

Mental Model

A graph is a set of nodes connected by edges. Algorithms on graphs answer fundamental questions: can I reach node B from node A? What is the shortest path? What is the cheapest way to connect all nodes? Is there a valid ordering of nodes that respects all dependencies?

Each algorithm uses a different strategy: BFS explores level by level (breadth first), DFS explores as deep as possible first, Dijkstra greedily expands the closest unvisited node, and MST algorithms greedily add the cheapest safe edge.

Core Definitions

Definition

Graph

A graph G = (V, E) consists of a set of vertices (nodes) V and a set of edges E. Edges can be directed (ordered pairs (u, v)) or undirected (unordered pairs {u, v}). A weighted graph assigns a weight w(u, v) to each edge. We use n = |V| for the number of vertices and m = |E| for the number of edges.

Definition

Directed Acyclic Graph (DAG)

A DAG is a directed graph with no directed cycles. Every DAG has at least one topological ordering. Computational graphs in neural networks are DAGs: nodes are operations, edges are data dependencies, and the absence of cycles means the computation can proceed in a well-defined order.

BFS: Breadth-First Search

Definition

Breadth-First Search (BFS)

BFS explores a graph layer by layer from a source node s. It uses a queue:

  1. Initialize: enqueue s, mark s as visited, set dist(s) = 0
  2. While the queue is not empty:
    • Dequeue node u
    • For each neighbor v of u that is unvisited:
      • Mark v as visited, set dist(v) = dist(u) + 1
      • Enqueue v

Time complexity: O(n + m). Space complexity: O(n).

BFS computes the shortest path in unweighted graphs: the first time BFS reaches a node, it has found the shortest path (fewest edges) from the source. This follows directly from the layer-by-layer exploration: all nodes at distance k are visited before any node at distance k + 1.
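The procedure above can be sketched in a few lines of Python; the adjacency-list representation and the example graph are illustrative choices, not part of the definition:

```python
from collections import deque

def bfs_distances(adj, s):
    """BFS from source s. adj maps node -> list of neighbors.
    Returns shortest edge-count distances to all reachable nodes."""
    dist = {s: 0}
    queue = deque([s])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:           # unvisited
                dist[v] = dist[u] + 1   # one layer deeper than u
                queue.append(v)
    return dist

# Unweighted graph: distances count edges from the source.
adj = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2, 4], 4: [3]}
print(bfs_distances(adj, 0))  # {0: 0, 1: 1, 2: 1, 3: 2, 4: 3}
```

Note that the visited check happens at enqueue time, which is exactly why the first assignment to dist(v) is already the final (shortest) distance.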

ML application: In GNNs, the k-hop neighborhood of a node (all nodes reachable in at most k steps) is exactly what k rounds of BFS-like message passing would explore. The number of GNN layers determines the receptive field, analogous to BFS depth.

DFS: Depth-First Search

Definition

Depth-First Search (DFS)

DFS explores as far as possible along each branch before backtracking. It uses a stack (or recursion):

  1. Initialize: mark all nodes as unvisited
  2. For each unvisited node u: call DFS-Visit(u)
  3. DFS-Visit(u): mark u as visited. For each neighbor v of u that is unvisited, recursively call DFS-Visit(v). After all neighbors are processed, record u's finish time.

Time complexity: O(n + m). Space complexity: O(n).

DFS provides two critical capabilities:

  1. Cycle detection: a directed graph has a cycle if and only if DFS encounters a back edge (an edge from a node to one of its ancestors in the DFS tree). This is used to validate that computational graphs are acyclic.
  2. Topological sort: process nodes in reverse order of DFS finish times. Nodes that finish later are placed earlier in the ordering.
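Both capabilities fall out of one recursive DFS. A minimal sketch using the standard three-color scheme (white = unvisited, gray = on the current path, black = finished); the example DAG is illustrative:

```python
def dfs_order_and_cycles(adj):
    """DFS over a directed graph given as node -> list of neighbors.
    Returns (finish_order, has_cycle); a gray neighbor means a back edge."""
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {u: WHITE for u in adj}
    finish_order = []
    has_cycle = False

    def visit(u):
        nonlocal has_cycle
        color[u] = GRAY                  # u is on the current DFS path
        for v in adj[u]:
            if color[v] == GRAY:         # back edge to an ancestor: cycle
                has_cycle = True
            elif color[v] == WHITE:
                visit(v)
        color[u] = BLACK
        finish_order.append(u)           # record finish time (by position)

    for u in adj:
        if color[u] == WHITE:
            visit(u)
    return finish_order, has_cycle

dag = {'a': ['b', 'c'], 'b': ['d'], 'c': ['d'], 'd': []}
order, cyclic = dfs_order_and_cycles(dag)
print(list(reversed(order)), cyclic)  # ['a', 'c', 'b', 'd'] False
```

Reversing the finish order gives a topological ordering, exactly as described in capability 2.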

Topological Sort

Definition

Topological Sort

A topological ordering of a DAG is a linear ordering of its vertices such that for every directed edge (u, v), u appears before v. A topological sort can be computed by:

  1. Run DFS on the entire graph
  2. Output vertices in reverse order of finish times

Alternatively, use Kahn's algorithm: repeatedly remove nodes with in-degree zero and add them to the ordering.

Time complexity: O(n + m).
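Kahn's algorithm is a minimal sketch in Python; the chain of layer names in the example is illustrative:

```python
from collections import deque

def kahn_toposort(adj):
    """Kahn's algorithm: repeatedly remove in-degree-0 nodes.
    Returns a topological order, or None if the graph has a cycle."""
    indeg = {u: 0 for u in adj}
    for u in adj:
        for v in adj[u]:
            indeg[v] += 1
    queue = deque(u for u in adj if indeg[u] == 0)
    order = []
    while queue:
        u = queue.popleft()
        order.append(u)
        for v in adj[u]:
            indeg[v] -= 1
            if indeg[v] == 0:        # all of v's dependencies are placed
                queue.append(v)
    # If some vertices were never reached, they sit on a cycle.
    return order if len(order) == len(adj) else None

dag = {'in': ['h1'], 'h1': ['h2'], 'h2': ['out'], 'out': []}
print(kahn_toposort(dag))                        # ['in', 'h1', 'h2', 'out']
print(kahn_toposort({'a': ['b'], 'b': ['a']}))   # None (cycle)
```

The None return value is the cycle-detection behavior used by dependency systems: a nonempty graph with no in-degree-0 vertex left must contain a cycle.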

ML application: Automatic differentiation computes gradients by traversing the computational graph in reverse topological order. The forward pass evaluates operations in topological order (each operation runs after its inputs are available). The backward pass accumulates gradients in reverse topological order (each node's gradient is computed after all downstream gradients are available). This is why backpropagation requires a DAG structure.

Dijkstra's Algorithm

Theorem

Dijkstra's Algorithm Correctness

Statement

Dijkstra's algorithm computes the shortest-path distance from source s to all reachable vertices. It maintains a set S of vertices whose shortest distances are finalized and a priority queue of tentative distances. At each step, it extracts the vertex u with minimum tentative distance, adds u to S, and relaxes all edges leaving u:

dist(v) ← min(dist(v), dist(u) + w(u, v))

When u is extracted, dist(u) equals the true shortest-path distance δ(s, u).

Time complexity: O((n + m) log n) with a binary heap, O(m + n log n) with a Fibonacci heap.

Intuition

Dijkstra's algorithm is a greedy algorithm. At each step, it finalizes the unvisited vertex closest to the source. This is correct because all edge weights are non-negative: no future path through an unvisited vertex can be shorter than the direct path already found. The vertex with the smallest tentative distance cannot be improved by going through other unvisited vertices (which are farther away) and then traversing non-negative edges.
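This greedy loop maps directly onto a binary heap. A minimal sketch using the common lazy-deletion variant (stale heap entries are skipped instead of decreased in place); the small example graph is illustrative:

```python
import heapq

def dijkstra(adj, s):
    """adj maps node -> list of (neighbor, weight); weights non-negative.
    Returns shortest-path distances from s to all reachable nodes."""
    dist = {s: 0}
    done = set()
    heap = [(0, s)]
    while heap:
        d, u = heapq.heappop(heap)
        if u in done:                # stale entry: u already finalized
            continue
        done.add(u)                  # dist[u] is now δ(s, u)
        for v, w in adj[u]:
            nd = d + w
            if nd < dist.get(v, float('inf')):   # relax edge (u, v)
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist

adj = {'s': [('a', 1), ('b', 4)], 'a': [('b', 2)], 'b': []}
print(dijkstra(adj, 's'))  # {'s': 0, 'a': 1, 'b': 3}
```

Each relaxation may push a new heap entry, giving the O((n + m) log n) bound quoted above.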

Proof Sketch

By induction on the number of finalized vertices. Base case: dist(s) = 0 is correct. Inductive step: suppose all vertices in S have correct distances, and let u be the next vertex extracted. Assume for contradiction that dist(u) > δ(s, u). The true shortest path from s to u must leave S at some edge (x, y) with x ∈ S and y ∉ S. Since x is finalized and the edge (x, y) has been relaxed, dist(y) = δ(s, y). Because all edge weights are non-negative, the prefix of the path ending at y is no longer than the whole path, so δ(s, y) ≤ δ(s, u) < dist(u). Then dist(y) < dist(u), contradicting the choice of u as the vertex with minimum tentative distance. Non-negative weights are essential: with negative edges, a vertex farther in the queue could lead to a shorter total path.

Why It Matters

Dijkstra's algorithm is the standard for single-source shortest paths with non-negative weights. In ML, shortest-path computations appear in knowledge graph reasoning (finding the closest entity), graph-based semi-supervised learning (label propagation weighted by distance), and network analysis.

Failure Mode

Dijkstra fails with negative edge weights. Consider: s → a with weight 1, s → b with weight 3, and b → a with weight −4. Dijkstra finalizes a with distance 1, but the true shortest path s → b → a has distance 3 + (−4) = −1. For negative weights (without negative cycles), use Bellman-Ford (O(nm)).
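For that counterexample, Bellman-Ford recovers the correct answer because it relaxes every edge n − 1 times instead of finalizing vertices greedily. A minimal sketch (the edge-list representation is an illustrative choice):

```python
def bellman_ford(edges, nodes, s):
    """Bellman-Ford: handles negative weights (assuming no negative cycles).
    edges: list of (u, v, w). Relax every edge n - 1 times; O(nm)."""
    dist = {u: float('inf') for u in nodes}
    dist[s] = 0
    for _ in range(len(nodes) - 1):
        for u, v, w in edges:
            if dist[u] + w < dist[v]:    # relax edge (u, v)
                dist[v] = dist[u] + w
    return dist

# The counterexample from the text: s->a (1), s->b (3), b->a (-4).
edges = [('s', 'a', 1), ('s', 'b', 3), ('b', 'a', -4)]
print(bellman_ford(edges, ['s', 'a', 'b'], 's'))  # {'s': 0, 'a': -1, 'b': 3}
```

Unlike Dijkstra, no vertex is ever "finalized" early, so the later improvement of a via b is still applied.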

Minimum Spanning Trees

Theorem

MST Cut Property

Statement

For any cut (S, V \ S) of the graph, the minimum-weight edge crossing the cut is in every minimum spanning tree. Formally, if e is the unique lightest edge with one endpoint in S and the other in V \ S, then e belongs to every MST.

Intuition

A cut divides the graph into two parts. Any spanning tree must have at least one edge crossing the cut (otherwise the two parts are disconnected). Replacing a heavier crossing edge with the lightest crossing edge reduces the tree's total weight, so the lightest crossing edge must be in the MST.

Proof Sketch

Let T be an MST that does not contain e = (u, v), where u ∈ S and v ∈ V \ S. Adding e to T creates a cycle. This cycle must contain another edge e′ crossing the cut (the cycle enters S and must leave it again). Since w(e) < w(e′) (by the distinct-weights assumption), T′ = T ∪ {e} \ {e′} has lower total weight than T, contradicting T being an MST.

Why It Matters

The cut property is the foundation of both Kruskal's and Prim's algorithms. It provides a greedy certificate: if an edge is the lightest across some cut, it is safe to include it. In ML, minimum spanning trees appear in hierarchical clustering (single-linkage clustering is equivalent to computing the MST and cutting the heaviest edges) and in constructing sparse graph structures for graph-based learning.

Failure Mode

If edge weights are not distinct, the lightest crossing edge may not be unique, and there may be multiple MSTs. The cut property still guarantees that some lightest crossing edge is in some MST, but not that a specific edge is in every MST.

Kruskal's algorithm: Sort edges by weight. Add each edge if it does not create a cycle (check with union-find). Time: O(m log m).

Prim's algorithm: Grow the MST from a starting vertex, always adding the cheapest edge connecting the tree to a non-tree vertex. Time: O(m log n) with a binary heap.
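Kruskal's algorithm can be sketched with a simple union-find; the edge list below is an illustrative four-vertex example:

```python
def kruskal(n, edges):
    """Kruskal's MST. edges: list of (weight, u, v), vertices 0..n-1.
    Returns (mst_edges, total_weight). Uses union-find with path halving."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    mst, total = [], 0
    for w, u, v in sorted(edges):           # lightest edges first
        ru, rv = find(u), find(v)
        if ru != rv:                        # joins two components: safe by the cut property
            parent[ru] = rv
            mst.append((u, v, w))
            total += w
    return mst, total

edges = [(1, 0, 1), (2, 1, 2), (3, 0, 2), (4, 2, 3)]
mst, total = kruskal(4, edges)
print(total)  # 7: edges of weight 1, 2, and 4; weight-3 edge closes a cycle
```

Each accepted edge is the lightest edge crossing the cut between the two components it joins, which is exactly the greedy certificate from the cut property.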

Why ML People Need Graph Algorithms

  1. Computational graphs: neural network forward/backward passes are topological sort on a DAG
  2. Knowledge graphs: entity-relation triples form directed graphs; shortest paths and connectivity queries support reasoning
  3. Dependency resolution: package managers, data pipelines, and build systems use topological sort to order tasks
  4. GNNs: message-passing GNNs generalize BFS-like neighborhood aggregation; understanding graph traversal clarifies the GNN receptive field
  5. Clustering: single-linkage hierarchical clustering is MST-based; spectral clustering uses graph Laplacian eigenvectors

Common Confusions

Watch Out

BFS finds shortest paths only in unweighted graphs

BFS gives shortest paths when all edges have equal weight (or no weight). For weighted graphs, BFS can give incorrect distances. Dijkstra handles non-negative weights. Bellman-Ford handles negative weights (without negative cycles). Using BFS on a weighted graph is a common bug.
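The bug is easy to reproduce. A minimal sketch contrasting BFS hop counts with true weighted distances on an illustrative three-node graph, where the fewest-edge path is not the lightest path:

```python
from collections import deque
import heapq

# s -> t directly costs 10; the two-edge path s -> a -> t costs 2 + 3 = 5.
adj = {'s': [('a', 2), ('t', 10)], 'a': [('t', 3)], 't': []}

def bfs_hops(adj, s):
    """BFS on a weighted adjacency list, ignoring weights: counts edges only."""
    dist, q = {s: 0}, deque([s])
    while q:
        u = q.popleft()
        for v, _ in adj[u]:          # edge weight is discarded!
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return dist

def dijkstra(adj, s):
    """Lazy-deletion Dijkstra for comparison; weights non-negative."""
    dist, heap = {s: 0}, [(0, s)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float('inf')):
            continue                 # stale entry
        for v, w in adj[u]:
            if d + w < dist.get(v, float('inf')):
                dist[v] = d + w
                heapq.heappush(heap, (d + w, v))
    return dist

print(bfs_hops(adj, 's')['t'])   # 1: BFS prefers the direct edge of weight 10
print(dijkstra(adj, 's')['t'])   # 5: the true shortest weighted distance
```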

Watch Out

Topological sort exists only for DAGs

If a directed graph has a cycle, no topological ordering exists (you cannot place all cycle members before one another). Kahn's algorithm detects this: if the algorithm terminates without visiting all vertices, the graph contains a cycle. This is how dependency systems detect circular dependencies.

Watch Out

Dijkstra is greedy, but not all greedy shortest-path approaches work

Dijkstra's greedy choice (always finalize the closest vertex) works because of non-negative weights. A different greedy strategy (e.g., always relax the heaviest edge first) would not produce correct shortest paths. The correctness depends on the specific greedy criterion matching the problem structure.

Summary

  • BFS: queue-based, O(n + m), shortest paths in unweighted graphs
  • DFS: stack/recursion-based, O(n + m), cycle detection and topological sort
  • Topological sort: linear ordering respecting all directed edges; only for DAGs; used in backpropagation and dependency resolution
  • Dijkstra: greedy shortest path with non-negative weights, O((n + m) log n); fails with negative weights
  • MST (Kruskal/Prim): greedy via cut property; used in hierarchical clustering
  • Graph algorithms are the foundation for computational graphs, GNNs, and knowledge graph reasoning

Exercises

ExerciseCore

Problem

A neural network has 5 layers computed in sequence: input, hidden1, hidden2, hidden3, output. Draw the computational graph as a DAG. Give a topological ordering. In what order does backpropagation compute gradients?

ExerciseAdvanced

Problem

Dijkstra's algorithm on a graph with n = 1000 vertices and m = 5000 edges using a binary heap takes O((n + m) log n) time. Compute the concrete operation count and compare to Bellman-Ford's O(nm) time. When would you prefer Bellman-Ford?

ExerciseResearch

Problem

In a k-layer GNN with message passing, each node aggregates information from its k-hop neighborhood. Relate this to BFS and explain what happens as k increases in terms of the "over-smoothing" problem.

References

Canonical:

  • Cormen, Leiserson, Rivest, Stein, Introduction to Algorithms (CLRS), Chapters 20-23
  • Kleinberg & Tardos, Algorithm Design, Chapters 3-4

Current:

  • Hamilton, Graph Representation Learning (2020), Chapters 1-2
  • Bronstein et al., "Geometric Deep Learning: Grids, Groups, Graphs, Geodesics, and Gauges" (2021)

Next Topics

The natural next steps from graph algorithms:

Last reviewed: April 2026
