LLM Construction
Parallel Processing Fundamentals
Data, tensor, pipeline, expert, and sequence parallelism: the five strategies for distributing model training and inference across multiple GPUs, and how frontier labs combine all of them.
Why This Matters
A single GPU cannot train a frontier language model. A 405B parameter model in FP16 requires 810 GB just for parameters, far exceeding any single GPU's memory. Even if memory were unlimited, training on trillions of tokens with one GPU would take years.
Parallelism is not optional. It is a requirement. The question is which combination of parallelism strategies minimizes training time while fitting in the available hardware. Understanding these strategies lets you read training infrastructure papers and reason about why certain model sizes and cluster configurations are chosen.
Parallelism Strategies
Data Parallelism (DP)
Replicate the entire model on $N$ devices. Split each mini-batch into $N$ shards, one per device. Each device computes gradients on its shard. Synchronize gradients across devices via AllReduce. Update the replicated model.
Requirement: the full model must fit on a single device. Data parallelism scales the effective batch size by a factor of $N$.
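A minimal sketch of a DP step, with numpy arrays standing in for devices and a mean over local gradients standing in for AllReduce (the toy objective and `grad_fn` are assumptions for illustration):

```python
import numpy as np

def data_parallel_step(batch, n_devices, grad_fn):
    """One DP step: shard the batch, compute per-device gradients,
    then average them (the AllReduce), recovering the full-batch gradient.
    The average equals the full-batch gradient when shards are equal-sized."""
    shards = np.array_split(batch, n_devices)   # one shard per device
    local = [grad_fn(s) for s in shards]        # computed independently
    return np.mean(local, axis=0)               # AllReduce as an average

# Toy objective (assumed for illustration): loss = mean((w*x - 1)^2),
# gradient taken w.r.t. w, evaluated at w = 2.
w = 2.0
grad_fn = lambda x: np.mean(2 * (w * x - 1) * x)

batch = np.linspace(0.0, 1.0, 8)   # 8 examples split evenly over 4 devices
assert np.isclose(data_parallel_step(batch, 4, grad_fn), grad_fn(batch))
```

The final assert is the whole point of DP: sharding plus AllReduce reproduces the single-device gradient exactly.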
Tensor Parallelism (TP)
Split individual layers across devices. For a linear layer $Y = XW$, partition $W$ column-wise across devices. Each device computes a slice of the output. An AllGather or ReduceScatter synchronizes the result.
Requires high-bandwidth interconnect (NVLink) because every layer incurs communication. Typically used within a single node (4-8 GPUs).
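A numerical sketch of column parallelism, with array slices standing in for the per-device weight shards (shapes here are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 16))    # activations (batch, d_in)
W = rng.standard_normal((16, 32))   # full weight matrix (d_in, d_out)

# Column parallelism over 4 "devices": each holds a column slice of W
# and computes its slice of Y = X @ W locally, with no communication.
partial = [X @ Wi for Wi in np.split(W, 4, axis=1)]

# AllGather: concatenating the output slices reconstructs the full result.
Y = np.concatenate(partial, axis=1)
assert np.allclose(Y, X @ W)
```

Each device needs the full activation $X$ but only $1/4$ of $W$; the communication is the concatenation step, which is why TP pays an AllGather per layer.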
Pipeline Parallelism (PP)
Assign different layers to different devices. Device 1 runs layers 1-10, device 2 runs layers 11-20, and so on. Split the mini-batch into micro-batches. While device 2 processes micro-batch 1, device 1 processes micro-batch 2.
The problem: pipeline bubbles. At the start and end of each batch, some devices are idle.
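A toy schedule makes the bubble countable; this sketch assumes unit time per stage per micro-batch and forward passes only:

```python
def schedule(p, m):
    """Unit-time forward schedule: stage s (0-indexed) processes
    micro-batch i in time slot s + i."""
    return {(s, i): s + i for s in range(p) for i in range(m)}

def idle_slots(p, m, stage):
    """Slots in which `stage` sits idle while the pipeline runs."""
    busy = {t for (s, i), t in schedule(p, m).items() if s == stage}
    total_slots = max(schedule(p, m).values()) + 1  # m + p - 1 slots overall
    return total_slots - len(busy)

# With 4 stages and 6 micro-batches, each stage idles for 3 of 9 slots.
print(idle_slots(4, 6, 0), idle_slots(4, 6, 3))  # 3 3
```

Every stage is busy for exactly the number of micro-batches and idle for the remaining slots, which is the bubble quantified in the theorem below.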
Expert Parallelism (EP)
In Mixture-of-Experts models, different experts reside on different devices. A router sends each token to the appropriate device. Communication is All-to-All: each device sends tokens to the device hosting the selected expert.
Scales model capacity without proportionally scaling per-token compute.
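A sketch of top-1 routing, with dictionary buckets standing in for the All-to-All exchange (the router logits are random placeholders; a real router is a learned linear layer):

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d_model, n_experts = 8, 4, 4
tokens = rng.standard_normal((n_tokens, d_model))

# Router: top-1 expert per token from (placeholder) routing logits.
logits = rng.standard_normal((n_tokens, n_experts))
expert_of = logits.argmax(axis=1)

# All-to-All: each device receives exactly the tokens routed to its expert.
buckets = {e: tokens[expert_of == e] for e in range(n_experts)}
assert sum(len(b) for b in buckets.values()) == n_tokens
```

Every token lands in exactly one bucket, so per-token compute stays constant no matter how many experts (devices) are added.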
Sequence Parallelism (SP)
Split long sequences across devices. Each device processes a contiguous chunk of the sequence. For attention, this requires communicating key-value pairs between devices. Ring attention is one implementation: devices pass KV blocks in a ring, computing partial attention scores at each step.
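A simplified sketch: split the query rows across "devices" and let each compute attention for its rows against the full K and V. This gathers all KV up front, which is exactly the memory cost ring attention avoids by streaming KV blocks around the ring instead; the output equivalence is the same either way:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
L, d, n_dev = 8, 4, 2
Q, K, V = (rng.standard_normal((L, d)) for _ in range(3))

full = softmax(Q @ K.T / np.sqrt(d)) @ V

# Each "device" owns a contiguous chunk of query rows and computes
# attention for those rows only.
chunks = [softmax(Qc @ K.T / np.sqrt(d)) @ V for Qc in np.split(Q, n_dev)]
assert np.allclose(np.vstack(chunks), full)
```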
Main Theorems
Ring AllReduce Communication Cost
Statement
Ring AllReduce synchronizes $N$ copies of a vector of $M$ bytes, across links of bandwidth $B$ bytes/second, in time:
$$T \approx \frac{2M(N-1)}{NB}$$
This is $\approx 2M/B$ for large $N$, independent of the number of devices. Each device sends and receives $2M(N-1)/N$ bytes total.
Intuition
The ring algorithm proceeds in two phases, each with $N-1$ steps. In each step, every device sends a chunk of size $M/N$ to its neighbor. After the first phase (reduce-scatter), each device holds the sum of one chunk. After the second phase (all-gather), every device holds the full result. Total bytes per device is $2(N-1) \cdot M/N = 2M(N-1)/N$.
Proof Sketch
In the reduce-scatter phase, $N-1$ rounds send $M/N$ bytes each. In the all-gather phase, another $N-1$ rounds send $M/N$ bytes each. Total data transmitted per device: $2M(N-1)/N \approx 2M$. Latency: $2(N-1)$ rounds, each taking at least $\alpha$ seconds. This is bandwidth-optimal: the total data that must cross any bisection is $\Omega(M)$.
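The two phases can be checked with a small simulation (a sketch only; production collectives such as NCCL's pipeline and overlap these steps):

```python
import numpy as np

def ring_allreduce(vectors):
    """Simulate ring AllReduce over equal-length vectors, one per device:
    a reduce-scatter phase then an all-gather phase, each N-1 steps,
    each step moving one chunk of M/N bytes per device."""
    n = len(vectors)
    chunks = [list(np.array_split(v.astype(float), n)) for v in vectors]
    # Reduce-scatter: in step t, device d sends chunk (d - t) % n to its
    # right neighbor, which adds it to its own copy of that chunk.
    for t in range(n - 1):
        sent = [chunks[d][(d - t) % n].copy() for d in range(n)]
        for d in range(n):
            chunks[(d + 1) % n][(d - t) % n] += sent[d]
    # Device d now holds the fully reduced chunk (d + 1) % n.
    # All-gather: circulate reduced chunks until every device has all of them.
    for t in range(n - 1):
        sent = [chunks[d][(d + 1 - t) % n].copy() for d in range(n)]
        for d in range(n):
            chunks[(d + 1) % n][(d + 1 - t) % n] = sent[d]
    return [np.concatenate(c) for c in chunks]

rng = np.random.default_rng(0)
vecs = [rng.standard_normal(12) for _ in range(4)]
out = ring_allreduce(vecs)
assert all(np.allclose(o, np.sum(vecs, axis=0)) for o in out)
```

Counting the `sent` lists confirms the proof sketch: each device transmits $2(N-1)$ chunks of size $M/N$.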
Why It Matters
Ring AllReduce is the standard gradient synchronization primitive in data parallelism. Its key property is that communication cost per device is $\approx 2M$ bytes, independent of $N$. Doubling the number of GPUs does not double the communication overhead. This enables near-linear scaling of data parallelism in practice.
Failure Mode
The result assumes the ring bandwidth is constant. In practice, inter-node bandwidth (InfiniBand at 400 Gb/s) is much lower than intra-node bandwidth (NVLink at 900 GB/s). AllReduce across nodes is correspondingly slower. Also, latency (not just bandwidth) matters for small messages: $T = 2(N-1)\alpha + \frac{2M(N-1)}{NB}$, where $\alpha$ is the per-message latency.
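The full cost model with the latency term, evaluated on illustrative (assumed, not measured) numbers:

```python
def allreduce_time(M, N, bandwidth, alpha):
    """Ring AllReduce time model: 2(N-1) latency-bound rounds plus
    2M(N-1)/N bytes through each link (bandwidth in bytes/second)."""
    return 2 * (N - 1) * alpha + 2 * M * (N - 1) / (N * bandwidth)

# Assumed numbers: 1 GB of gradients, 8 devices, 50 GB/s links
# (400 Gb/s InfiniBand), 5 microsecond per-message latency.
print(allreduce_time(M=1e9, N=8, bandwidth=50e9, alpha=5e-6))  # ~0.035 s
```

At this message size the bandwidth term dominates; the latency term only matters when $M/N$ shrinks toward the per-message overhead.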
Pipeline Bubble Fraction
Statement
In a pipeline with $p$ stages and $m$ micro-batches (using the 1F1B schedule), the bubble fraction (fraction of time GPUs are idle) is:
$$\text{bubble} = \frac{p-1}{m+p-1}$$
To keep bubble overhead below $\epsilon$, use $m \geq \frac{(p-1)(1-\epsilon)}{\epsilon}$, i.e. roughly $m \gtrsim (p-1)/\epsilon$.
Intuition
The first stage starts immediately, but the last stage must wait $p-1$ time slots for the first micro-batch to propagate through the earlier stages. Similarly, the last micro-batch must drain through $p-1$ stages after the first stage finishes. Each stage is therefore busy for only $m$ of the $m+p-1$ time slots.
Proof Sketch
Total time with $m$ micro-batches across $p$ stages is $(m+p-1)t$, where $t$ is the time per micro-batch per stage. Useful computation is $mt$ per stage. Idle time per stage is $(p-1)t$. Bubble fraction: $\frac{(p-1)t}{(m+p-1)t} = \frac{p-1}{m+p-1}$.
Why It Matters
This formula tells you the minimum number of micro-batches needed for efficient pipeline parallelism. With 8 stages and a 5% target bubble overhead, you need $m \geq (8-1)(1-0.05)/0.05 = 133$ micro-batches. This constrains the minimum effective batch size: pipeline parallelism imposes a batch size floor.
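The formula and the resulting micro-batch floor, as a small calculation:

```python
import math

def bubble_fraction(p, m):
    """Fraction of idle time with p pipeline stages and m micro-batches."""
    return (p - 1) / (m + p - 1)

def min_microbatches(p, eps):
    """Smallest m with bubble fraction <= eps: m >= (p-1)(1-eps)/eps."""
    return math.ceil((p - 1) * (1 - eps) / eps)

print(min_microbatches(8, 0.05))            # 133, matching the text
print(bubble_fraction(8, 133) <= 0.05)      # True
```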
Failure Mode
The formula assumes equal computation time per stage. In practice, stages may be unbalanced (e.g., embedding layers are cheaper than transformer blocks). Unbalanced stages increase the effective bubble. The formula also ignores communication time between stages.
How Strategies Compose
Frontier training typically uses a 3D parallelism configuration:
- TP within a node: 8 GPUs connected by NVLink share each layer
- PP across groups of nodes: layers distributed across pipeline stages
- DP across remaining nodes: replicate the pipeline, split data
The total GPU count is $N_{\text{GPUs}} = \text{TP} \times \text{PP} \times \text{DP}$.
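The constraint is simply that the three degrees multiply to the cluster size; a sketch (the cluster numbers are illustrative):

```python
def dp_degree(total_gpus, tp, pp):
    """DP replicas remaining once TP and PP degrees are fixed:
    total_gpus = TP * PP * DP must hold exactly."""
    assert total_gpus % (tp * pp) == 0, "TP * PP must divide the GPU count"
    return total_gpus // (tp * pp)

# e.g. 1024 GPUs with TP=8 (one NVLink node) and PP=16 stages leaves DP=8.
print(dp_degree(1024, 8, 16))  # 8
```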
For MoE models, add expert parallelism: experts are distributed across the TP group, and an All-to-All routes tokens to the correct expert.
For long-context models, add sequence parallelism to avoid the quadratic activation memory cost of attention on a single device.
Common Confusions
Data parallelism does not increase model size
DP replicates the model. Every GPU holds the entire model. To train a model that does not fit on one GPU, you need tensor or pipeline parallelism. DP increases throughput by processing more data per step, not by enabling larger models.
Tensor parallelism is not the same as model parallelism
Model parallelism is an umbrella term covering both tensor parallelism (splitting within layers) and pipeline parallelism (splitting across layers). These have very different communication patterns. TP requires synchronization at every layer forward and backward pass. PP only communicates activations between adjacent stages at micro-batch boundaries.
ZeRO is not a parallelism strategy
ZeRO (Zero Redundancy Optimizer) partitions optimizer states, gradients, and parameters across DP replicas. It is a memory optimization for data parallelism, not a separate parallelism dimension. ZeRO-3 (partitioning parameters) makes DP look like model parallelism from a memory perspective, but the communication pattern is different.
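The standard per-parameter accounting from the ZeRO paper (2 bytes fp16 weights + 2 bytes fp16 gradients + 12 bytes fp32 master weights and Adam moments) makes the stages concrete; a sketch that ignores activations:

```python
def zero_memory_gb(params, n_dp, stage):
    """Per-GPU memory (GB) for mixed-precision Adam under ZeRO stages 0-3,
    using the 2 + 2 + 12 bytes/param accounting from the ZeRO paper.
    Stage 1 shards optimizer state, 2 adds gradients, 3 adds parameters."""
    p, g, o = 2 * params, 2 * params, 12 * params
    if stage >= 1:
        o /= n_dp
    if stage >= 2:
        g /= n_dp
    if stage >= 3:
        p /= n_dp
    return (p + g + o) / 1e9

# 7B parameters on 64 DP replicas: plain DP replicates all 112 GB per GPU;
# ZeRO-3 shards everything down to 1.75 GB (before activations).
print(zero_memory_gb(7e9, 64, 0), zero_memory_gb(7e9, 64, 3))
```

Note that even under ZeRO-3 the communication pattern remains data-parallel (parameters are gathered just in time for each layer), which is the distinction the paragraph above draws.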
Summary
- DP: replicate model, split data, AllReduce gradients. Cost $\approx 2M$ bytes per device, independent of $N$
- TP: split layers, requires NVLink. Used within nodes (2-8 GPUs)
- PP: split layers across stages; pipeline bubbles cost $(p-1)/(m+p-1)$
- EP: route tokens to experts on different devices (MoE only)
- SP: split sequences, needed for long context
- Frontier training uses 3D parallelism: TP x PP x DP
Exercises
Problem
You have 256 GPUs, each with 80 GB memory. Your model has 70B parameters (140 GB in FP16). What is the minimum tensor parallelism degree needed? If you use TP=2 and PP=4, how many DP replicas can you have?
Problem
You use pipeline parallelism with $p$ stages. What is the minimum number of micro-batches $m$, as a function of $p$, to keep bubble overhead below 10%? Below 1%?
References
Canonical:
- Shoeybi et al., "Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism" (2020)
- Narayanan et al., "Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM" (2021)
Current:
- DeepSeek-AI, "DeepSeek-V3 Technical Report" (2024), Section on training infrastructure
Next Topics
- NVIDIA GPU architectures: the hardware these parallelism strategies run on
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Distributed Training Theory (Layer 5)
- Optimizer Theory: SGD, Adam, and Muon (Layer 3)
- Convex Optimization Basics (Layer 1)
- Differentiation in $\mathbb{R}^n$ (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
- Matrix Operations and Properties (Layer 0A)
- Adam Optimizer (Layer 2)
- Gradient Descent Variants (Layer 1)
- Stochastic Gradient Descent Convergence (Layer 2)
- Concentration Inequalities (Layer 1)
- Common Probability Distributions (Layer 0A)
- Expectation, Variance, Covariance, and Moments (Layer 0A)