Beta. Content is under active construction and has not been peer-reviewed. Report errors on GitHub.

LLM Construction

Parallel Processing Fundamentals

Data, tensor, pipeline, expert, and sequence parallelism: the five strategies for distributing model training and inference across multiple GPUs, and how frontier labs combine all of them.

Advanced · Tier 2 · ~50 min

Why This Matters

A single GPU cannot train a frontier language model. A 405B parameter model in FP16 requires 810 GB just for parameters, far exceeding any single GPU's memory. Even if memory were unlimited, training on trillions of tokens with one GPU would take years.

Parallelism is not optional. It is a requirement. The question is which combination of parallelism strategies minimizes training time while fitting in the available hardware. Understanding these strategies lets you read training infrastructure papers and reason about why certain model sizes and cluster configurations are chosen.

Parallelism Strategies

Definition

Data Parallelism (DP)

Replicate the entire model on N devices. Split each mini-batch into N micro-batches, one per device. Each device computes gradients on its shard. Synchronize gradients across devices via AllReduce. Update the replicated model.

Requirement: the full model must fit on a single device. Data parallelism scales the effective batch size by a factor of N.
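The key invariant, that averaging per-shard gradients reproduces the single-device gradient on the full batch, can be checked in a few lines of NumPy. A toy linear model stands in for the network, and the `local_gradient` helper is illustrative, not a real framework API:

```python
import numpy as np

def local_gradient(w, x, y):
    """Gradient of the mean squared error of a linear model on one batch shard."""
    return x.T @ (x @ w - y) / len(x)

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 2))              # model weights, replicated on every device
x, y = rng.normal(size=(8, 4)), rng.normal(size=(8, 2))

full_grad = local_gradient(w, x, y)      # single-device gradient on the whole batch

N = 4                                    # "devices": equal shards of the mini-batch
grads = [local_gradient(w, xs, ys)
         for xs, ys in zip(np.split(x, N), np.split(y, N))]
allreduced = sum(grads) / N              # what AllReduce-then-average computes

assert np.allclose(allreduced, full_grad)
```

Because the shards are equal-sized, the average of the shard gradients equals the full-batch gradient exactly, which is why every replica can apply the same update and stay in sync.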

Definition

Tensor Parallelism (TP)

Split individual layers across devices. For a linear layer Y = XW, partition W column-wise across N devices. Each device computes a slice of the output. An AllGather or ReduceScatter synchronizes the result.

Requires high-bandwidth interconnect (NVLink) because every layer incurs communication. Typically used within a single node (4-8 GPUs).
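The column-wise split can be verified numerically; a minimal NumPy sketch in which a concatenation stands in for the AllGather:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 8))              # activations, replicated on each device
W = rng.normal(size=(8, 16))             # full weight matrix of the linear layer

N = 4
W_shards = np.split(W, N, axis=1)        # column-wise: device i stores an 8x4 slice
Y_slices = [X @ Wi for Wi in W_shards]   # purely local matmuls, no communication
Y = np.concatenate(Y_slices, axis=1)     # AllGather of the output slices

assert np.allclose(Y, X @ W)
```

Each device does 1/N of the FLOPs and stores 1/N of W; the price is the AllGather at every layer, which is why TP stays inside the NVLink domain.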

Definition

Pipeline Parallelism (PP)

Assign different layers to different devices. Device 1 runs layers 1-10, device 2 runs layers 11-20, and so on. Split the mini-batch into micro-batches. While device 2 processes micro-batch 1, device 1 processes micro-batch 2.

The problem: pipeline bubbles. At the start and end of each batch, some devices are idle.
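The bubble is easiest to see on a timeline. A small sketch of the forward schedule (GPipe-style, one time slot per stage per micro-batch; dots mark idle slots):

```python
P, M = 3, 4                       # pipeline stages, micro-batches
width = M + P - 1                 # total time slots from first start to last finish
rows = []
for s in range(P):
    row = ["."] * width
    for m in range(M):
        row[s + m] = str(m)       # stage s runs micro-batch m in slot s + m
    rows.append(" ".join(row))
    print(f"stage {s}: {rows[-1]}")
# stage 0: 0 1 2 3 . .
# stage 1: . 0 1 2 3 .
# stage 2: . . 0 1 2 3
```

The triangles of dots at the start (fill) and end (drain) are the bubble; each stage is idle for P - 1 of the M + P - 1 slots.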

Definition

Expert Parallelism (EP)

In Mixture-of-Experts models, different experts reside on different devices. A router sends each token to the appropriate device. Communication is All-to-All: each device sends tokens to the device hosting the selected expert.

Scales model capacity without proportionally scaling per-token compute.
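Top-1 routing and the two All-to-All exchanges can be sketched with NumPy. The argmax router and per-expert buffers are illustrative simplifications; real MoE layers use top-k routing with load balancing:

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d_model, n_experts = 12, 4, 4
tokens = rng.normal(size=(n_tokens, d_model))
router_logits = rng.normal(size=(n_tokens, n_experts))
assignment = router_logits.argmax(axis=1)          # top-1 expert per token

# First All-to-All: device e (hosting expert e) receives the tokens routed to it.
expert_inputs = [tokens[assignment == e] for e in range(n_experts)]

# Each token is sent to exactly one device, so the shards partition the batch.
assert sum(len(x) for x in expert_inputs) == n_tokens

# Second All-to-All: expert outputs return to the original token order.
outputs = np.empty_like(tokens)
for e, x in enumerate(expert_inputs):
    outputs[assignment == e] = x + e               # stand-in for expert e's FFN
```

The shard sizes depend on the router, which is why MoE systems care so much about load balance: a popular expert's device does more work and receives more All-to-All traffic.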

Definition

Sequence Parallelism (SP)

Split long sequences across devices. Each device processes a contiguous chunk of the sequence. For attention, this requires communicating key-value pairs between devices. Ring attention is one implementation: devices pass KV blocks in a ring, computing partial attention scores at each step.
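A minimal simulation of the ring pattern, assuming one contiguous chunk of Q, K, V per device and non-causal attention. For clarity it exponentiates raw scores and accumulates a numerator and denominator; production ring attention instead rescales with a numerically stable online softmax:

```python
import numpy as np

def attention(q, k, v):
    """Reference full (non-causal) attention on one head."""
    s = q @ k.T / np.sqrt(q.shape[-1])
    p = np.exp(s - s.max(axis=-1, keepdims=True))
    return (p / p.sum(axis=-1, keepdims=True)) @ v

rng = np.random.default_rng(0)
L, d, N = 16, 8, 4                        # sequence length, head dim, devices
q, k, v = (rng.normal(size=(L, d)) for _ in range(3))

# Each device owns one contiguous chunk of Q, K, and V.
qs, ks, vs = (np.split(t, N) for t in (q, k, v))

out = []
for i in range(N):                        # device i's computation
    num = np.zeros((L // N, d))           # running numerator of softmax(QK^T)V
    den = np.zeros((L // N, 1))           # running softmax denominator
    for step in range(N):
        j = (i + step) % N                # KV block arriving at this ring step
        s = np.exp(qs[i] @ ks[j].T / np.sqrt(d))
        num += s @ vs[j]                  # partial attention against this block
        den += s.sum(axis=-1, keepdims=True)
    out.append(num / den)

# The chunked ring computation matches full attention exactly.
assert np.allclose(np.concatenate(out), attention(q, k, v))
```

No device ever materializes the full L x L score matrix; each holds only an (L/N) x (L/N) block at a time, which is the point of sequence parallelism.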

Main Theorems

Proposition

Ring AllReduce Communication Cost

Statement

Ring AllReduce synchronizes N copies of a vector of size M bytes, over links of bandwidth B, in time:

T_{\text{AllReduce}} = 2(N-1) \cdot \frac{M}{N \cdot B}

This is \approx 2M/B for large N, independent of the number of devices. Each device sends and receives 2(N-1)M/N bytes total.

Intuition

The ring algorithm proceeds in two phases, each with N-1 steps. In each step, every device sends a chunk of size M/N to its neighbor. After the first phase (reduce-scatter), each device holds the full sum of one chunk. After the second phase (all-gather), every device holds the full result. Total bytes sent per device: 2(N-1)M/N.

Proof Sketch

In the reduce-scatter phase, N-1 rounds send M/N bytes each. In the all-gather phase, another N-1 rounds send M/N bytes each. Total data transmitted per device: 2(N-1)M/N. Latency: 2(N-1) rounds, each taking M/(NB) seconds. This is bandwidth-optimal: the total data that must cross any bisection is Ω(M).
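The two phases can be simulated directly, counting the traffic each device generates. A sketch in which sequential loops stand in for the parallel sends of a real ring:

```python
import numpy as np

def ring_allreduce(vectors):
    """Simulate ring AllReduce over N devices; return the per-device results
    and the number of values each device sent."""
    N = len(vectors)
    chunks = [np.array_split(v.astype(float), N) for v in vectors]
    sent = [0] * N
    # Phase 1 (reduce-scatter): in round r, device i adds its chunk (i - r)
    # into its right neighbor's copy. After N-1 rounds, device i holds the
    # full sum of chunk (i + 1) % N.
    for r in range(N - 1):
        for i in range(N):
            c = (i - r) % N
            chunks[(i + 1) % N][c] += chunks[i][c]
            sent[i] += chunks[i][c].size
    # Phase 2 (all-gather): each device forwards the completed chunk it just
    # received, so after N-1 more rounds everyone holds every summed chunk.
    for r in range(N - 1):
        for i in range(N):
            c = (i + 1 - r) % N
            chunks[(i + 1) % N][c] = chunks[i][c].copy()
            sent[i] += chunks[i][c].size
    return [np.concatenate(ch) for ch in chunks], sent

rng = np.random.default_rng(0)
N, M = 4, 32                                  # devices, floats per gradient vector
grads = [rng.normal(size=M) for _ in range(N)]
results, sent = ring_allreduce(grads)

assert all(np.allclose(r, sum(grads)) for r in results)   # everyone has the sum
assert all(s == 2 * (N - 1) * M // N for s in sent)       # 2(N-1)M/N per device
```

The byte count matches the proposition: per-device traffic approaches 2M as N grows, never exceeding it.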

Why It Matters

Ring AllReduce is the standard gradient synchronization primitive in data parallelism. Its key property is that communication cost scales as O(M), independent of N. Doubling the number of GPUs does not double the communication overhead. This enables near-linear scaling of data parallelism in practice.

Failure Mode

The O(M) result assumes the ring bandwidth B is constant. In practice, inter-node bandwidth (InfiniBand at 400 Gb/s) is much lower than intra-node bandwidth (NVLink at 900 GB/s), so an AllReduce that crosses node boundaries is bottlenecked by the slowest link. Also, latency (not just bandwidth) matters for small messages: T = 2(N-1)(α + M/(NB)), where α is the per-message latency.

Proposition

Pipeline Bubble Fraction

Statement

In a pipeline with P stages and M micro-batches (using the 1F1B schedule), the bubble fraction (fraction of time GPUs are idle) is:

\text{bubble} = \frac{P - 1}{M + P - 1}

To keep bubble overhead below ε, use M ≥ (P-1)(1/ε - 1).

Intuition

The first stage starts immediately, but the last stage must wait for P-1 micro-batches to propagate through the earlier stages. Symmetrically, the last micro-batch must drain through P-1 stages after the first stage finishes. The pipeline is fully utilized only during the middle M-P+1 time slots.

Proof Sketch

With per-stage time t per micro-batch, the total time for M micro-batches across P stages is (M+P-1)·t. Useful computation is M·t per stage, so idle time per stage is (P-1)·t. Bubble fraction: (P-1)/(M+P-1).

Why It Matters

This formula tells you the minimum number of micro-batches needed for efficient pipeline parallelism. With 8 stages and a 5% target bubble overhead, you need M ≥ 7 × 19 = 133 micro-batches. This constrains the minimum effective batch size: pipeline parallelism imposes a batch size floor.
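A small check of the bound, treating stage times as equal per the proposition:

```python
import math

def bubble_fraction(P, M):
    """Idle fraction of a P-stage pipeline running M equal micro-batches."""
    return (P - 1) / (M + P - 1)

def min_microbatches(P, eps):
    """Smallest M with bubble_fraction(P, M) <= eps."""
    return math.ceil((P - 1) * (1 / eps - 1))

M = min_microbatches(8, 0.05)
assert M == 133                 # 7 x 19: 8 stages, 5% target overhead
assert bubble_fraction(8, M) <= 0.05 < bubble_fraction(8, M - 1)
```

The second assertion confirms that 133 is tight: one fewer micro-batch already exceeds the 5% budget.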

Failure Mode

The formula assumes equal computation time per stage. In practice, stages may be unbalanced (e.g., embedding layers are cheaper than transformer blocks). Unbalanced stages increase the effective bubble. The formula also ignores communication time between stages.

How Strategies Compose

Frontier training typically uses a 3D parallelism configuration:

  1. TP within a node: 8 GPUs connected by NVLink share each layer
  2. PP across groups of nodes: layers distributed across pipeline stages
  3. DP across remaining nodes: replicate the pipeline, split data

The total GPU count is N_total = N_TP × N_PP × N_DP.
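One common convention, used by Megatron-style launchers though the exact axis ordering varies by framework, makes TP the fastest-varying dimension so that each TP group occupies consecutive ranks on one NVLink-connected node. A sketch of the rank decomposition:

```python
def rank_to_coords(rank, tp, pp, dp):
    """Decompose a global rank into (tp, pp, dp) coordinates, with TP the
    fastest-varying axis so each TP group is a block of consecutive ranks."""
    assert 0 <= rank < tp * pp * dp
    return (rank % tp, rank // tp % pp, rank // (tp * pp))

# Example: 256 GPUs as TP=8 (within a node) x PP=4 x DP=8.
tp, pp, dp = 8, 4, 8
coords = [rank_to_coords(r, tp, pp, dp) for r in range(tp * pp * dp)]
assert len(set(coords)) == tp * pp * dp               # every coordinate exactly once
assert coords[:8] == [(t, 0, 0) for t in range(8)]    # ranks 0-7 form one TP group
```

Placing the chattiest dimension (TP) on the fastest interconnect, and the least chatty (DP, one AllReduce per step) across nodes, is the core design rule of 3D parallelism.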

For MoE models, add expert parallelism: experts are distributed across the TP group, and an All-to-All routes tokens to the correct expert.

For long-context models, add sequence parallelism to avoid the O(L²) memory cost of attention on a single device.

Common Confusions

Watch Out

Data parallelism does not increase model size

DP replicates the model. Every GPU holds the entire model. To train a model that does not fit on one GPU, you need tensor or pipeline parallelism. DP increases throughput by processing more data per step, not by enabling larger models.

Watch Out

Tensor parallelism is not the same as model parallelism

Model parallelism is an umbrella term covering both tensor parallelism (splitting within layers) and pipeline parallelism (splitting across layers). These have very different communication patterns. TP requires synchronization at every layer forward and backward pass. PP only communicates activations between adjacent stages at micro-batch boundaries.

Watch Out

ZeRO is not a parallelism strategy

ZeRO (Zero Redundancy Optimizer) partitions optimizer states, gradients, and parameters across DP replicas. It is a memory optimization for data parallelism, not a separate parallelism dimension. ZeRO-3 (partitioning parameters) makes DP look like model parallelism from a memory perspective, but the communication pattern is different.
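The ZeRO paper's memory accounting makes the distinction concrete. A sketch using its standard mixed-precision byte counts per parameter (2 for fp16 weights, 2 for fp16 gradients, 12 for fp32 master weights plus Adam moments); activation memory is ignored here:

```python
def zero_memory_gb(params_billion, dp, stage):
    """Model-state memory per GPU in GB for a given ZeRO stage (0 = plain DP).

    Stage 1 partitions optimizer states across the dp replicas, stage 2 also
    partitions gradients, and stage 3 also partitions the parameters.
    """
    weights, grads, opt = 2.0, 2.0, 12.0
    if stage >= 1:
        opt /= dp
    if stage >= 2:
        grads /= dp
    if stage >= 3:
        weights /= dp
    return params_billion * (weights + grads + opt)

# A 7B model with 64-way DP: plain DP needs 112 GB of model state per GPU,
# while ZeRO-3 shards everything down to 1.75 GB per GPU.
assert zero_memory_gb(7, 64, 0) == 112.0
assert zero_memory_gb(7, 64, 3) == 112.0 / 64
```

Even stage 1 alone removes most of the footprint, because the fp32 optimizer states dominate the 16 bytes per parameter.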

Summary

  • DP: replicate model, split data, AllReduce gradients. Communication cost independent of N
  • TP: split within layers, requires NVLink. Used within nodes (2-8 GPUs)
  • PP: split layers across stages; pipeline bubbles cost (P-1)/(M+P-1)
  • EP: route tokens to experts on different devices (MoE only)
  • SP: split sequences, needed for long context
  • Frontier training uses 3D parallelism: TP × PP × DP

Exercises

ExerciseCore

Problem

You have 256 GPUs, each with 80 GB memory. Your model has 70B parameters (140 GB in FP16). What is the minimum tensor parallelism degree needed? If you use TP=2 and PP=4, how many DP replicas can you have?

ExerciseAdvanced

Problem

You use pipeline parallelism with P = 16 stages. What is the minimum number of micro-batches to keep bubble overhead below 10%? Below 1%?

References

Canonical:

  • Shoeybi et al., "Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism" (2020)
  • Narayanan et al., "Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM" (2021)

Current:

  • DeepSeek-AI, "DeepSeek-V3 Technical Report" (2024), Section on training infrastructure


Last reviewed: April 2026
