LLM Construction
Parallel Processing Fundamentals
Data, tensor, pipeline, expert, and sequence parallelism: the five strategies for distributing model training and inference across multiple GPUs, and how frontier labs combine all of them.
Why This Matters
A single GPU cannot train a frontier language model. A 405B parameter model in FP16 requires 810 GB just for parameters, far exceeding any single GPU's memory. Even if memory were unlimited, training on trillions of tokens with one GPU would take years.
Parallelism is not optional. It is a requirement. The question is which combination of parallelism strategies minimizes training time while fitting in the available hardware. Understanding these strategies lets you read training infrastructure papers and reason about why certain model sizes and cluster configurations are chosen.
Parallelism Strategies
Data Parallelism (DP)
Replicate the entire model on $N$ devices. Split each mini-batch into $N$ shards, one per device. Each device computes gradients on its shard. Synchronize gradients across devices via AllReduce. Update the replicated model.
Requirement: the full model must fit on a single device. Data parallelism scales the effective batch size by a factor of $N$.
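A minimal sketch of a DP step, with numpy arrays standing in for devices and a mean over local gradients standing in for AllReduce (the toy objective and `grad_fn` are assumptions for illustration):

```python
import numpy as np

def data_parallel_step(batch, n_devices, grad_fn):
    """One DP step: shard the batch, compute per-device gradients,
    then average them (the AllReduce), recovering the full-batch gradient.
    The average equals the full-batch gradient when shards are equal-sized."""
    shards = np.array_split(batch, n_devices)   # one shard per device
    local = [grad_fn(s) for s in shards]        # computed independently
    return np.mean(local, axis=0)               # AllReduce as an average

# Toy objective (assumed for illustration): loss = mean((w*x - 1)^2),
# gradient taken w.r.t. w, evaluated at w = 2.
w = 2.0
grad_fn = lambda x: np.mean(2 * (w * x - 1) * x)

batch = np.linspace(0.0, 1.0, 8)   # 8 examples split evenly over 4 devices
assert np.isclose(data_parallel_step(batch, 4, grad_fn), grad_fn(batch))
```

The final assert is the whole point of DP: sharding plus AllReduce reproduces the single-device gradient exactly.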
Tensor Parallelism (TP)
Split individual layers across devices. For a linear layer $Y = XW$, partition $W$ column-wise across devices. Each device computes a slice of the output. An AllGather or ReduceScatter synchronizes the result.
Requires high-bandwidth interconnect (NVLink) because every layer incurs communication. Typically used within a single node (4-8 GPUs).
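A numerical sketch of column parallelism, with array slices standing in for the per-device weight shards (shapes here are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 16))    # activations (batch, d_in)
W = rng.standard_normal((16, 32))   # full weight matrix (d_in, d_out)

# Column parallelism over 4 "devices": each holds a column slice of W
# and computes its slice of Y = X @ W locally, with no communication.
partial = [X @ Wi for Wi in np.split(W, 4, axis=1)]

# AllGather: concatenating the output slices reconstructs the full result.
Y = np.concatenate(partial, axis=1)
assert np.allclose(Y, X @ W)
```

Each device needs the full activation $X$ but only $1/4$ of $W$; the communication is the concatenation step, which is why TP pays an AllGather per layer.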
Pipeline Parallelism (PP)
Assign different layers to different devices. Device 1 runs layers 1-10, device 2 runs layers 11-20, and so on. Split the mini-batch into micro-batches. While device 2 processes micro-batch 1, device 1 processes micro-batch 2.
The problem: pipeline bubbles. At the start and end of each batch, some devices are idle.
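A toy schedule makes the bubble countable; this sketch assumes unit time per stage per micro-batch and forward passes only:

```python
def schedule(p, m):
    """Unit-time forward schedule: stage s (0-indexed) processes
    micro-batch i in time slot s + i."""
    return {(s, i): s + i for s in range(p) for i in range(m)}

def idle_slots(p, m, stage):
    """Slots in which `stage` sits idle while the pipeline runs."""
    busy = {t for (s, i), t in schedule(p, m).items() if s == stage}
    total_slots = max(schedule(p, m).values()) + 1  # m + p - 1 slots overall
    return total_slots - len(busy)

# With 4 stages and 6 micro-batches, each stage idles for 3 of 9 slots.
print(idle_slots(4, 6, 0), idle_slots(4, 6, 3))  # 3 3
```

Every stage is busy for exactly the number of micro-batches and idle for the remaining slots, which is the bubble quantified in the theorem below.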
Expert Parallelism (EP)
In Mixture-of-Experts models, different experts reside on different devices. A router sends each token to the appropriate device. Communication is All-to-All: each device sends tokens to the device hosting the selected expert.
Scales model capacity without proportionally scaling per-token compute.
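A sketch of top-1 routing, with dictionary buckets standing in for the All-to-All exchange (the router logits are random placeholders; a real router is a learned linear layer):

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d_model, n_experts = 8, 4, 4
tokens = rng.standard_normal((n_tokens, d_model))

# Router: top-1 expert per token from (placeholder) routing logits.
logits = rng.standard_normal((n_tokens, n_experts))
expert_of = logits.argmax(axis=1)

# All-to-All: each device receives exactly the tokens routed to its expert.
buckets = {e: tokens[expert_of == e] for e in range(n_experts)}
assert sum(len(b) for b in buckets.values()) == n_tokens
```

Every token lands in exactly one bucket, so per-token compute stays constant no matter how many experts (devices) are added.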
Sequence Parallelism (SP)
Split long sequences across devices. Each device processes a contiguous chunk of the sequence. For attention, this requires communicating key-value pairs between devices. Ring attention is one implementation: devices pass KV blocks in a ring, computing partial attention scores at each step.
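A simplified sketch: split the query rows across "devices" and let each compute attention for its rows against the full K and V. This gathers all KV up front, which is exactly the memory cost ring attention avoids by streaming KV blocks around the ring instead; the output equivalence is the same either way:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
L, d, n_dev = 8, 4, 2
Q, K, V = (rng.standard_normal((L, d)) for _ in range(3))

full = softmax(Q @ K.T / np.sqrt(d)) @ V

# Each "device" owns a contiguous chunk of query rows and computes
# attention for those rows only.
chunks = [softmax(Qc @ K.T / np.sqrt(d)) @ V for Qc in np.split(Q, n_dev)]
assert np.allclose(np.vstack(chunks), full)
```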
Main Theorems
Ring AllReduce Communication Cost
Statement
Ring AllReduce synchronizes $N$ copies of a vector of $M$ bytes, across links of bandwidth $B$ bytes/second, in time:
$$T \approx \frac{2M(N-1)}{NB}$$
This is $\approx 2M/B$ for large $N$, independent of the number of devices. Each device sends and receives $2M(N-1)/N$ bytes total.
Intuition
The ring algorithm proceeds in two phases, each with $N-1$ steps. In each step, every device sends a chunk of size $M/N$ to its neighbor. After the first phase (reduce-scatter), each device holds the sum of one chunk. After the second phase (all-gather), every device holds the full result. Total bytes per device is $2(N-1) \cdot M/N = 2M(N-1)/N$.
Proof Sketch
In the reduce-scatter phase, $N-1$ rounds send $M/N$ bytes each. In the all-gather phase, another $N-1$ rounds send $M/N$ bytes each. Total data transmitted per device: $2M(N-1)/N \approx 2M$. Latency: $2(N-1)$ rounds, each taking at least $\alpha$ seconds. This is bandwidth-optimal: the total data that must cross any bisection is $\Omega(M)$.
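The two phases can be checked with a small simulation (a sketch only; production collectives such as NCCL's pipeline and overlap these steps):

```python
import numpy as np

def ring_allreduce(vectors):
    """Simulate ring AllReduce over equal-length vectors, one per device:
    a reduce-scatter phase then an all-gather phase, each N-1 steps,
    each step moving one chunk of M/N bytes per device."""
    n = len(vectors)
    chunks = [list(np.array_split(v.astype(float), n)) for v in vectors]
    # Reduce-scatter: in step t, device d sends chunk (d - t) % n to its
    # right neighbor, which adds it to its own copy of that chunk.
    for t in range(n - 1):
        sent = [chunks[d][(d - t) % n].copy() for d in range(n)]
        for d in range(n):
            chunks[(d + 1) % n][(d - t) % n] += sent[d]
    # Device d now holds the fully reduced chunk (d + 1) % n.
    # All-gather: circulate reduced chunks until every device has all of them.
    for t in range(n - 1):
        sent = [chunks[d][(d + 1 - t) % n].copy() for d in range(n)]
        for d in range(n):
            chunks[(d + 1) % n][(d + 1 - t) % n] = sent[d]
    return [np.concatenate(c) for c in chunks]

rng = np.random.default_rng(0)
vecs = [rng.standard_normal(12) for _ in range(4)]
out = ring_allreduce(vecs)
assert all(np.allclose(o, np.sum(vecs, axis=0)) for o in out)
```

Counting the `sent` lists confirms the proof sketch: each device transmits $2(N-1)$ chunks of size $M/N$.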
Why It Matters
Ring AllReduce is the standard gradient synchronization primitive in data parallelism. Its key property is that communication cost per device is $\approx 2M$ bytes, independent of $N$. Doubling the number of GPUs does not double the communication overhead. This enables near-linear scaling of data parallelism in practice.
Failure Mode
The result assumes the ring bandwidth is constant. In practice, inter-node bandwidth (InfiniBand at 400 Gb/s) is much lower than intra-node bandwidth (NVLink at 900 GB/s). AllReduce across nodes is correspondingly slower. Also, latency (not just bandwidth) matters for small messages: $T = 2(N-1)\alpha + \frac{2M(N-1)}{NB}$, where $\alpha$ is the per-message latency.
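The full cost model with the latency term, evaluated on illustrative (assumed, not measured) numbers:

```python
def allreduce_time(M, N, bandwidth, alpha):
    """Ring AllReduce time model: 2(N-1) latency-bound rounds plus
    2M(N-1)/N bytes through each link (bandwidth in bytes/second)."""
    return 2 * (N - 1) * alpha + 2 * M * (N - 1) / (N * bandwidth)

# Assumed numbers: 1 GB of gradients, 8 devices, 50 GB/s links
# (400 Gb/s InfiniBand), 5 microsecond per-message latency.
print(allreduce_time(M=1e9, N=8, bandwidth=50e9, alpha=5e-6))  # ~0.035 s
```

At this message size the bandwidth term dominates; the latency term only matters when $M/N$ shrinks toward the per-message overhead.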
Pipeline Bubble Fraction
Statement
In a pipeline with $p$ stages and $m$ micro-batches (using the 1F1B schedule), the bubble fraction (fraction of time GPUs are idle) is:
$$\text{bubble} = \frac{p-1}{m+p-1}$$
To keep bubble overhead below $\epsilon$, use $m \geq \frac{(p-1)(1-\epsilon)}{\epsilon}$, i.e. roughly $m \gtrsim (p-1)/\epsilon$.
Intuition
The first stage starts immediately, but the last stage must wait $p-1$ time slots for the first micro-batch to propagate through the earlier stages. Similarly, the last micro-batch must drain through $p-1$ stages after the first stage finishes. Each stage is therefore busy for only $m$ of the $m+p-1$ time slots.
Proof Sketch
Total time with $m$ micro-batches across $p$ stages is $(m+p-1)t$, where $t$ is the time per micro-batch per stage. Useful computation is $mt$ per stage. Idle time per stage is $(p-1)t$. Bubble fraction: $\frac{(p-1)t}{(m+p-1)t} = \frac{p-1}{m+p-1}$.
Why It Matters
This formula tells you the minimum number of micro-batches needed for efficient pipeline parallelism. With 8 stages and a 5% target bubble overhead, you need $m \geq (8-1)(1-0.05)/0.05 = 133$ micro-batches. This constrains the minimum effective batch size: pipeline parallelism imposes a batch size floor.
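The formula and the resulting micro-batch floor, as a small calculation:

```python
import math

def bubble_fraction(p, m):
    """Fraction of idle time with p pipeline stages and m micro-batches."""
    return (p - 1) / (m + p - 1)

def min_microbatches(p, eps):
    """Smallest m with bubble fraction <= eps: m >= (p-1)(1-eps)/eps."""
    return math.ceil((p - 1) * (1 - eps) / eps)

print(min_microbatches(8, 0.05))            # 133, matching the text
print(bubble_fraction(8, 133) <= 0.05)      # True
```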
Failure Mode
The formula assumes equal computation time per stage. In practice, stages may be unbalanced (e.g., embedding layers are cheaper than transformer blocks). Unbalanced stages increase the effective bubble. The formula also ignores communication time between stages.
How Strategies Compose
Frontier training typically uses a 3D parallelism configuration:
- TP within a node: 8 GPUs connected by NVLink share each layer
- PP across groups of nodes: layers distributed across pipeline stages
- DP across remaining nodes: replicate the pipeline, split data
The total GPU count is $N_{\text{GPUs}} = \text{TP} \times \text{PP} \times \text{DP}$.
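The constraint is simply that the three degrees multiply to the cluster size; a sketch (the cluster numbers are illustrative):

```python
def dp_degree(total_gpus, tp, pp):
    """DP replicas remaining once TP and PP degrees are fixed:
    total_gpus = TP * PP * DP must hold exactly."""
    assert total_gpus % (tp * pp) == 0, "TP * PP must divide the GPU count"
    return total_gpus // (tp * pp)

# e.g. 1024 GPUs with TP=8 (one NVLink node) and PP=16 stages leaves DP=8.
print(dp_degree(1024, 8, 16))  # 8
```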
For MoE models, add expert parallelism: experts are distributed across the TP group, and an All-to-All routes tokens to the correct expert.
For long-context models, add sequence parallelism to avoid the quadratic activation memory cost of attention on a single device.
Common Confusions
Data parallelism does not increase model size
DP replicates the model. Every GPU holds the entire model. To train a model that does not fit on one GPU, you need tensor or pipeline parallelism. DP increases throughput by processing more data per step, not by enabling larger models.
Tensor parallelism is not the same as model parallelism
Model parallelism is an umbrella term covering both tensor parallelism (splitting within layers) and pipeline parallelism (splitting across layers). These have very different communication patterns. TP requires synchronization at every layer forward and backward pass. PP only communicates activations between adjacent stages at micro-batch boundaries.
ZeRO is not a parallelism strategy
ZeRO (Zero Redundancy Optimizer) partitions optimizer states, gradients, and parameters across DP replicas. It is a memory optimization for data parallelism, not a separate parallelism dimension. ZeRO-3 (partitioning parameters) makes DP look like model parallelism from a memory perspective, but the communication pattern is different.
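The standard per-parameter accounting from the ZeRO paper (2 bytes fp16 weights + 2 bytes fp16 gradients + 12 bytes fp32 master weights and Adam moments) makes the stages concrete; a sketch that ignores activations:

```python
def zero_memory_gb(params, n_dp, stage):
    """Per-GPU memory (GB) for mixed-precision Adam under ZeRO stages 0-3,
    using the 2 + 2 + 12 bytes/param accounting from the ZeRO paper.
    Stage 1 shards optimizer state, 2 adds gradients, 3 adds parameters."""
    p, g, o = 2 * params, 2 * params, 12 * params
    if stage >= 1:
        o /= n_dp
    if stage >= 2:
        g /= n_dp
    if stage >= 3:
        p /= n_dp
    return (p + g + o) / 1e9

# 7B parameters on 64 DP replicas: plain DP replicates all 112 GB per GPU;
# ZeRO-3 shards everything down to 1.75 GB (before activations).
print(zero_memory_gb(7e9, 64, 0), zero_memory_gb(7e9, 64, 3))
```

Note that even under ZeRO-3 the communication pattern remains data-parallel (parameters are gathered just in time for each layer), which is the distinction the paragraph above draws.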
Summary
- DP: replicate model, split data, AllReduce gradients. Cost $\approx 2M$ bytes per device, independent of $N$
- TP: split layers, requires NVLink. Used within nodes (2-8 GPUs)
- PP: split layers across stages; pipeline bubbles cost $(p-1)/(m+p-1)$
- EP: route tokens to experts on different devices (MoE only)
- SP: split sequences, needed for long context
- Frontier training uses 3D parallelism: TP x PP x DP
Exercises
Problem
You have 256 GPUs, each with 80 GB memory. Your model has 70B parameters (140 GB in FP16). What is the minimum tensor parallelism degree needed? If you use TP=2 and PP=4, how many DP replicas can you have?
Problem
You use pipeline parallelism with $p$ stages. What is the minimum number of micro-batches $m$, as a function of $p$, to keep bubble overhead below 10%? Below 1%?
References
Canonical:
- Shoeybi et al., "Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism" (2020)
- Narayanan et al., "Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM" (2021)
Current:
- DeepSeek-AI, "DeepSeek-V3 Technical Report" (2024), Section on training infrastructure
Next Topics
- NVIDIA GPU architectures: the hardware these parallelism strategies run on
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Distributed Training Theory (Layer 5)
- Optimizer Theory: SGD, Adam, and Muon (Layer 3)
- Convex Optimization Basics (Layer 1)
- Differentiation in $\mathbb{R}^n$ (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
- Matrix Operations and Properties (Layer 0A)
- Adam Optimizer (Layer 2)
- Gradient Descent Variants (Layer 1)
- Stochastic Gradient Descent Convergence (Layer 2)
- Concentration Inequalities (Layer 1)
- Common Probability Distributions (Layer 0A)
- Expectation, Variance, Covariance, and Moments (Layer 0A)