Infrastructure
Running ML Workloads on GPUs
Operational reference for running deep-learning jobs on GPUs: driver/CUDA/cuDNN matrix, monitoring tools, parallelism strategies, FSDP and ZeRO stages, profiler basics.
What It Is
Running an ML job on a GPU correctly requires a stack of compatible system software, an inter-GPU communication path, a parallelism strategy that matches the model size, and the ability to read a profiler when throughput drops below expected. This page is a reference for the moving parts; deeper treatments live on the linked theory pages.
The base stack is three layers: NVIDIA driver (kernel module, version like 535.x or 550.x), CUDA Toolkit (compiler, runtime, libraries; version like 12.4), and cuDNN (deep-learning primitives; version like 9.0). PyTorch and JAX wheels are built against a specific CUDA Toolkit minor version; mismatches surface as undefined symbol errors at import. The driver must be at least as new as the CUDA Toolkit major version requires, but newer drivers are forward-compatible with older Toolkits via the CUDA compatibility runtime.
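The forward-compatibility rule above can be sketched as a version comparison. The minimum-driver table here is illustrative (consult NVIDIA's CUDA compatibility documentation for the authoritative matrix); the point is the check itself, which only needs the driver to meet the floor for the Toolkit's major version.

```python
# Sketch of the driver/Toolkit forward-compatibility check. MIN_DRIVER values
# are illustrative Linux minimums; verify against NVIDIA's compatibility docs.
MIN_DRIVER = {11: (450, 80, 2), 12: (525, 60, 13)}

def parse_version(v: str) -> tuple:
    """'535.154.05' -> (535, 154, 5); pads missing fields with zeros."""
    parts = [int(p) for p in v.split(".")]
    return tuple(parts + [0] * (3 - len(parts)))

def driver_supports_toolkit(driver: str, toolkit: str) -> bool:
    """Newer drivers are forward-compatible with older Toolkits, so the only
    check needed is driver >= the minimum for the Toolkit's major version."""
    required = MIN_DRIVER.get(int(toolkit.split(".")[0]))
    if required is None:
        return False  # unknown Toolkit major: flag for manual review
    return parse_version(driver) >= required

print(driver_supports_toolkit("550.54.15", "12.4"))  # -> True
print(driver_supports_toolkit("470.82.01", "12.4"))  # -> False: driver predates CUDA 12
```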
Monitoring tools, in increasing order of detail:
- nvidia-smi: per-GPU utilization, memory, power, processes. The 30-second-check default.
- nvtop: htop-style live view across all GPUs in one terminal.
- dcgmi and DCGM (Data Center GPU Manager): detailed per-job telemetry, suitable for cluster-wide monitoring with Prometheus.
- PyTorch profiler (torch.profiler.profile) or NVIDIA Nsight Systems for kernel-level traces.
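The 30-second check is also scriptable: nvidia-smi's CSV query interface is the usual way to feed GPU state into dashboards or preflight checks. A minimal sketch (the query fields and format flags are standard nvidia-smi options; the parser runs without a GPU):

```python
import shutil
import subprocess

QUERY = "index,utilization.gpu,memory.used,memory.total"

def parse_smi_csv(text: str) -> list:
    """Parse 'nvidia-smi --query-gpu=... --format=csv,noheader,nounits'
    output into one dict per GPU."""
    rows = []
    for line in text.strip().splitlines():
        idx, util, used, total = [f.strip() for f in line.split(",")]
        rows.append({
            "index": int(idx),
            "util_pct": int(util),
            "mem_used_mib": int(used),
            "mem_total_mib": int(total),
        })
    return rows

def gpu_snapshot() -> list:
    """Run the query if nvidia-smi is on PATH; empty list otherwise."""
    if shutil.which("nvidia-smi") is None:
        return []
    out = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_smi_csv(out)
```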
When You'd Use It
For multi-GPU training, the communication library is almost always NCCL (NVIDIA Collective Communications Library), which implements all-reduce, all-gather, broadcast, and reduce-scatter optimized for the underlying interconnect. Topology matters: NVLink (intra-node, ~600-900 GB/s on H100) is roughly 10x faster than PCIe Gen 5 (~64 GB/s per direction). Cross-node traffic uses InfiniBand or RoCE at 200-400 Gbps. A misconfigured NCCL_SOCKET_IFNAME silently routes through Ethernet and tanks throughput.
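Pinning NCCL to the intended interconnect is an environment-variable exercise. A sketch, with site-specific placeholders for the interface and HCA names (ens5f0, mlx5 here are assumptions; use the names from your own `ip link` / `ibstat` output), set before the first collective runs:

```python
import os

# Placeholders: substitute your cluster's actual interface/HCA names.
nccl_env = {
    "NCCL_SOCKET_IFNAME": "ens5f0",  # bootstrap/TCP interface; a wrong value here
                                     # is the silent-Ethernet-fallback failure mode
    "NCCL_IB_HCA": "mlx5",           # restrict RDMA traffic to the InfiniBand HCAs
    "NCCL_DEBUG": "INFO",            # log transport selection at startup
}
os.environ.update(nccl_env)

# With NCCL_DEBUG=INFO, check the startup log's channel lines: cross-node
# traffic should report an IB/RDMA transport, not a plain socket transport.
```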
Parallelism strategies, in order of use as model size grows:
- Data parallel (DP): replicate the full model on each GPU, split the batch. Works while the model fits in one GPU's memory.
- Tensor parallel (TP): split each weight matrix across GPUs (Megatron-LM style). Communication-heavy; needs NVLink within a node.
- Pipeline parallel (PP): split the layer stack across GPUs. Introduces bubble overhead handled by GPipe / 1F1B / interleaved schedules.
- Model parallel is a generic umbrella for any of TP / PP / sharded variants.
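The core idea of tensor parallelism can be shown in a toy example with plain Python lists standing in for device-resident tensors: shard a weight matrix column-wise, let each "rank" compute its output slice independently, then gather. The concatenated slices equal the unsharded matmul, which is exactly why the gather (or a cleverly row-split next layer) is needed before anything that mixes columns.

```python
# Toy column-wise tensor parallelism (Megatron-LM style), no frameworks.

def matmul(x, w):
    """x: list of rows, w: list of rows -> x @ w."""
    cols = len(w[0])
    return [[sum(xr[k] * w[k][j] for k in range(len(w))) for j in range(cols)]
            for xr in x]

def split_columns(w, ranks):
    """Shard w column-wise across `ranks` devices."""
    per = len(w[0]) // ranks
    return [[row[r * per:(r + 1) * per] for row in w] for r in range(ranks)]

x = [[1.0, 2.0]]
w = [[1.0, 2.0, 3.0, 4.0],
     [5.0, 6.0, 7.0, 8.0]]

shards = split_columns(w, ranks=2)           # each rank holds half the columns
partials = [matmul(x, s) for s in shards]    # computed independently, no comms
gathered = [sum((p[0] for p in partials), [])]  # the all-gather step

assert gathered == matmul(x, w)  # sharded result matches the full matmul
```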
FSDP (Fully Sharded Data Parallel, the modern PyTorch default) shards parameters, gradients, and optimizer state across data-parallel ranks, gathering full weights only during forward and backward of each layer. This is functionally equivalent to DeepSpeed ZeRO Stage 3. ZeRO Stage 1 shards only optimizer state, Stage 2 adds gradient sharding, Stage 3 adds parameter sharding.
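The stages translate directly into per-GPU memory arithmetic. A back-of-envelope calculator, using the usual mixed-precision Adam accounting (2 bytes fp16 params + 2 bytes fp16 grads + 12 bytes fp32 optimizer state per parameter, per the ZeRO paper); activations and fragmentation are deliberately ignored:

```python
PARAM, GRAD, OPT = 2, 2, 12  # bytes per parameter (mixed-precision Adam)

def zero_gib_per_gpu(n_params: float, n_gpus: int, stage: int) -> float:
    """Model-state memory per GPU under each ZeRO stage, in GiB."""
    if stage == 0:    # plain data parallel: everything replicated
        per_param = PARAM + GRAD + OPT
    elif stage == 1:  # shard optimizer state
        per_param = PARAM + GRAD + OPT / n_gpus
    elif stage == 2:  # + shard gradients
        per_param = PARAM + (GRAD + OPT) / n_gpus
    elif stage == 3:  # + shard parameters (FSDP's full sharding)
        per_param = (PARAM + GRAD + OPT) / n_gpus
    else:
        raise ValueError(stage)
    return n_params * per_param / 2**30

# A 7B-parameter model on 8 GPUs:
for s in range(4):
    print(f"stage {s}: {zero_gib_per_gpu(7e9, 8, s):.1f} GiB/GPU")
```

The stage-0 number (~104 GiB for 7B parameters) is why plain DP stops working long before the weights alone look "too big".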
The PyTorch profiler output is a table with one row per kernel or op: name, CUDA time, CPU time, and self/total time percentages. Two diagnostics worth memorizing: a long tail of cudaMemcpyAsync means host-device transfers are blocking the training loop (move data prep into DataLoader workers); low GPU utilization with high CPU time means the data pipeline, not the model, is the bottleneck.
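A minimal profiling loop, CPU-only so it runs anywhere (on a GPU box, add ProfilerActivity.CUDA to `activities` and sort by CUDA time instead):

```python
import torch
from torch.profiler import profile, ProfilerActivity

model = torch.nn.Linear(512, 512)
x = torch.randn(64, 512)

with profile(activities=[ProfilerActivity.CPU]) as prof:
    for _ in range(5):
        model(x).sum().backward()

# One row per op: name, self/total time, call count. This is the table the
# diagnostics above are read from.
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))
```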
Notable Gotchas
Out-of-memory does not mean the model is too big
PyTorch reserves a memory pool that is never returned to the driver. An OOM at step 4000 after 4000 successful steps usually means activation memory grew (a longer sequence in the batch, gradient checkpointing turned off, fragmentation from variable-length inputs), not that the model itself does not fit. Set PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True or max_split_size_mb:128 to reduce fragmentation before resorting to a smaller model.
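The one ordering subtlety: the allocator reads this variable when the CUDA caching allocator initializes, so it must be in the environment before the first CUDA allocation, typically at the very top of the entrypoint or in the launcher, never mid-script. A sketch:

```python
import os

# Must be set before the first CUDA allocation (in practice, before any
# .cuda() / .to("cuda") call, and safest before importing torch at all),
# otherwise the caching allocator has already initialized with defaults.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

# Alternative knob for fragmentation from variable-length batches:
# os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"
```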
FSDP and gradient accumulation interact badly
FSDP gathers and reshards parameters at each forward/backward. Naive gradient accumulation re-runs the gather on every microbatch, multiplying communication. Use the no_sync context manager on all microbatches except the last to defer gradient reduction; on FSDP this requires set_gradient_division=False and careful averaging.
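A sketch of the accumulation loop, assuming a DDP- or FSDP-wrapped model exposing the standard no_sync context manager (the hasattr fallback just lets the function run on a bare nn.Module). Note that under FSDP, no_sync keeps full unsharded gradients resident, so the communication saving is paid for in memory:

```python
import contextlib
import torch

def accumulation_step(model, microbatches, loss_fn, optimizer):
    """Accumulate over microbatches, reducing gradients only on the last one."""
    n = len(microbatches)
    for i, (x, y) in enumerate(microbatches):
        # Skip the gradient reduction on every microbatch except the last.
        ctx = (model.no_sync()
               if i < n - 1 and hasattr(model, "no_sync")
               else contextlib.nullcontext())
        with ctx:
            loss = loss_fn(model(x), y) / n  # average, not sum, across microbatches
            loss.backward()
    optimizer.step()
    optimizer.zero_grad(set_to_none=True)
```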
Last reviewed: April 18, 2026
Prerequisites
Foundations this topic depends on.
- GPU Compute Model (Layer 5)