Infrastructure
Running ML Workloads on GPUs
Operational reference for running deep-learning jobs on GPUs: driver/CUDA/cuDNN matrix, monitoring tools, parallelism strategies, FSDP and ZeRO stages, profiler basics.
What It Is
Running an ML job on a GPU correctly requires a stack of compatible system software, an inter-GPU communication path, a parallelism strategy that matches the model size, and the ability to read a profiler when throughput drops below expected. This page is a reference for the moving parts; deeper treatments live on the linked theory pages.
The base stack is three layers: NVIDIA driver (kernel module, version like 535.x or 550.x), CUDA Toolkit (compiler, runtime, libraries; version like 12.4), and cuDNN (deep-learning primitives; version like 9.0). PyTorch and JAX wheels are built against a specific CUDA Toolkit minor version; mismatches surface as undefined symbol errors at import. The driver must be at least as new as the CUDA Toolkit major version requires, but newer drivers are forward-compatible with older Toolkits via the CUDA compatibility runtime.
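The forward-compatibility rule above can be sketched as a version comparison. The minimum-driver table here is illustrative (consult NVIDIA's CUDA compatibility documentation for the authoritative matrix); the point is the check itself, which only needs the driver to meet the floor for the Toolkit's major version.

```python
# Sketch of the driver/Toolkit forward-compatibility check. MIN_DRIVER values
# are illustrative Linux minimums; verify against NVIDIA's compatibility docs.
MIN_DRIVER = {11: (450, 80, 2), 12: (525, 60, 13)}

def parse_version(v: str) -> tuple:
    """'535.154.05' -> (535, 154, 5); pads missing fields with zeros."""
    parts = [int(p) for p in v.split(".")]
    return tuple(parts + [0] * (3 - len(parts)))

def driver_supports_toolkit(driver: str, toolkit: str) -> bool:
    """Newer drivers are forward-compatible with older Toolkits, so the only
    check needed is driver >= the minimum for the Toolkit's major version."""
    required = MIN_DRIVER.get(int(toolkit.split(".")[0]))
    if required is None:
        return False  # unknown Toolkit major: flag for manual review
    return parse_version(driver) >= required

print(driver_supports_toolkit("550.54.15", "12.4"))  # -> True
print(driver_supports_toolkit("470.82.01", "12.4"))  # -> False: driver predates CUDA 12
```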
Monitoring tools, in increasing order of detail:
- nvidia-smi: per-GPU utilization, memory, power, processes. The 30-second-check default.
- nvtop: htop-style live view across all GPUs in one terminal.
- dcgmi and DCGM (Data Center GPU Manager): detailed per-job telemetry, suitable for cluster-wide monitoring with Prometheus.
- PyTorch profiler (torch.profiler.profile) or NVIDIA Nsight Systems for kernel-level traces.
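The 30-second check is also scriptable: nvidia-smi's CSV query interface is the usual way to feed GPU state into dashboards or preflight checks. A minimal sketch (the query fields and format flags are standard nvidia-smi options; the parser runs without a GPU):

```python
import shutil
import subprocess

QUERY = "index,utilization.gpu,memory.used,memory.total"

def parse_smi_csv(text: str) -> list:
    """Parse 'nvidia-smi --query-gpu=... --format=csv,noheader,nounits'
    output into one dict per GPU."""
    rows = []
    for line in text.strip().splitlines():
        idx, util, used, total = [f.strip() for f in line.split(",")]
        rows.append({
            "index": int(idx),
            "util_pct": int(util),
            "mem_used_mib": int(used),
            "mem_total_mib": int(total),
        })
    return rows

def gpu_snapshot() -> list:
    """Run the query if nvidia-smi is on PATH; empty list otherwise."""
    if shutil.which("nvidia-smi") is None:
        return []
    out = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_smi_csv(out)
```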
When You'd Use It
For multi-GPU training, the communication library is almost always NCCL (NVIDIA Collective Communications Library), which implements all-reduce, all-gather, broadcast, and reduce-scatter optimized for the underlying interconnect. Topology matters: NVLink (intra-node, ~600-900 GB/s on H100) is roughly 10x faster than PCIe Gen 5 (~64 GB/s per direction). Cross-node traffic uses InfiniBand or RoCE at 200-400 Gbps. A misconfigured NCCL_SOCKET_IFNAME silently routes through Ethernet and tanks throughput.
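Pinning NCCL to the intended interconnect is an environment-variable exercise. A sketch, with site-specific placeholders for the interface and HCA names (ens5f0, mlx5 here are assumptions; use the names from your own `ip link` / `ibstat` output), set before the first collective runs:

```python
import os

# Placeholders: substitute your cluster's actual interface/HCA names.
nccl_env = {
    "NCCL_SOCKET_IFNAME": "ens5f0",  # bootstrap/TCP interface; a wrong value here
                                     # is the silent-Ethernet-fallback failure mode
    "NCCL_IB_HCA": "mlx5",           # restrict RDMA traffic to the InfiniBand HCAs
    "NCCL_DEBUG": "INFO",            # log transport selection at startup
}
os.environ.update(nccl_env)

# With NCCL_DEBUG=INFO, check the startup log's channel lines: cross-node
# traffic should report an IB/RDMA transport, not a plain socket transport.
```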
Parallelism strategies, in order of use as model size grows:
- Data parallel (DP): replicate the full model on each GPU, split the batch. Works while the model fits in one GPU's memory.
- Tensor parallel (TP): split each weight matrix across GPUs (Megatron-LM style). Communication-heavy; needs NVLink within a node.
- Pipeline parallel (PP): split the layer stack across GPUs. Introduces bubble overhead handled by GPipe / 1F1B / interleaved schedules.
- Model parallel is a generic umbrella for any of TP / PP / sharded variants.
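The core idea of tensor parallelism can be shown in a toy example with plain Python lists standing in for device-resident tensors: shard a weight matrix column-wise, let each "rank" compute its output slice independently, then gather. The concatenated slices equal the unsharded matmul, which is exactly why the gather (or a cleverly row-split next layer) is needed before anything that mixes columns.

```python
# Toy column-wise tensor parallelism (Megatron-LM style), no frameworks.

def matmul(x, w):
    """x: list of rows, w: list of rows -> x @ w."""
    cols = len(w[0])
    return [[sum(xr[k] * w[k][j] for k in range(len(w))) for j in range(cols)]
            for xr in x]

def split_columns(w, ranks):
    """Shard w column-wise across `ranks` devices."""
    per = len(w[0]) // ranks
    return [[row[r * per:(r + 1) * per] for row in w] for r in range(ranks)]

x = [[1.0, 2.0]]
w = [[1.0, 2.0, 3.0, 4.0],
     [5.0, 6.0, 7.0, 8.0]]

shards = split_columns(w, ranks=2)           # each rank holds half the columns
partials = [matmul(x, s) for s in shards]    # computed independently, no comms
gathered = [sum((p[0] for p in partials), [])]  # the all-gather step

assert gathered == matmul(x, w)  # sharded result matches the full matmul
```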
FSDP (Fully Sharded Data Parallel, the modern PyTorch default) shards parameters, gradients, and optimizer state across data-parallel ranks, gathering full weights only during forward and backward of each layer. This is functionally equivalent to DeepSpeed ZeRO Stage 3. ZeRO Stage 1 shards only optimizer state, Stage 2 adds gradient sharding, Stage 3 adds parameter sharding.
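The stages translate directly into per-GPU memory arithmetic. A back-of-envelope calculator, using the usual mixed-precision Adam accounting (2 bytes fp16 params + 2 bytes fp16 grads + 12 bytes fp32 optimizer state per parameter, per the ZeRO paper); activations and fragmentation are deliberately ignored:

```python
PARAM, GRAD, OPT = 2, 2, 12  # bytes per parameter (mixed-precision Adam)

def zero_gib_per_gpu(n_params: float, n_gpus: int, stage: int) -> float:
    """Model-state memory per GPU under each ZeRO stage, in GiB."""
    if stage == 0:    # plain data parallel: everything replicated
        per_param = PARAM + GRAD + OPT
    elif stage == 1:  # shard optimizer state
        per_param = PARAM + GRAD + OPT / n_gpus
    elif stage == 2:  # + shard gradients
        per_param = PARAM + (GRAD + OPT) / n_gpus
    elif stage == 3:  # + shard parameters (FSDP's full sharding)
        per_param = (PARAM + GRAD + OPT) / n_gpus
    else:
        raise ValueError(stage)
    return n_params * per_param / 2**30

# A 7B-parameter model on 8 GPUs:
for s in range(4):
    print(f"stage {s}: {zero_gib_per_gpu(7e9, 8, s):.1f} GiB/GPU")
```

The stage-0 number (~104 GiB for 7B parameters) is why plain DP stops working long before the weights alone look "too big".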
The PyTorch profiler output is a table with one row per kernel or op: name, CUDA time, CPU time, and self/total time percentages. Two diagnostics worth memorizing: a long tail of cudaMemcpyAsync means host-device transfers are blocking the training loop (move data prep into DataLoader workers); low GPU utilization with high CPU time means the data pipeline, not the model, is the bottleneck.
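A minimal profiling loop, CPU-only so it runs anywhere (on a GPU box, add ProfilerActivity.CUDA to `activities` and sort by CUDA time instead):

```python
import torch
from torch.profiler import profile, ProfilerActivity

model = torch.nn.Linear(512, 512)
x = torch.randn(64, 512)

with profile(activities=[ProfilerActivity.CPU]) as prof:
    for _ in range(5):
        model(x).sum().backward()

# One row per op: name, self/total time, call count. This is the table the
# diagnostics above are read from.
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))
```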
Notable Gotchas
Out-of-memory does not mean the model is too big
PyTorch reserves a memory pool that is never returned to the driver. An OOM at step 4000 after 4000 successful steps usually means activation memory grew (a longer sequence in the batch, gradient checkpointing turned off, fragmentation from variable-length inputs), not that the model itself does not fit. Set PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True or max_split_size_mb:128 to reduce fragmentation before resorting to a smaller model.
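The one ordering subtlety: the allocator reads this variable when the CUDA caching allocator initializes, so it must be in the environment before the first CUDA allocation, typically at the very top of the entrypoint or in the launcher, never mid-script. A sketch:

```python
import os

# Must be set before the first CUDA allocation (in practice, before any
# .cuda() / .to("cuda") call, and safest before importing torch at all),
# otherwise the caching allocator has already initialized with defaults.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

# Alternative knob for fragmentation from variable-length batches:
# os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"
```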
FSDP and gradient accumulation interact badly
FSDP gathers and reshards parameters at each forward/backward. Naive gradient accumulation re-runs the gather on every microbatch, multiplying communication. Use the no_sync context manager on all microbatches except the last to defer gradient reduction; on FSDP this requires set_gradient_division=False and careful averaging.
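A sketch of the accumulation loop, assuming a DDP- or FSDP-wrapped model exposing the standard no_sync context manager (the hasattr fallback just lets the function run on a bare nn.Module). Note that under FSDP, no_sync keeps full unsharded gradients resident, so the communication saving is paid for in memory:

```python
import contextlib
import torch

def accumulation_step(model, microbatches, loss_fn, optimizer):
    """Accumulate over microbatches, reducing gradients only on the last one."""
    n = len(microbatches)
    for i, (x, y) in enumerate(microbatches):
        # Skip the gradient reduction on every microbatch except the last.
        ctx = (model.no_sync()
               if i < n - 1 and hasattr(model, "no_sync")
               else contextlib.nullcontext())
        with ctx:
            loss = loss_fn(model(x), y) / n  # average, not sum, across microbatches
            loss.backward()
    optimizer.step()
    optimizer.zero_grad(set_to_none=True)
```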
Last reviewed: April 18, 2026
Prerequisites
Foundations this topic depends on.
- GPU Compute Model (Layer 5)