Hardware for ML Practitioners
Practical hardware guidance for ML work: GPU memory as the real bottleneck, when local GPUs make sense, cloud options compared, and why you should not spend $5000 before knowing what you need.
Why This Matters
Your hardware choices determine what experiments you can run, how fast you iterate, and how much you spend. Bad hardware decisions waste money. Good ones let you focus on the actual ML problem.
The single most important fact: GPU memory (VRAM), not raw compute, is the bottleneck for most ML workloads. A model that does not fit in VRAM cannot run, regardless of how fast the GPU is.
Mental Model
Think of ML hardware in three tiers:
- Development machine: where you write code, debug, run small tests
- Training compute: where you run real experiments on real data
- Inference hardware: where you serve a trained model
Most practitioners need a good laptop for tier 1 and cloud access for tier 2. Tier 3 depends on your deployment context.
The GPU Memory Bottleneck
VRAM Bottleneck
A model with N parameters stored in 16-bit floating point requires approximately 2N bytes of VRAM just to store the weights. During training with Adam in mixed precision, you need roughly 16N bytes total (weights, gradients, optimizer states). A model can only train on a GPU if the total memory requirement fits in VRAM.
Concrete numbers for a 7B parameter model at fp16:
- Weight storage: 2 x 7B = 14 GB
- Training with Adam: 16 x 7B = 112 GB
A consumer RTX 3090 has 24 GB of VRAM. You can run inference on a 7B model (the 14 GB of fp16 weights fit, and quantization leaves even more headroom), but full fine-tuning needs on the order of 112 GB, which requires multi-GPU setups or parameter-efficient methods like LoRA.
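The arithmetic above can be packaged as a small estimator. This is a minimal sketch; the function name and mode labels are illustrative, and the multipliers are the byte counts used in this section (2 bytes/param for fp16 inference, 16 bytes/param for Adam training in mixed precision), ignoring activations and framework overhead.

```python
def estimate_vram_gb(n_params, mode="inference_fp16"):
    """Rough VRAM estimate in GB, ignoring activations and overhead.

    Multipliers follow this section's buffer counts:
    2 bytes/param for fp16 weights, 16 bytes/param for Adam
    training in mixed precision.
    """
    bytes_per_param = {
        "inference_fp16": 2,     # fp16 weights only
        "train_adam_mixed": 16,  # weights + master weights + moments + grads
    }[mode]
    return n_params * bytes_per_param / 1e9


# 7B-parameter model, as in the text
print(estimate_vram_gb(7e9, "inference_fp16"))    # -> 14.0 (GB)
print(estimate_vram_gb(7e9, "train_adam_mixed"))  # -> 112.0 (GB)
```

Plugging in other model sizes (3B, 13B, 70B) shows quickly which workloads a given card can even attempt.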
Training Memory Lower Bound
Statement
Training a model with N parameters using Adam in mixed precision requires at least 16N bytes of GPU memory: 2N for fp16 weights, 4N for fp32 master weights, 4N for the first moment, 4N for the second moment, and 2N for fp16 gradients.
Intuition
Adam maintains two running averages (momentum and squared gradients) per parameter, both in fp32. Add the fp32 master weights plus the fp16 working weights and gradients, and memory usage comes to roughly 8x the raw model size in fp16 (16N bytes versus 2N).
Proof Sketch
Count the buffers: fp16 weights (2N), fp32 master weights (4N), fp32 first moment (4N), fp32 second moment (4N), fp16 gradients (2N). Sum: 2N + 4N + 4N + 4N + 2N = 16N bytes. Activation memory adds further cost that depends on batch size and sequence length.
Why It Matters
This bound determines which models fit on which GPUs. An A100 with 80 GB can train models up to about 5B parameters with Adam in mixed precision. Larger models require model parallelism, gradient checkpointing, or parameter-efficient methods.
Failure Mode
This estimate ignores activation memory, which can dominate for large batch sizes and long sequences. Activation checkpointing trades compute for memory, reducing activation cost at the expense of recomputation.
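To see how large the ignored activation term can get, here is a hedged sketch using one published rule of thumb: the per-layer fp16 activation estimate of roughly s*b*h*(34 + 5*a*s/h) bytes from Korthikanti et al. (2022), where s is sequence length, b batch size, h hidden size, and a attention heads. The config below is an illustrative 7B-class shape, not a measurement; real usage varies with the implementation.

```python
def activation_gb_per_layer(batch, seq, hidden, heads):
    """Approximate fp16 activation memory per transformer layer.

    Uses the s*b*h*(34 + 5*a*s/h) bytes rule of thumb from
    Korthikanti et al. (2022); actual usage depends on the
    implementation and any activation checkpointing.
    """
    return seq * batch * hidden * (34 + 5 * heads * seq / hidden) / 1e9


# Illustrative 7B-class shape: hidden 4096, 32 heads, 32 layers
per_layer = activation_gb_per_layer(batch=8, seq=2048, hidden=4096, heads=32)
print(per_layer, per_layer * 32)  # per layer, and summed over 32 layers
```

At batch 8 and sequence length 2048 this comes out to several GB per layer, which is why activations (not weights) often dominate and why checkpointing pays off.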
Consumer GPU Comparison
| GPU | VRAM | Approximate Cost | Good For |
|---|---|---|---|
| RTX 3090 | 24 GB | $800 used | Small experiments, inference |
| RTX 4090 | 24 GB | $1600 new | Same VRAM, faster compute |
| A100 (cloud) | 40/80 GB | $1-2/hr | Serious training |
| H100 (cloud) | 80 GB | $2-4/hr | Large-scale training |
The 3090 and 4090 have identical VRAM. The 4090 is faster but does not let you train larger models. VRAM is the hard constraint.
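The hard constraint in the table can be inverted: given a card's VRAM, the 16-bytes-per-parameter bound gives the largest model it can full-train with Adam in mixed precision. A minimal sketch (function name illustrative, activation memory ignored):

```python
def max_train_params_b(vram_gb):
    """Largest model (billions of params) trainable with Adam in mixed
    precision at 16 bytes/param, ignoring activation memory."""
    return vram_gb / 16


for gpu, vram in [("RTX 3090/4090", 24), ("A100 80GB", 80), ("H100", 80)]:
    print(f"{gpu}: ~{max_train_params_b(vram):.1f}B params")
```

This reproduces the section's numbers: a 24 GB card tops out around 1.5B parameters for full Adam training, and an 80 GB A100 around 5B.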
When Local GPU Makes Sense
Local hardware is justified when:
- You run many small experiments daily (the per-hour cloud cost adds up)
- You have privacy requirements that prohibit cloud uploads
- You primarily do inference on models that fit in 24 GB
- You need low-latency iteration cycles (no cloud startup overhead)
Local hardware is not justified when:
- You do not yet know what workloads you will run
- You need more than 24 GB VRAM (multi-GPU consumer setups are fragile)
- You train infrequently (a few times per month)
- You need to scale up and down based on project demands
Cloud Options
Colab (Google): Free tier gives a T4 (16 GB VRAM) with time limits. Good for learning, bad for real work. Pro tier ($10/month) gives better GPUs but still has session limits.
RunPod: On-demand GPU instances. Good prices for A100 and H100. No commitment. Pay by the hour. Good for individual practitioners.
Lambda Cloud: Reserved instances at competitive rates. Better for teams that need consistent capacity.
AWS/GCP/Azure: Enterprise options with the full ecosystem (storage, networking, MLOps tools). Higher prices, more overhead, better for organizations with existing cloud infrastructure.
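One way to choose between these options and a local card is a break-even estimate. The numbers below are hypothetical placeholders (an $800 used RTX 3090, a cloud A100 at $1.50/hr, 40 GPU-hours/month), and the two GPUs are not performance-equivalent; the sketch also ignores electricity, depreciation, and resale value.

```python
def monthly_cloud_cost(hours_per_month, rate_per_hour):
    """Cloud spend per month at a flat hourly rate."""
    return hours_per_month * rate_per_hour


def breakeven_months(gpu_price, hours_per_month, rate_per_hour):
    """Months of cloud use after which buying the GPU would have been
    cheaper, ignoring electricity, resale value, and speed differences."""
    return gpu_price / monthly_cloud_cost(hours_per_month, rate_per_hour)


# Hypothetical: $800 used 3090 vs an A100 at $1.50/hr, ~40 hrs/month
print(breakeven_months(800, 40, 1.50))  # ~13.3 months
```

If your usage is well below 40 hours/month, the break-even horizon stretches past the card's useful life, which is the quantitative version of "rent until you know your workload."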
Operating System
Linux is the default for serious ML work. CUDA, PyTorch, and most ML tooling are Linux-first. Docker containers assume Linux. If you use cloud GPUs, you are using Linux.
For local development, macOS (especially Apple Silicon) works well. Apple's M-series chips have unified memory shared between CPU and GPU, which means a MacBook Pro with 36 GB RAM can run inference on models up to about 30B parameters (quantized). This is excellent for development, prototyping, and running local LLMs.
Windows works but creates friction: path handling, environment management, and CUDA configuration are all harder than on Linux or macOS.
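In practice, code that runs on all three platforms usually just probes for the best backend at startup. A minimal sketch using PyTorch's real availability checks (`torch.cuda.is_available()` and `torch.backends.mps.is_available()`), falling back to CPU when PyTorch is not installed:

```python
def pick_device():
    """Return 'cuda' on an NVIDIA GPU, 'mps' on Apple Silicon,
    else 'cpu'. Falls back to 'cpu' if PyTorch is unavailable."""
    try:
        import torch
    except ImportError:
        return "cpu"
    if torch.cuda.is_available():
        return "cuda"
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"


print(pick_device())
```

Writing code this way means the same script works on a Linux cloud box, a MacBook, and a CPU-only CI runner without edits.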
Common Confusions
More FLOPS does not mean you can train larger models
A GPU with higher compute throughput but less VRAM cannot train models that do not fit in memory. The RTX 4090 is about 2x faster than the 3090 in raw FLOPS, but both have 24 GB VRAM. They can train the same models; the 4090 just finishes faster.
Apple Silicon is not a replacement for NVIDIA GPUs
Apple M-series chips run PyTorch via the MPS backend, but training is significantly slower than CUDA on NVIDIA GPUs. Apple Silicon is excellent for inference and development, not for large-scale training.
Exercises
Problem
A model has 3B parameters. You want to fine-tune it with Adam in mixed precision. How much VRAM do you need? Can you do this on an RTX 3090?
Problem
You are a solo researcher who runs 2-3 training jobs per week, each taking 4-6 hours on an A100. Should you buy a local GPU or use cloud compute? Estimate monthly costs for each option.
References
Canonical:
- NVIDIA CUDA Programming Guide, Memory Management sections
- Rajbhandari et al., "ZeRO: Memory Optimizations Toward Training Trillion Parameter Models" (2020), Section 2
Current:
- Hugging Face, "Model Memory Anatomy" documentation (2024)
- Lambda Labs GPU Benchmarks (updated regularly)
- Hastie, Tibshirani, Friedman, The Elements of Statistical Learning (2009), Chapters 7-8
- Shalev-Shwartz & Ben-David, Understanding Machine Learning (2014), Chapters 11-14
Next Topics
- ML project lifecycle: the full workflow from problem definition to deployment
- Experiment tracking and tooling: managing runs, hyperparameters, and results
Last reviewed: April 2026