LLM Construction
AMD Competition Landscape
AMD's MI300X and MI325X GPUs compete with NVIDIA on memory bandwidth and capacity but lag on software ecosystem. Competition matters because pricing, supply diversity, and vendor lock-in determine who can train and serve models.
Why This Matters
NVIDIA controls roughly 80-90% of the AI accelerator market, built on the GPU compute model. This concentration gives NVIDIA pricing power and creates supply bottlenecks. AMD is the primary alternative for data center AI compute. Whether AMD can credibly compete affects GPU prices, supply availability, and the risk of vendor lock-in for anyone training or serving large models. The underlying chip supply chain is detailed in ASML and chip manufacturing.
This is not a question of which chip is "better" in isolation. It is a question of market structure and its consequences for AI development.
Mental Model
GPU competition in AI comes down to three factors, roughly in order of importance:
- Software ecosystem: Can existing code run on the hardware with minimal changes?
- Memory capacity and bandwidth: How large a model can you serve, and how fast?
- Compute throughput: Raw FLOPS for matrix multiplications.
NVIDIA leads decisively on (1), AMD is competitive on (2), and both are competitive on (3). The software gap is the binding constraint.
Hardware Comparison
MI300X Specifications
AMD Instinct MI300X (launched late 2023):
- 192GB HBM3 memory (vs. 80GB on H100 SXM)
- 5.3 TB/s memory bandwidth (vs. 3.35 TB/s on H100)
- 1307 TFLOPS BF16 peak (vs. 990 TFLOPS on H100)
- 750W TDP
- CDNA 3 architecture, chiplet design with 8 XCDs
MI325X Specifications
AMD Instinct MI325X (launched 2024):
- 256GB HBM3E memory
- 6.0 TB/s memory bandwidth
- Similar compute to MI300X with architectural refinements
- Targets inference workloads where memory capacity is the bottleneck
The MI300X has 2.4x the memory capacity and 1.6x the memory bandwidth of the H100. For inference of large models where the bottleneck is loading weights from HBM (memory-bandwidth-bound regime), more memory bandwidth directly translates to higher throughput.
The Roofline Perspective
Memory-Bandwidth-Bound Regime
Statement
For a workload with arithmetic intensity $I$ (FLOPs per byte of memory accessed), the achievable throughput is:
$$P = \min(P_{\text{peak}},\; B \cdot I)$$
where $B$ is memory bandwidth and $P_{\text{peak}}$ is peak compute throughput. When $I < P_{\text{peak}} / B$ (the ridge point), the workload is memory-bandwidth-bound and throughput scales linearly with bandwidth.
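The roofline relation can be sketched numerically using the peak and bandwidth figures quoted in this article (published peaks; sustained values in practice are lower):

```python
# Roofline model: attainable throughput given arithmetic intensity.
# Spec figures are the published H100 SXM and MI300X numbers from this
# article; real sustained throughput is below these peaks.

def roofline_tflops(intensity_flops_per_byte, peak_tflops, bandwidth_tbs):
    """Attainable TFLOPS = min(peak compute, bandwidth * intensity)."""
    return min(peak_tflops, bandwidth_tbs * intensity_flops_per_byte)

H100 = dict(peak_tflops=990.0, bandwidth_tbs=3.35)
MI300X = dict(peak_tflops=1307.0, bandwidth_tbs=5.3)

for name, gpu in [("H100", H100), ("MI300X", MI300X)]:
    # Ridge point: intensity where the bandwidth and compute limits meet.
    ridge = gpu["peak_tflops"] / gpu["bandwidth_tbs"]
    print(f"{name}: ridge point ~ {ridge:.0f} FLOPs/byte")
    # At intensity 1 (batch-1 LLM decoding), throughput is bandwidth-limited:
    print(f"  attainable at I=1: {roofline_tflops(1, **gpu):.2f} TFLOPS")
```

Both GPUs have ridge points near 250-300 FLOPs/byte, so a batch-1 decoding workload at roughly 1 FLOP/byte sits deep in the bandwidth-bound regime on either vendor's hardware.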
Intuition
LLM inference at small batch sizes is memory-bandwidth-bound (a pattern that interacts with scaling laws as models grow). Each token generation requires loading the entire model's weights from HBM. For a 70B parameter model in FP16 (140GB), generating one token loads 140GB from memory. If batch size is 1, the arithmetic intensity is roughly 1 FLOP/byte (one multiply-add per weight loaded). This is far below the ridge point of modern GPUs (roughly 150-300 FLOPS/byte), so throughput is entirely determined by memory bandwidth.
In this regime, the MI300X's 5.3 TB/s bandwidth gives it a direct advantage over the H100's 3.35 TB/s for single-stream inference latency.
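That advantage can be made concrete with a back-of-envelope estimate, assuming decoding is purely bandwidth-bound (an idealized ceiling; real serving stacks reach only a fraction of it):

```python
# Single-stream decode throughput ceiling for a 70B BF16 model (140 GB of
# weights), assuming every generated token streams all weights from HBM
# exactly once and nothing else limits throughput.

WEIGHTS_GB = 140.0

def tokens_per_second(bandwidth_tbs, weights_gb=WEIGHTS_GB):
    return bandwidth_tbs * 1000.0 / weights_gb  # TB/s -> GB/s

print(f"H100:   ~{tokens_per_second(3.35):.0f} tokens/s upper bound")
print(f"MI300X: ~{tokens_per_second(5.3):.0f} tokens/s upper bound")
```

The ratio of the two ceilings is exactly the bandwidth ratio, which is why single-stream latency is the regime where the MI300X's hardware advantage is clearest.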
Failure Mode
At large batch sizes, inference becomes compute-bound (arithmetic intensity increases because the same weights serve multiple sequences). In the compute-bound regime, raw FLOPS matter more than bandwidth, and NVIDIA's mature tensor core architecture and higher effective utilization (due to better software) often close or reverse the hardware advantage.
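A rough estimate of where that transition happens, under the simplifying assumption that batched decoding does 2 FLOPs per parameter per sequence over 2-byte BF16 weights (so arithmetic intensity is approximately the batch size in FLOPs/byte):

```python
# Approximate crossover batch size where 70B BF16 decoding flips from
# bandwidth-bound to compute-bound on an H100. With intensity ~ batch_size
# FLOPs/byte, the crossover sits at the ridge point for sustained compute.

SUSTAINED_TFLOPS = 500.0   # assumed sustained BF16, roughly half of peak
BANDWIDTH_TBS = 3.35

ridge = SUSTAINED_TFLOPS / BANDWIDTH_TBS  # FLOPs per byte
print(f"crossover batch size ~ {ridge:.0f}")
```

Below this batch size, bandwidth dominates and the MI300X's hardware edge applies; above it, sustained FLOPS and kernel quality dominate, which favors NVIDIA's software stack.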
The Software Gap: CUDA vs. ROCm
The most important difference between AMD and NVIDIA for AI workloads is software.
CUDA ecosystem advantages:
- 15+ years of libraries, tooling, and community knowledge
- cuDNN, cuBLAS, TensorRT, NCCL are heavily optimized and battle-tested
- Nearly all ML frameworks (PyTorch, JAX, TensorFlow) were developed CUDA-first
- Third-party libraries (FlashAttention, vLLM, TensorRT-LLM) often launch CUDA-only
- Profiling tools (Nsight, nvprof) are mature
ROCm ecosystem status:
- PyTorch has official ROCm support; most standard training loops work
- HIP is AMD's CUDA-like runtime API; the hipify tools translate most CUDA code to HIP with little manual work
- Custom CUDA kernels (FlashAttention, fused operations) require porting effort
- Multi-GPU communication (RCCL vs. NCCL) is functional but less optimized
- Profiling and debugging tools are less mature
- Library coverage is narrower: not all CUDA libraries have ROCm equivalents
The practical consequence: running a standard PyTorch training loop on AMD GPUs works. Running a highly optimized inference stack (custom attention kernels, quantized operations, speculative decoding) requires significant engineering effort to port and tune.
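The "standard PyTorch just works" claim rests on a design choice worth seeing: ROCm builds of PyTorch expose AMD GPUs through the same `torch.cuda` API, so device-agnostic code runs unchanged. A minimal sketch (falls back to CPU when no GPU build is present):

```python
# Device-agnostic PyTorch: the same call path runs on CUDA and ROCm builds,
# because ROCm PyTorch maps AMD GPUs onto the "cuda" device namespace.
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

# torch.version.hip is set on ROCm builds, torch.version.cuda on CUDA builds.
backend = "ROCm" if getattr(torch.version, "hip", None) else "CUDA/CPU"
print(f"device={device}, backend={backend}")

model = torch.nn.Linear(1024, 1024).to(device)
x = torch.randn(8, 1024, device=device)
y = model(x)          # identical code on either vendor's hardware
print(y.shape)
```

The gap appears one level down: anything that ships hand-written CUDA kernels (custom attention, fused ops) bypasses this abstraction and must be ported and re-tuned for ROCm.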
Who Uses AMD GPUs
Several large-scale deployments use AMD MI300X:
- Microsoft Azure offers MI300X instances and uses them internally
- Meta has deployed MI300X clusters for training and inference
- Oracle Cloud offers MI300X instances
These deployments validate that AMD hardware works at scale. But they also highlight that adoption requires dedicated software engineering teams to optimize the ROCm stack for specific workloads.
Why Competition Matters
The consequences of GPU market concentration:
- Pricing: With limited competition, NVIDIA can price H100/B200 systems at high margins. AMD's MI300X is priced lower per unit of memory and bandwidth.
- Supply: When NVIDIA allocates limited supply, organizations without large purchase commitments cannot access GPUs. AMD provides an alternative supply source.
- Vendor lock-in: Code written for CUDA does not trivially move to other platforms. Organizations that invest heavily in CUDA-specific optimizations face switching costs. This lock-in strengthens NVIDIA's position over time.
- Innovation pressure: Competition forces both vendors to improve. NVIDIA's rapid cadence (Hopper to Blackwell to Rubin) is partly a response to AMD's improving competitiveness.
Common Confusions
More HBM does not always mean faster inference
The MI300X has 192GB vs. H100's 80GB, but this matters only if your model needs more than 80GB. For models that fit on one H100 (e.g., 7B-13B models), the extra memory is unused. The bandwidth advantage is always relevant, but the capacity advantage is model-size-dependent.
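The capacity question reduces to simple arithmetic on weight size (weights only; KV cache and activations need additional headroom on top):

```python
# Does a model's BF16 weight footprint fit in one GPU's HBM?
# Weights only -- KV cache and activations require extra headroom.

def weights_gb(params_billion, bytes_per_param=2):  # BF16 = 2 bytes/param
    return params_billion * bytes_per_param

for params in (7, 13, 70):
    gb = weights_gb(params)
    print(f"{params}B: {gb} GB  fits H100 (80GB): {gb <= 80}"
          f"  fits MI300X (192GB): {gb <= 192}")
```

A 7B or 13B model fits comfortably on either GPU, so the MI300X's capacity edge is irrelevant there; at 70B in BF16, only the MI300X holds the weights on a single device.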
Peak FLOPS comparisons are misleading
AMD and NVIDIA report peak FLOPS under different conditions (sparsity, data types, sustained vs. burst). The MI300X's 1307 TFLOPS BF16 and the H100's 990 TFLOPS BF16 are not directly comparable because sustained throughput depends on memory bandwidth, cache behavior, and software efficiency. Real-world kernel benchmarks are the only reliable comparison.
ROCm compatibility does not mean performance parity
A PyTorch model that runs on ROCm may achieve 60-80% of the performance of the same model on CUDA, even on hardware with comparable specs. The gap comes from less-optimized kernels, communication libraries, and memory management. The hardware may be competitive; the software is not yet at parity for all workloads.
Exercises
Problem
A 70B parameter model stored in BF16 requires 140GB of weight data. For single-batch autoregressive inference (one token at a time), estimate the maximum tokens per second on (a) H100 at 3.35 TB/s bandwidth and (b) MI300X at 5.3 TB/s bandwidth. Assume the workload is purely memory-bandwidth-bound.
Problem
At what batch size does inference for a 70B BF16 model transition from memory-bandwidth-bound to compute-bound on an H100? Assume each token requires roughly 2 × 70 × 10^9 = 1.4 × 10^11 FLOPs per sequence (two FLOPs per parameter for a single forward pass) and the H100 sustains 500 TFLOPS BF16 (roughly half of peak).
References
Canonical:
- Williams, Waterman, Patterson, "Roofline: An Insightful Visual Performance Model for Multicore Architectures" (CACM 2009)
Current:
- AMD Instinct MI300X Whitepaper (2023)
- AMD Instinct MI325X Datasheet (2024)
- Patel, Afzal, "GPU Benchmarking for LLM Inference" (SemiAnalysis, 2024)
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- GPU Compute Model (Layer 5)