
LLM Construction

AMD Competition Landscape

AMD's MI300X and MI325X GPUs compete with NVIDIA on memory bandwidth and capacity but lag on software ecosystem. Competition matters because pricing, supply diversity, and vendor lock-in determine who can train and serve models.


Why This Matters

NVIDIA controls roughly 80-90% of the AI accelerator market, built on the GPU compute model. This concentration gives NVIDIA pricing power and creates supply bottlenecks. AMD is the primary alternative for data center AI compute. Whether AMD can credibly compete affects GPU prices, supply availability, and the risk of vendor lock-in for anyone training or serving large models. The underlying chip supply chain is detailed in ASML and chip manufacturing.

This is not a question of which chip is "better" in isolation. It is a question of market structure and its consequences for AI development.

Mental Model

GPU competition in AI comes down to three factors, roughly in order of importance:

  1. Software ecosystem: Can existing code run on the hardware with minimal changes?
  2. Memory capacity and bandwidth: How large a model can you serve, and how fast?
  3. Compute throughput: Raw FLOPS for matrix multiplications.

NVIDIA leads decisively on (1), AMD is competitive on (2), and both are competitive on (3). The software gap is the binding constraint.

Hardware Comparison

Definition

MI300X Specifications

AMD Instinct MI300X (launched late 2023):

  • 192GB HBM3 memory (vs. 80GB on H100 SXM)
  • 5.3 TB/s memory bandwidth (vs. 3.35 TB/s on H100)
  • 1307 TFLOPS BF16 peak (vs. 990 TFLOPS on H100)
  • 750W TDP
  • CDNA 3 architecture, chiplet design with 8 XCDs
Definition

MI325X Specifications

AMD Instinct MI325X (launched 2024):

  • 256GB HBM3E memory
  • 6.0 TB/s memory bandwidth
  • Similar compute to MI300X with architectural refinements
  • Targets inference workloads where memory capacity is the bottleneck

The MI300X has 2.4x the memory capacity and 1.6x the memory bandwidth of the H100. For inference of large models where the bottleneck is loading weights from HBM (memory-bandwidth-bound regime), more memory bandwidth directly translates to higher throughput.
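The headline ratios above can be checked directly from the spec-sheet numbers quoted in this section (a quick sketch; figures are the published peak specs, not measured performance):

```python
# Spec-sheet figures quoted in this section (peak values, not sustained).
H100 = {"hbm_gb": 80, "bw_tbs": 3.35, "bf16_tflops": 990}
MI300X = {"hbm_gb": 192, "bw_tbs": 5.3, "bf16_tflops": 1307}

for key, label in [("hbm_gb", "memory capacity"),
                   ("bw_tbs", "memory bandwidth"),
                   ("bf16_tflops", "peak BF16 compute")]:
    ratio = MI300X[key] / H100[key]
    print(f"MI300X / H100 {label}: {ratio:.2f}x")
```

This reproduces the 2.4x capacity and roughly 1.6x bandwidth advantages; the peak-compute ratio (about 1.3x) is the least meaningful of the three, for the reasons discussed under Common Confusions below.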

The Roofline Perspective

Proposition

Memory-Bandwidth-Bound Regime

Statement

For a workload with arithmetic intensity I (FLOPs per byte of memory accessed), the achievable throughput is:

\text{Throughput} = \min(I \times \text{BW}, \text{Peak FLOPS})

where BW is memory bandwidth. When I < Peak FLOPS / BW (the ridge point), the workload is memory-bandwidth-bound and throughput scales linearly with bandwidth.
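The statement can be written as a one-line function. A minimal sketch, using the H100 figures quoted in this section:

```python
def roofline_throughput(intensity, bw_bytes_per_s, peak_flops):
    """Achievable FLOPS for a workload of given arithmetic intensity
    (FLOPs per byte), per the roofline model."""
    return min(intensity * bw_bytes_per_s, peak_flops)

H100_BW = 3.35e12    # bytes/s
H100_PEAK = 990e12   # peak BF16 FLOPS

# Ridge point: the intensity at which compute, not bandwidth, binds.
ridge = H100_PEAK / H100_BW  # roughly 295 FLOPs/byte

# Batch-1 decoding at ~1 FLOP/byte sits far below the ridge point,
# so achievable throughput is a tiny fraction of peak.
low = roofline_throughput(1, H100_BW, H100_PEAK)  # bandwidth-bound
```

With intensity 1 FLOP/byte, the achievable throughput is only 3.35 TFLOPS on a 990-TFLOPS chip, which is why bandwidth, not peak compute, dominates this regime.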

Intuition

LLM inference at small batch sizes is memory-bandwidth-bound (a pattern that interacts with scaling laws as models grow). Each token generation requires loading the entire model's weights from HBM. For a 70B parameter model in FP16 (140GB), generating one token loads 140GB from memory. If batch size is 1, the arithmetic intensity is roughly 1 FLOP/byte (one multiply-add per weight loaded). This is far below the ridge point of modern GPUs (roughly 150-300 FLOPS/byte), so throughput is entirely determined by memory bandwidth.

In this regime, the MI300X's 5.3 TB/s bandwidth gives it a direct advantage over the H100's 3.35 TB/s for single-stream inference latency.
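The single-stream bound above can be sketched numerically: each generated token streams all 140GB of BF16 weights once, so tokens per second is capped at bandwidth divided by weight bytes (an upper bound that ignores KV-cache traffic and kernel overheads):

```python
# 70B parameters x 2 bytes (BF16) = 140 GB of weights per token.
WEIGHT_BYTES = 140e9

for name, bw in [("H100", 3.35e12), ("MI300X", 5.3e12)]:
    tok_per_s = bw / WEIGHT_BYTES  # upper bound, batch size 1
    print(f"{name}: <= {tok_per_s:.1f} tokens/s")
```

This gives roughly 24 tokens/s on the H100 versus roughly 38 on the MI300X, matching the bandwidth ratio.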

Failure Mode

At large batch sizes, inference becomes compute-bound (arithmetic intensity increases because the same weights serve multiple sequences). In the compute-bound regime, raw FLOPS matter more than bandwidth, and NVIDIA's mature tensor core architecture and higher effective utilization (due to better software) often close or reverse the hardware advantage.
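The crossover can be estimated with a rough sketch. At batch size B, each 2-byte BF16 weight loaded performs about 2B FLOPs (one multiply-add per sequence), so arithmetic intensity is roughly B FLOPs/byte; compute binds once that exceeds sustained FLOPS divided by bandwidth. The 500-TFLOPS sustained figure below is an illustrative assumption, not a measured number:

```python
BW = 3.35e12         # H100 memory bandwidth, bytes/s
SUSTAINED = 500e12   # assumed sustained BF16 FLOPS (~half of peak)

# Intensity ~ B FLOPs/byte, so the compute-bound regime begins near:
crossover_batch = SUSTAINED / BW
print(f"compute-bound above batch ~{crossover_batch:.0f}")
```

Under these assumptions the transition happens around batch size 150; below that, bandwidth still limits throughput.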

The Software Gap: CUDA vs. ROCm

The most important difference between AMD and NVIDIA for AI workloads is software.

CUDA ecosystem advantages:

  • 15+ years of libraries, tooling, and community knowledge
  • cuDNN, cuBLAS, TensorRT, NCCL are heavily optimized and battle-tested
  • Nearly all ML frameworks (PyTorch, JAX, TensorFlow) were developed CUDA-first
  • Third-party libraries (FlashAttention, vLLM, TensorRT-LLM) often launch CUDA-only
  • Profiling tools (Nsight, nvprof) are mature

ROCm ecosystem status:

  • PyTorch has official ROCm support; most standard training loops work
  • HIP is AMD's CUDA-like programming interface; automated tools (HIPIFY) translate most CUDA code with modest manual effort
  • Custom CUDA kernels (FlashAttention, fused operations) require porting effort
  • Multi-GPU communication (RCCL vs. NCCL) is functional but less optimized
  • Profiling and debugging tools are less mature
  • Library coverage is narrower: not all CUDA libraries have ROCm equivalents

The practical consequence: running a standard PyTorch training loop on AMD GPUs works. Running a highly optimized inference stack (custom attention kernels, quantized operations, speculative decoding) requires significant engineering effort to port and tune.

Who Uses AMD GPUs

Several large-scale deployments use AMD MI300X:

  • Microsoft Azure offers MI300X instances and uses them internally
  • Meta has deployed MI300X clusters for training and inference
  • Oracle Cloud offers MI300X instances

These deployments validate that AMD hardware works at scale. But they also highlight that adoption requires dedicated software engineering teams to optimize the ROCm stack for specific workloads.

Why Competition Matters

The consequences of GPU market concentration:

  1. Pricing: With limited competition, NVIDIA can price H100/B200 systems at high margins. AMD's MI300X is priced lower per unit of memory and bandwidth.
  2. Supply: When NVIDIA allocates limited supply, organizations without large purchase commitments cannot access GPUs. AMD provides an alternative supply source.
  3. Vendor lock-in: Code written for CUDA does not trivially move to other platforms. Organizations that invest heavily in CUDA-specific optimizations face switching costs. This lock-in strengthens NVIDIA's position over time.
  4. Innovation pressure: Competition forces both vendors to improve. NVIDIA's rapid cadence (Hopper to Blackwell to Rubin) is partly a response to AMD's improving competitiveness.

Common Confusions

Watch Out

More HBM does not always mean faster inference

The MI300X has 192GB vs. the H100's 80GB, but this matters only if your deployment needs it. For models that fit comfortably on one H100 (e.g., 7B-13B models), the extra capacity helps only indirectly, by allowing larger batches and KV caches. The bandwidth advantage is always relevant; the capacity advantage is workload-dependent.
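A quick back-of-envelope check of when capacity actually binds (a simplified sketch that counts only weight memory, ignoring KV cache and activations, which also consume HBM):

```python
def weights_fit(params_billions, bytes_per_param, hbm_gb):
    """True if the model's weights alone fit in the given HBM capacity.
    Ignores KV cache and activation memory."""
    return params_billions * bytes_per_param <= hbm_gb

assert weights_fit(13, 2, 80)        # 13B in BF16 (26 GB) fits on one H100
assert not weights_fit(70, 2, 80)    # 70B in BF16 (140 GB) does not
assert weights_fit(70, 2, 192)       # ...but fits on one MI300X
```

The 70B BF16 case is where the capacity gap bites: one MI300X versus a multi-GPU H100 setup with the added cost of cross-GPU communication.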

Watch Out

Peak FLOPS comparisons are misleading

AMD and NVIDIA report peak FLOPS under different conditions (sparsity, data types, sustained vs. burst). The MI300X's 1307 TFLOPS BF16 and the H100's 990 TFLOPS BF16 are not directly comparable because sustained throughput depends on memory bandwidth, cache behavior, and software efficiency. Real-world kernel benchmarks are the only reliable comparison.

Watch Out

ROCm compatibility does not mean performance parity

A PyTorch model that runs on ROCm may achieve 60-80% of the performance of the same model on CUDA, even on hardware with comparable specs. The gap comes from less-optimized kernels, communication libraries, and memory management. The hardware may be competitive; the software is not yet at parity for all workloads.

Exercises

ExerciseCore

Problem

A 70B parameter model stored in BF16 requires 140GB of weight data. For single-batch autoregressive inference (one token at a time), estimate the maximum tokens per second on (a) H100 at 3.35 TB/s bandwidth and (b) MI300X at 5.3 TB/s bandwidth. Assume the workload is purely memory-bandwidth-bound.

ExerciseAdvanced

Problem

At what batch size does inference for a 70B BF16 model transition from memory-bandwidth-bound to compute-bound on an H100? Assume each token requires 2 \times 70 \times 10^9 FLOPs (two FLOPs per parameter for a single forward pass) and the H100 sustains 500 TFLOPS BF16 (roughly half of peak).

References

Canonical:

  • Williams, Waterman, Patterson, "Roofline: An Insightful Visual Performance Model for Multicore Architectures" (CACM 2009)

Current:

  • AMD Instinct MI300X Whitepaper (2023)
  • AMD Instinct MI325X Datasheet (2024)
  • Patel, Afzal, "GPU Benchmarking for LLM Inference" (SemiAnalysis, 2024)

Last reviewed: April 2026
