
CUDA Programming Fundamentals

Reference for CUDA C++ kernels: thread hierarchy, shared memory, coalescing, warp primitives, NVCC, and when to drop to Triton or CUTLASS instead.


What It Is

CUDA (Compute Unified Device Architecture) is NVIDIA's parallel computing platform, released in 2007. It exposes the GPU as a massively parallel coprocessor through a C++ language extension plus a runtime library. The compiled output runs on NVIDIA hardware only.

A CUDA program splits into host code (runs on the CPU, written in normal C++) and device code (runs on the GPU, written in CUDA C++ inside functions tagged with __global__ or __device__). The function-qualifier set:

  • __global__: a kernel callable from host, executed on device, returns void.
  • __device__: callable from device only, executed on device.
  • __host__: callable from host only, executed on host (the default; useful when combined with __device__ to compile the same function for both).
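A minimal sketch of how the qualifiers compose (all function names here are hypothetical, chosen for illustration):

```cuda
__device__ float square(float x) { return x * x; }   // device-only helper

__host__ __device__ float clamp01(float x) {         // compiled for both CPU and GPU
    return x < 0.f ? 0.f : (x > 1.f ? 1.f : x);
}

// Kernel: callable from host, runs on the device, must return void.
__global__ void squareKernel(float* out, const float* in, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
    if (i < n) out[i] = clamp01(square(in[i]));
}
```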

Device code is launched with the triple-chevron syntax kernel<<<grid, block>>>(args), where grid is the number of thread blocks and block is the number of threads per block. All threads of a block execute on the same Streaming Multiprocessor (SM), share a fast on-chip scratchpad called shared memory (carved from the same SRAM as L1 on recent architectures), and can synchronize via __syncthreads(). Within a block, threads are scheduled in groups of 32 called warps; a warp issues one instruction for all 32 lanes (SIMT execution), so divergent branches inside a warp serialize.
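The standard launch idiom rounds the grid size up so every element gets a thread, then guards the tail inside the kernel. A sketch (kernel and helper names hypothetical):

```cuda
__global__ void scaleKernel(float* data, float s, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= s;                  // guard: the grid may overshoot n
}

void launchScale(float* d_data, float s, int n) {
    int block = 256;                          // threads per block; keep it a multiple of the warp size (32)
    int grid  = (n + block - 1) / block;      // ceil(n / block) thread blocks
    scaleKernel<<<grid, block>>>(d_data, s, n);
}
```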

NVCC is the CUDA compiler driver; it splits a .cu file into host C++ (passed to gcc/clang/MSVC) and device code compiled to PTX and/or SASS (NVIDIA's native assembly). PTX is a virtual ISA: if the binary carries no SASS for the running GPU's compute capability, the driver JIT-compiles the embedded PTX to SASS at load time.
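A typical build embeds SASS for the architectures you target plus PTX for forward compatibility with newer GPUs. A sketch of the invocations (file names hypothetical):

```bash
# Embed SASS for sm_80 (Ampere) plus PTX at compute_80 for JIT on future GPUs.
nvcc -O3 -gencode arch=compute_80,code=sm_80 \
         -gencode arch=compute_80,code=compute_80 \
         -o app main.cu

# Emit just the PTX to inspect the generated virtual-ISA code.
nvcc -ptx main.cu -o main.ptx
```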

When You'd Use It

Write raw CUDA C++ when you need a kernel that PyTorch / cuDNN / cuBLAS does not yet cover, when an existing library leaves 30% or more of achievable throughput on the table, or when you are fusing operations to avoid memory round-trips. Three optimization themes dominate: memory coalescing (adjacent threads should read adjacent addresses so a warp issues one wide load), shared-memory tiling (stage reused operands in shared memory to cut global-memory traffic), and warp-level primitives (__shfl_sync, __ballot_sync for cross-thread reductions without touching shared memory).
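As an example of the third theme, a warp-level sum can be done entirely in registers with __shfl_sync-family intrinsics; a sketch assuming a full, non-divergent warp:

```cuda
// Tree reduction across the 32 lanes of a warp; no shared memory needed.
// 0xffffffff = mask declaring all 32 lanes participate.
__device__ float warpSum(float v) {
    for (int offset = 16; offset > 0; offset >>= 1)
        v += __shfl_down_sync(0xffffffff, v, offset);
    return v;   // after the loop, lane 0 holds the warp's total
}
```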

For most modern projects, prefer Triton (OpenAI, 2021): a Python DSL that compiles to PTX with autotuning over block sizes. Triton produces near-cuBLAS performance on matmul and Flash-Attention-style kernels with one-tenth the code. CUTLASS (NVIDIA's C++ template library) is the right choice when you need cuBLAS-grade GEMM but with a custom epilogue (e.g. fused activation, quantization). Drop to raw CUDA only when neither covers your case.

For AMD hardware, ROCm/HIP mirrors the CUDA API closely; HIP code is mostly source-compatible (hipify is a sed-style translator). Performance parity has narrowed since MI300, but the ecosystem (libraries, profiling tools, kernel zoo) lags NVIDIA significantly as of 2026.

Notable Gotchas

Watch Out

Warp divergence costs more than people expect

An if branch where some threads in a warp take one path and others take the other forces the hardware to execute both paths sequentially while masking inactive lanes. A 50/50 split inside a hot loop can halve throughput. Restructure data so divergence happens at warp boundaries, not inside warps.
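A sketch of the two shapes (hypothetical kernels, contrived odd/even work just to show the pattern):

```cuda
// Divergent: odd and even lanes take different paths inside every warp,
// so the hardware runs both paths serially with half the lanes masked.
__global__ void divergent(float* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (i % 2 == 0) x[i] = x[i] * 2.f;          // even lanes: path A
    else            x[i] = x[i] + 1.f;          // odd lanes:  path B
}

// Restructured: the branch condition is constant across each warp
// (32 consecutive threads), so no warp executes both paths.
__global__ void warpAligned(float* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if ((i / 32) % 2 == 0) x[i] = x[i] * 2.f;   // whole warp: path A
    else                   x[i] = x[i] + 1.f;   // whole warp: path B
}
```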

Watch Out

Shared memory is per-block, not per-grid

Shared memory has block scope and dies when the block finishes. There is no kernel-wide scratchpad. For inter-block communication you must round-trip through global memory or launch a follow-up kernel. Cooperative groups (CUDA 9+) provide grid-wide sync only on devices that support it and only when the kernel is launched cooperatively.
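The usual workaround is a two-pass reduction: each block writes its partial result to global memory, and a second launch combines them. A sketch assuming 256 threads per block (names hypothetical):

```cuda
// Pass 1: each block reduces its slice into shared memory, then writes
// one partial sum to global memory. Launch with 256 threads per block.
__global__ void partialSums(const float* in, float* partials, int n) {
    __shared__ float s[256];
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;
    s[tid] = (i < n) ? in[i] : 0.f;
    __syncthreads();                              // all loads visible block-wide
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride) s[tid] += s[tid + stride];
        __syncthreads();                          // every thread must reach this
    }
    if (tid == 0) partials[blockIdx.x] = s[0];    // round-trip via global memory
}
// Pass 2: a second launch (e.g. the same kernel over `partials`) finishes the
// sum -- there is no grid-wide __syncthreads() inside one ordinary launch.
```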


Last reviewed: April 18, 2026
