
CUDA Programming Fundamentals

Reference for CUDA C++ kernels: thread hierarchy, shared memory, coalescing, warp primitives, NVCC, and when to drop to Triton or CUTLASS instead.


What It Is

CUDA (Compute Unified Device Architecture) is NVIDIA's parallel computing platform, released in 2007. It exposes the GPU as a massively parallel coprocessor through a C++ language extension plus a runtime library. The compiled output runs on NVIDIA hardware only.

A CUDA program splits into host code (runs on the CPU, written in normal C++) and device code (runs on the GPU, written in CUDA C++ inside functions tagged with __global__ or __device__). The function-qualifier set:

  • __global__: a kernel callable from host, executed on device, returns void.
  • __device__: callable from device only, executed on device.
  • __host__: callable from host only, executed on host (the default; useful when combined with __device__ to compile the same function for both).
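A minimal sketch of how the qualifiers compose (all function names here are hypothetical, chosen for illustration):

```cuda
__device__ float square(float x) { return x * x; }   // device-only helper

__host__ __device__ float clamp01(float x) {         // compiled for both CPU and GPU
    return x < 0.f ? 0.f : (x > 1.f ? 1.f : x);
}

// Kernel: callable from host, runs on the device, must return void.
__global__ void squareKernel(float* out, const float* in, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
    if (i < n) out[i] = clamp01(square(in[i]));
}
```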

Device code is launched with the triple-chevron syntax kernel<<<grid, block>>>(args), where grid is the number of thread blocks and block is the number of threads per block. All threads of a block execute on the same Streaming Multiprocessor (SM), share a fast on-chip scratchpad called shared memory (carved from the same SRAM as L1 on recent architectures), and can synchronize via __syncthreads(). Within a block, threads are scheduled in groups of 32 called warps; a warp issues one instruction for all 32 lanes (SIMT execution), so divergent branches inside a warp serialize.
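The standard launch idiom rounds the grid size up so every element gets a thread, then guards the tail inside the kernel. A sketch (kernel and helper names hypothetical):

```cuda
__global__ void scaleKernel(float* data, float s, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= s;                  // guard: the grid may overshoot n
}

void launchScale(float* d_data, float s, int n) {
    int block = 256;                          // threads per block; keep it a multiple of the warp size (32)
    int grid  = (n + block - 1) / block;      // ceil(n / block) thread blocks
    scaleKernel<<<grid, block>>>(d_data, s, n);
}
```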

NVCC is the CUDA compiler driver; it splits a .cu file into host C++ (passed to gcc/clang/MSVC) and device code compiled to PTX and/or SASS (NVIDIA's native assembly). PTX is a virtual ISA: if the binary carries no SASS for the running GPU's compute capability, the driver JIT-compiles the embedded PTX to SASS at load time.
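A typical build embeds SASS for the architectures you target plus PTX for forward compatibility with newer GPUs. A sketch of the invocations (file names hypothetical):

```bash
# Embed SASS for sm_80 (Ampere) plus PTX at compute_80 for JIT on future GPUs.
nvcc -O3 -gencode arch=compute_80,code=sm_80 \
         -gencode arch=compute_80,code=compute_80 \
         -o app main.cu

# Emit just the PTX to inspect the generated virtual-ISA code.
nvcc -ptx main.cu -o main.ptx
```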

When You'd Use It

Write raw CUDA C++ when you need a kernel that PyTorch / cuDNN / cuBLAS does not yet cover, when an existing library leaves 30% or more of achievable throughput on the table, or when you are fusing operations to avoid memory round-trips. Three optimization themes dominate: memory coalescing (adjacent threads should read adjacent addresses so a warp issues one wide load), shared-memory tiling (stage reused operands in shared memory to cut global-memory traffic), and warp-level primitives (__shfl_sync, __ballot_sync for cross-thread reductions without touching shared memory).
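As an example of the third theme, a warp-level sum can be done entirely in registers with __shfl_sync-family intrinsics; a sketch assuming a full, non-divergent warp:

```cuda
// Tree reduction across the 32 lanes of a warp; no shared memory needed.
// 0xffffffff = mask declaring all 32 lanes participate.
__device__ float warpSum(float v) {
    for (int offset = 16; offset > 0; offset >>= 1)
        v += __shfl_down_sync(0xffffffff, v, offset);
    return v;   // after the loop, lane 0 holds the warp's total
}
```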

For most modern projects, prefer Triton (OpenAI, 2021): a Python DSL that compiles to PTX with autotuning over block sizes. Triton produces near-cuBLAS performance on matmul and Flash-Attention-style kernels with one-tenth the code. CUTLASS (NVIDIA's C++ template library) is the right choice when you need cuBLAS-grade GEMM but with a custom epilogue (e.g. fused activation, quantization). Drop to raw CUDA only when neither covers your case.

For AMD hardware, ROCm/HIP mirrors the CUDA API closely; HIP code is mostly source-compatible (hipify is a sed-style translator). Performance parity has narrowed since MI300, but the ecosystem (libraries, profiling tools, kernel zoo) lags NVIDIA significantly as of 2026.

Notable Gotchas

Watch Out

Warp divergence costs more than people expect

An if branch where some threads in a warp take one path and others take the other forces the hardware to execute both paths sequentially while masking inactive lanes. A 50/50 split inside a hot loop can halve throughput. Restructure data so divergence happens at warp boundaries, not inside warps.
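A sketch of the two shapes (hypothetical kernels, contrived odd/even work just to show the pattern):

```cuda
// Divergent: odd and even lanes take different paths inside every warp,
// so the hardware runs both paths serially with half the lanes masked.
__global__ void divergent(float* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (i % 2 == 0) x[i] = x[i] * 2.f;          // even lanes: path A
    else            x[i] = x[i] + 1.f;          // odd lanes:  path B
}

// Restructured: the branch condition is constant across each warp
// (32 consecutive threads), so no warp executes both paths.
__global__ void warpAligned(float* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if ((i / 32) % 2 == 0) x[i] = x[i] * 2.f;   // whole warp: path A
    else                   x[i] = x[i] + 1.f;   // whole warp: path B
}
```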

Watch Out

Shared memory is per-block, not per-grid

Shared memory has block scope and dies when the block finishes. There is no kernel-wide scratchpad. For inter-block communication you must round-trip through global memory or launch a follow-up kernel. Cooperative groups (CUDA 9+) provide grid-wide sync only on devices that support it and only when the kernel is launched cooperatively.
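The usual workaround is a two-pass reduction: each block writes its partial result to global memory, and a second launch combines them. A sketch assuming 256 threads per block (names hypothetical):

```cuda
// Pass 1: each block reduces its slice into shared memory, then writes
// one partial sum to global memory. Launch with 256 threads per block.
__global__ void partialSums(const float* in, float* partials, int n) {
    __shared__ float s[256];
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;
    s[tid] = (i < n) ? in[i] : 0.f;
    __syncthreads();                              // all loads visible block-wide
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride) s[tid] += s[tid + stride];
        __syncthreads();                          // every thread must reach this
    }
    if (tid == 0) partials[blockIdx.x] = s[0];    // round-trip via global memory
}
// Pass 2: a second launch (e.g. the same kernel over `partials`) finishes the
// sum -- there is no grid-wide __syncthreads() inside one ordinary launch.
```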


Last reviewed: April 18, 2026
