Why This Matters
A CUDA warp contains 32 threads, and the scheduler issues one instruction for that warp at a time. If 16 lanes take one branch and 16 lanes take another, the warp executes both paths with inactive lanes masked off.
For transformer kernels, the difference between a coalesced load and a strided load is often the difference between saturating HBM bandwidth and spending most cycles waiting. A single warp loading 32 adjacent float values reads 128 useful bytes. The same warp loading every 32nd float touches 32 separated cache sectors, wasting bandwidth before any matrix multiply starts.
Core Definitions
Thread
A CUDA thread is one execution instance of a kernel. Each thread has its own registers, program counter, local variables, and built-in indices such as threadIdx.x, blockIdx.x, and blockDim.x.
Warp
A warp is a group of 32 threads scheduled together on an NVIDIA GPU. The warp is the scheduling unit; a thread is the programming unit. Branch divergence inside one warp serializes the taken paths.
Block
A block is a group of threads that can cooperate through __shared__ memory and synchronize with __syncthreads(). One block is assigned to one streaming multiprocessor, abbreviated SM, for its lifetime.
Grid
A grid is the full set of blocks launched by one kernel call. Blocks in the same grid cannot synchronize with each other inside an ordinary kernel launch.
Kernel
A kernel is a C++ function marked __global__ and executed once per thread in the launched grid. The host launches it with kernel<<<gridDim, blockDim, sharedBytes, stream>>>(args).
CUDA Function Qualifiers
__global__ marks a kernel callable from host code and executed on the device. __device__ marks a function callable from device code and executed on the device. __host__ marks a function callable from host code and executed on the CPU. A function can be both __host__ __device__ when both versions are valid.
Execution Hierarchy by Index Arithmetic
A kernel body runs for every thread in the grid. The usual one-dimensional mapping computes one global element index per thread:
__global__ void saxpy(int n, float a, const float* x, float* y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        y[i] = a * x[i] + y[i];
    }
}
void launch(int n, float a, const float* x, float* y) {
    int block = 256;
    int grid = (n + block - 1) / block;
    saxpy<<<grid, block>>>(n, a, x, y);
}
If n = 1000 and block = 256, then grid = 4. The launch creates 4 * 256 = 1024 threads. Threads with i = 1000 through 1023 run the bounds check and perform no store. Without that check, the last 24 threads write past the end of y.
For a two-dimensional tensor, use the same idea on both axes:
__global__ void fill2d(float* a, int rows, int cols) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (row < rows && col < cols) {
        a[row * cols + col] = 1.0f;
    }
}
With blockDim = (16, 16) and gridDim = (ceil(cols/16), ceil(rows/16)), each block has 256 threads. Thread (threadIdx.y=3, threadIdx.x=5) in block (blockIdx.y=2, blockIdx.x=7) writes row 2*16+3 = 35 and column 7*16+5 = 117.
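A matching host-side launch can be sketched as below; the helper name launch_fill2d is illustrative, and the rounding on each axis mirrors the one-dimensional case.
launch_fill2d(float* d_a, int rows, int cols) {
    // 16 x 16 threads per block; round each grid dimension up so the whole
    // tensor is covered, relying on the kernel's bounds check for the tail.
    dim3 block(16, 16);
    dim3 grid((cols + block.x - 1) / block.x,
              (rows + block.y - 1) / block.y);
    fill2d<<<grid, block>>>(d_a, rows, cols);
}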
Function Qualifiers and Call Boundaries
A __global__ function is a kernel. It returns void and is launched from host code. A __device__ helper runs on the GPU and can be called by a kernel. A __host__ function runs on the CPU. Plain C++ functions are __host__ by default.
__device__ float relu_device(float x) {
    return x > 0.0f ? x : 0.0f;
}
__host__ __device__ float square_both(float x) {
    return x * x;
}
__global__ void relu_kernel(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[i] = relu_device(in[i]);
    }
}
The host cannot call relu_device directly. Device code cannot call ordinary host-only C++ that performs file I/O or uses most CPU-side library calls. Keeping this boundary explicit prevents a common porting error: copying CPU helper functions into a kernel and expecting the compiler to generate device code for them.
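A minimal host-side driver for relu_kernel might look like the following sketch; it assumes the input starts in ordinary host memory and omits error checking for brevity.
void run_relu(const float* h_in, float* h_out, int n) {
    size_t bytes = n * sizeof(float);
    float *d_in, *d_out;
    cudaMalloc(&d_in, bytes);                  // device buffers in global memory
    cudaMalloc(&d_out, bytes);
    cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice);
    int block = 256;
    int grid = (n + block - 1) / block;        // round up to cover the tail
    relu_kernel<<<grid, block>>>(d_in, d_out, n);
    cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost);  // blocks until done
    cudaFree(d_in);
    cudaFree(d_out);
}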
Memory Hierarchy and Coalesced Access
CUDA exposes a hierarchy from private, small storage to large, high-latency storage. Registers are per thread. Shared memory is per block and programmer managed. L1 and L2 caches serve loads and stores. Global memory resides in HBM and is visible across the device.
A single warp reading 32 adjacent float values has this byte pattern when the base address is 256-byte aligned:
| Lane | Element | Byte Address |
|---|---|---|
| 0 | x[0] | base + 0 |
| 1 | x[1] | base + 4 |
| 2 | x[2] | base + 8 |
| ... | ... | ... |
| 31 | x[31] | base + 124 |
The useful data occupy bytes [base, base+127], exactly 128 bytes. This is the access pattern to prefer for row-major arrays when threadIdx.x maps to the fastest-changing dimension.
Now compare a strided read:
| Lane | Element | Byte Address |
|---|---|---|
| 0 | x[0] | base + 0 |
| 1 | x[32] | base + 128 |
| 2 | x[64] | base + 256 |
| ... | ... | ... |
| 31 | x[992] | base + 3968 |
The warp still consumes 128 useful bytes, but they are spread across 3968 bytes of address range. Cache lines and memory transactions carry data that most lanes do not use. Matrix layouts and thread mappings should make adjacent lanes access adjacent words whenever possible.
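The contrast can be written as two nearly identical copy kernels; the kernel names are illustrative, and with stride = 32 adjacent lanes of copy_strided touch addresses 128 bytes apart, matching the second table.
__global__ void copy_coalesced(float* dst, const float* src, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        dst[i] = src[i];               // adjacent lanes read adjacent words
    }
}
__global__ void copy_strided(float* dst, const float* src, int n, int stride) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    long j = (long)i * stride;         // adjacent lanes are stride words apart
    if (j < n) {
        dst[i] = src[j];
    }
}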
Shared-Memory Tiling for Matrix Multiply
For C = A * B, a naive kernel loads A[row, k] and B[k, col] from global memory for every multiply. A tiled kernel loads a square tile of A and B into shared memory, reuses them for several multiply-adds, then advances to the next tile.
For TILE = 16, one block has 16 * 16 = 256 threads. Each tile of float values uses 16 * 16 * 4 = 1024 bytes. Two tiles use 2048 bytes of shared memory per block.
#define TILE 16
__global__ void matmul_tiled(const float* A, const float* B, float* C,
                             int M, int N, int K) {
    __shared__ float As[TILE][TILE];   // tile of A staged in shared memory
    __shared__ float Bs[TILE][TILE];   // tile of B staged in shared memory
    int row = blockIdx.y * TILE + threadIdx.y;   // output row this thread owns
    int col = blockIdx.x * TILE + threadIdx.x;   // output column this thread owns
    float acc = 0.0f;
    // Walk the K dimension one tile at a time.
    for (int t = 0; t < (K + TILE - 1) / TILE; ++t) {
        int a_col = t * TILE + threadIdx.x;
        int b_row = t * TILE + threadIdx.y;
        // Each thread loads one element of each tile; out-of-range loads become 0.
        As[threadIdx.y][threadIdx.x] =
            (row < M && a_col < K) ? A[row * K + a_col] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] =
            (b_row < K && col < N) ? B[b_row * N + col] : 0.0f;
        __syncthreads();   // all loads finish before any thread reads the tiles
        for (int k = 0; k < TILE; ++k) {
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        }
        __syncthreads();   // all reads finish before the tiles are overwritten
    }
    if (row < M && col < N) {
        C[row * N + col] = acc;
    }
}
The first __syncthreads() guarantees that all 256 loads into As and Bs finish before any thread consumes the tile. The second prevents a fast warp from overwriting As for tile t+1 while another warp is still reading tile t.
For one 16 by 16 output tile and one k tile, the block performs 16 * 16 * 16 = 4096 multiply-add steps after loading 2048 bytes from global memory, ignoring writeback. The point of tiling is reuse: each loaded A value is used by 16 output columns, and each loaded B value is used by 16 output rows.
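A host-side launch for matmul_tiled could be sketched as follows; the grid covers the M by N output matrix C, rounding each dimension up to a whole number of tiles.
void launch_matmul(const float* d_A, const float* d_B, float* d_C,
                   int M, int N, int K) {
    dim3 block(TILE, TILE);                  // 16 x 16 = 256 threads per block
    dim3 grid((N + TILE - 1) / TILE,         // columns of C along x
              (M + TILE - 1) / TILE);        // rows of C along y
    matmul_tiled<<<grid, block>>>(d_A, d_B, d_C, M, N, K);
}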
Divergence, Occupancy, and Scheduling Limits
Divergence occurs when lanes in one warp take different control-flow paths. This code splits a warp by lane parity:
__global__ void branchy(float* y, const float* x) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if ((threadIdx.x & 1) == 0) {
        y[i] = x[i] * 2.0f;
    } else {
        y[i] = x[i] + 7.0f;
    }
}
For a full warp, 16 lanes take the then path and 16 take the else path. The warp executes both instruction sequences with masks. This is not two independent half-warps running at the same time.
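If the branch condition is uniform across a warp, the two paths stop serializing. The sketch below applies the same two operations but keys the branch to the warp index rather than the lane index, so every lane of a given warp follows one path; whether this regrouping is acceptable depends on the data layout.
__global__ void branchy_uniform(float* y, const float* x) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int warp = threadIdx.x / 32;       // same value for all 32 lanes of a warp
    if ((warp & 1) == 0) {
        y[i] = x[i] * 2.0f;            // whole warps take the then path
    } else {
        y[i] = x[i] + 7.0f;            // whole warps take the else path
    }
}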
Occupancy is the fraction of an SM's maximum resident warps or threads that are active. It is bounded by per-SM limits on threads, blocks, registers, and shared memory. Suppose an SM supports 2048 resident threads, 32 resident blocks, 65536 registers, and 100 KiB shared memory. A kernel uses 256 threads per block, 40 registers per thread, and 20 KiB shared memory per block.
Registers per block are 256 * 40 = 10240. Register capacity permits floor(65536 / 10240) = 6 blocks. Shared memory permits floor(100 / 20) = 5 blocks. Thread capacity permits floor(2048 / 256) = 8 blocks. The block limit permits 32. The true resident block count is min(6, 5, 8, 32) = 5, or 5 * 256 = 1280 resident threads. Thread occupancy is 1280 / 2048 = 0.625.
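The same arithmetic can be written as a small host-side helper; this is a plain illustration of the limit calculation, not a substitute for the CUDA occupancy APIs.
#include <algorithm>

int resident_blocks(int threads_per_sm, int blocks_per_sm,
                    int regs_per_sm, int smem_per_sm,
                    int threads_per_block, int regs_per_thread,
                    int smem_per_block) {
    // Each limit is an integer division; the tightest one wins.
    int by_threads = threads_per_sm / threads_per_block;
    int by_regs    = regs_per_sm / (regs_per_thread * threads_per_block);
    int by_smem    = smem_per_sm / smem_per_block;
    return std::min({by_threads, by_regs, by_smem, blocks_per_sm});
}
// Worked example from the text (shared memory in KiB on both sides):
// resident_blocks(2048, 32, 65536, 100, 256, 40, 20) == 5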
High occupancy can hide memory latency, but it is not the final objective. A kernel using more registers may run faster if it avoids repeated global loads or spills.
Streams, Async Copies, and Libraries
A CUDA stream is an ordered command queue. Operations in the same stream execute in issue order. Operations in different streams can overlap when the device, memory system, and host memory allocation permit it.
cudaStream_t s0, s1;
cudaStreamCreate(&s0);
cudaStreamCreate(&s1);
// Pipeline in stream s0: copy in, compute, copy out, all in issue order.
cudaMemcpyAsync(d_x0, h_x0, bytes, cudaMemcpyHostToDevice, s0);
kernel<<<grid, block, 0, s0>>>(d_x0, d_y0, n);
cudaMemcpyAsync(h_y0, d_y0, bytes, cudaMemcpyDeviceToHost, s0);
// Independent pipeline in stream s1; it may overlap with the work in s0.
cudaMemcpyAsync(d_x1, h_x1, bytes, cudaMemcpyHostToDevice, s1);
kernel<<<grid, block, 0, s1>>>(d_x1, d_y1, n);
cudaMemcpyAsync(h_y1, d_y1, bytes, cudaMemcpyDeviceToHost, s1);
For host-device overlap, page-locked host memory allocated with cudaMallocHost or cudaHostAlloc is the usual requirement. The CPU must also avoid reading h_y0 before the corresponding stream completes, for example by calling cudaStreamSynchronize(s0).
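A sketch of the supporting host code, with h_x0 and h_y0 allocated as page-locked memory so the asynchronous copies are eligible to overlap:
float *h_x0, *h_y0;
cudaMallocHost(&h_x0, bytes);          // page-locked, usable by cudaMemcpyAsync
cudaMallocHost(&h_y0, bytes);
// ... fill h_x0, then enqueue the stream s0 work shown above ...
cudaStreamSynchronize(s0);             // wait before the CPU reads h_y0
cudaFreeHost(h_x0);
cudaFreeHost(h_y0);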
Many production kernels are not written from scratch. cuBLAS implements dense linear algebra, including GEMM used in attention and MLP layers. cuDNN implements convolution and neural-network primitives. CUB provides block-wide and device-wide primitives such as reductions and scans. Thrust provides C++ parallel algorithms over CUDA backends. Knowing the execution model helps you inspect their memory behavior and compose them without accidental synchronizations.
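As one example of leaning on a library instead of a hand-written kernel, a device-wide sum can be expressed with Thrust; this is a sketch of the library route, not part of any particular production pipeline.
#include <thrust/device_vector.h>
#include <thrust/functional.h>
#include <thrust/reduce.h>

float device_sum(const float* h_x, int n) {
    thrust::device_vector<float> d_x(h_x, h_x + n);   // host-to-device copy
    // Device-wide reduction; Thrust chooses the launch configuration.
    return thrust::reduce(d_x.begin(), d_x.end(), 0.0f, thrust::plus<float>());
}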
The Model
The CUDA mental model has four invariants.
First, kernel work is replicated over the grid: each thread executes the same kernel body with different built-in indices.
Second, scheduling happens by warp: 32 lanes share an issued instruction, with masks for inactive lanes.
Third, block-local cooperation is the synchronization boundary inside a normal kernel: __shared__ memory and __syncthreads() are scoped to one block.
Fourth, memory traffic often caps performance. A roofline bound for a kernel with F floating-point operations, B bytes transferred to and from global memory, peak compute P, and peak bandwidth W is:
attainable FLOP/s <= min(P, W * I)
The arithmetic intensity is:
I = F / B
If I is low, coalescing and reuse matter more than instruction count. If I is high, register use, instruction mix, and tensor-core mapping matter more. A naive vector add has about two reads and one write per element, or 12 bytes for one floating-point add. Its arithmetic intensity is about 1 / 12 FLOP per byte, so it is bandwidth-bound on almost any modern GPU.
Occupancy is bounded by resources per block:
resident blocks per SM = min( floor(threads per SM / threads per block), floor(registers per SM / registers per block), floor(shared memory per SM / shared memory per block), block limit per SM )
This formula does not predict runtime by itself. It tells you which resource prevents more warps from residing on the SM.
Common Confusions
Thinking a block spans several SMs
A block is resident on one SM. Threads in a block can use __shared__ memory because they are colocated for the block lifetime. If a problem needs communication across blocks, split it into multiple kernel launches, use atomics carefully, or use a library primitive.
Forgetting the tail bounds check
The launch grid is usually rounded up. With n = 1000 and block size 256, the last 24 threads are outside the valid range. Every kernel that maps threads to array elements needs a guard unless the problem size is padded and the padding is valid to read and write.
Using __syncthreads inside a conditional taken by only part of the block
All non-exited threads in a block must reach the same __syncthreads(). If only threads with threadIdx.x < 128 call it in a 256-thread block, the block can deadlock. Put the conditional inside the region between barriers, not around the barrier itself.
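A sketch of the two shapes, assuming a 256-thread block; staging and input are placeholder names. The first form can hang because half the block never reaches the barrier, the second keeps the barrier on the common path.
// Risky: only threads with threadIdx.x < 128 reach the barrier.
if (threadIdx.x < 128) {
    staging[threadIdx.x] = input[threadIdx.x];
    __syncthreads();
}
// Safe: every thread in the block reaches the same barrier.
if (threadIdx.x < 128) {
    staging[threadIdx.x] = input[threadIdx.x];
}
__syncthreads();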
Assuming shared memory removes races
Shared memory is shared by all threads in the block. If two threads write buf[0] without a defined ordering, the result is a race. Use per-thread locations, reductions with barriers, atomics where appropriate, or a CUB block primitive.
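One pattern that stays race-free is a tree reduction in shared memory with a barrier between rounds; this sketch assumes 256 threads per block and a power-of-two block size.
__global__ void block_sum(const float* in, float* out, int n) {
    __shared__ float buf[256];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    buf[threadIdx.x] = (i < n) ? in[i] : 0.0f;   // each thread owns one slot
    __syncthreads();
    // Halve the active range each round; the barrier stays outside the if.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s) {
            buf[threadIdx.x] += buf[threadIdx.x + s];
        }
        __syncthreads();
    }
    if (threadIdx.x == 0) {
        out[blockIdx.x] = buf[0];                // one partial sum per block
    }
}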
Exercises
Problem
A kernel launches with blockDim.x = 256 and gridDim.x = 40 to process n = 10000 floats. How many threads are launched, how many are out of bounds, and what global index is computed by blockIdx.x = 12, threadIdx.x = 77?
Problem
A warp loads float x[i * stride + lane] for lanes 0..31, with i = 0. Compare stride = 1 and stride = 32 when x starts at byte address 4096. Give the byte addresses for lanes 0, 1, 2, and 31.
Problem
An SM supports 2048 resident threads, 32 resident blocks, 65536 registers, and 96 KiB shared memory. A kernel uses 128 threads per block, 64 registers per thread, and 12 KiB shared memory per block. Compute the resident blocks per SM and thread occupancy.
References
Canonical:
- NVIDIA, CUDA C++ Programming Guide (2025), §§2.2, 5.2, 6.2 - execution hierarchy, memory spaces, synchronization, streams
- David B. Kirk and Wen-mei W. Hwu, Programming Massively Parallel Processors, 4th ed. (2022), ch. 3-5 - CUDA threads, memory, tiled matrix multiplication
- John L. Hennessy and David A. Patterson, Computer Architecture: A Quantitative Approach, 6th ed. (2019), §§4.4, 5.2 - SIMD/SIMT context and memory hierarchy
- Samuel Williams, Andrew Waterman, and David Patterson, “Roofline: An Insightful Visual Performance Model for Multicore Architectures,” Communications of the ACM 52(4), 2009, §§1-4 - operational intensity and bandwidth ceilings
- Ashish Vaswani et al., “Attention Is All You Need,” NeurIPS 2017, §3.2 - attention workloads that map to GEMM and memory-bound tensor operations
- Woosuk Kwon et al., “Efficient Memory Management for Large Language Model Serving with PagedAttention,” SOSP 2023, §§2-4 - GPU memory management pressure in LLM serving
Accessible:
- Mark Harris, “An Easy Introduction to CUDA C and C++,” NVIDIA Developer Blog
- Mark Harris, “Using Shared Memory in CUDA C/C++,” NVIDIA Developer Blog
- Christopher Olah et al., “Transformer Circuits Thread” (2021)
Next Topics
- /computationpath/gpu-memory-hierarchy
- /computationpath/roofline-model
- /computationpath/attention-kernels
- /computationpath/transformer-inference-serving
- /computationpath/cuda-profiling-basics