
Memory · 35 min

TLB and Locality

Walking a 4-level page table can cost ~100 cycles. The TLB caches recent translations, and locality decides whether memory-intensive code stays inside that cache.

Why This Matters

A 4 KiB page covers only 512 doubles. A 64-entry L1 data TLB covers about 256 KiB of double arrays, and a 1536-entry second-level TLB covers about 6 MiB. A model checkpoint, embedding table, JVM heap, or database buffer pool is far larger.

When an address misses in the TLB, the CPU must translate it by reading page-table entries from memory. On a common x86-64 4-level page table, that can mean four dependent memory reads before the original load even starts. If those reads miss cache, the penalty is often around 100 cycles or more. Locality is the difference between paying that cost once per page and paying it once per element.

Core Definitions

Definition

Translation Lookaside Buffer

A Translation Lookaside Buffer, or TLB, is a small associative cache mapping virtual page numbers to physical page numbers plus permission bits. It is checked on instruction fetches, loads, and stores before the cache access can use the physical address.

Definition

Page-Table Walk

A page-table walk is the lookup procedure used after a TLB miss. For x86-64 with 4 KiB pages and 4-level paging, the hardware uses fields of the virtual address to read one 8-byte entry at each of the PML4, PDPT, page-directory, and page-table levels.

Definition

TLB Reach

The TLB reach is the amount of virtual memory that can be translated without evicting TLB entries, approximated by E × P, where E is the number of usable entries for a page size and P is the page size.

Definition

Address Space Identifier

An Address Space Identifier, or ASID, is a tag stored with TLB entries so translations from multiple address spaces can coexist. Without ASIDs, a context switch usually requires flushing most TLB entries because the same virtual page number can name different physical frames in different processes.

What a Page-Table Walk Reads

Take a canonical 48-bit x86-64 virtual address with 4 KiB pages:

virtual address: 0x00007f123456789a

bits 47..39  PML4 index       0x0fe = 254
bits 38..30  PDPT index       0x048 = 72
bits 29..21  PD index         0x1a2 = 418
bits 20..12  PT index         0x167 = 359
bits 11..0   page offset      0x89a = 2202

Each page-table entry is 8 bytes. If CR3 points at the physical base of the PML4 page, the walker reads:

PML4 entry address: CR3       + 254 * 8 = CR3       + 0x7f0
PDPT entry address: PDPT_base +  72 * 8 = PDPT_base + 0x240
PD entry address:   PD_base   + 418 * 8 = PD_base   + 0xd10
PT entry address:   PT_base   + 359 * 8 = PT_base   + 0xb38

A final page-table entry might contain the physical frame number and permission flags:

64-bit PTE value:       0x0000000123456067
little-endian bytes:    67 60 45 23 01 00 00 00
physical frame base:    0x0000000123456000
page offset:            0x000000000000089a
translated address:     0x000000012345689a

The low bits encode flags such as present, writable, user-accessible, accessed, and dirty. The high middle bits encode the physical page number. The TLB stores the finished translation, so the next access to any byte in the same 4 KiB virtual page avoids this walk.

On x86, the page walker is hardware. The load or store stalls while the core issues the dependent reads. On MIPS and older SPARC designs, a TLB miss traps to privileged software, and the operating system fills the TLB entry. Software handling gives the OS more control over page-table layout, but the trap path is long unless the architecture and kernel keep it small.

Sizes, Context Switches, and Tags

A typical recent CPU has separate first-level instruction and data TLBs. A common order of magnitude is an L1 data TLB with about 64 entries for 4 KiB pages, plus a unified second-level TLB with about 1024 to 2048 entries. Exact counts vary by microarchitecture and page size. The performance calculation starts with the reach:

64 entries * 4 KiB      = 256 KiB
1536 entries * 4 KiB    = 6 MiB
2048 entries * 4 KiB    = 8 MiB

The L1 dTLB is fast but tiny. The second-level TLB is slower but still far cheaper than a full page walk. A workload that repeatedly touches 3 MiB of pages can fit in a 1536-entry second-level TLB with 4 KiB pages. A workload that rotates across 32 MiB cannot.

Context switches create a separate problem. Process A and process B may both use virtual address 0x00007f1234567000, but the mappings can point to different physical frames. If TLB entries were tagged only by virtual page number, B could accidentally use A's translation after a switch.

There are two broad designs. With no address-space tags, the kernel flushes the TLB on a context switch, so the next process starts cold. With ASIDs, each TLB entry includes a process tag, and lookup requires both virtual page number and ASID to match. This preserves useful translations across switches. Modern x86 has process-context identifiers, and many RISC designs have ASIDs. Kernel details are architecture-specific, but the invariant is simple: translations are reusable only when the address-space tag matches.

Working Set Versus TLB Reach

For sequential access to a large array, each 4 KiB page yields 4096 bytes of work before the next TLB miss. For double-precision values, that is 512 elements per page. The TLB miss rate per element can be near 1/512 for a linear scan, even when the array is much larger than the TLB.

Bad strides destroy that amortization. In C, a declaration double a[R][C] is row-major, so a[i][j] and a[i][j+1] are adjacent. The address is:

\mathrm{addr}(a[i][j]) = \mathrm{base} + 8(iC + j)

For R = C = 4096, a column sweep has stride 4096 × 8 = 32768 bytes, which is 8 pages. This loop touches one double, skips 32760 bytes, and lands far away:

double sum_column_major(double *a, int n) {
    double s = 0.0;
    for (int j = 0; j < n; j++) {
        for (int i = 0; i < n; i++) {
            s += a[(size_t)i * n + j];  // stride n doubles
        }
    }
    return s;
}

For n = 4096, the matrix has 4096 * 4096 = 16,777,216 doubles and occupies 128 MiB. The row-major scan touches 128 MiB / 4 KiB = 32768 pages, so a lower bound is about 32768 data TLB misses. The column-major scan touches a different page for almost every access inside a column. Its active set is 4096 pages per column, larger than a 1536-entry second-level TLB, so it can approach one miss per element: 16,777,216 misses. That is a 512-fold difference in translation misses before considering cache misses.

The row-major version is the same arithmetic but a different order:

double sum_row_major(double *a, int n) {
    double s = 0.0;
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            s += a[(size_t)i * n + j];  // unit stride
        }
    }
    return s;
}

Huge Pages and Tiling

Huge pages increase reach by increasing P. With the same entry count, a 2 MiB page gives 512 times the reach of a 4 KiB page:

64 entries * 2 MiB      = 128 MiB
1536 entries * 2 MiB    = 3 GiB
64 entries * 1 GiB      = 64 GiB

Actual TLBs often have separate capacities for 4 KiB, 2 MiB, and 1 GiB pages, so these numbers are upper-bound mental arithmetic rather than a contract. Still, the direction of the effect is real. Databases, JVM heaps, analytics systems, and ML training or inference processes often have large arrays with stable mappings. Huge pages reduce page-table memory, reduce TLB miss pressure, and reduce page-walk traffic. They do not fix random access, memory-bandwidth limits, fragmentation, or poor NUMA placement.

Tiling attacks the same problem from the program side. It keeps the active pages small enough for both cache and TLB. A blocked traversal of a row-major matrix looks like this:

void add_blocked(double *c, const double *a, const double *b,
                 int n, int block) {
    for (int ii = 0; ii < n; ii += block) {
        for (int jj = 0; jj < n; jj += block) {
            int imax = ii + block < n ? ii + block : n;
            int jmax = jj + block < n ? jj + block : n;

            for (int i = ii; i < imax; i++) {
                for (int j = jj; j < jmax; j++) {
                    size_t k = (size_t)i * n + j;
                    c[k] = a[k] + b[k];
                }
            }
        }
    }
}

For block = 64, each tile of one matrix is 64 * 64 * 8 = 32768 bytes, or 8 pages. Three arrays need about 24 pages for the tile body, well within a 64-entry L1 dTLB. For block = 512, one tile per matrix is 2 MiB, or 512 pages; three arrays need about 1536 pages, enough to fill a typical second-level TLB. A block size chosen only from cache capacity can still overload translation.

Measuring TLB Misses

Linux perf exposes hardware counters, but event names vary by CPU. Start with the portable names and then inspect model-specific events:

perf stat -e cycles,instructions,dTLB-loads,dTLB-load-misses,dTLB-store-misses ./walk 4096
perf list | grep -i dtlb

On some Intel systems, model-specific names include events such as dtlb_load_misses.walk_completed or dtlb_load_misses.walk_active. The generic dTLB-load-misses count is usually the first diagnostic. Compare row-major and column-major binaries with the same data size, compiler flags, and CPU affinity. The useful ratio is:

\mathrm{misses\ per\ kilo\ instruction} = 1000 \times \frac{\mathrm{dTLB\ load\ misses}}{\mathrm{instructions}}

If this ratio jumps by orders of magnitude when a loop order changes, translation locality is part of the bottleneck.

Key Result

The working test for TLB locality is:

\mathrm{active\ pages} \leq \mathrm{usable\ TLB\ entries}

Equivalently, the active byte footprint should fit the TLB reach:

\mathrm{active\ bytes} \leq E \times P

This is not the same as total allocation size. A 128 MiB array can stream with good TLB behavior because only the current page and a few nearby pages are active. A 128 MiB column sweep can have poor TLB behavior because the inner loop cycles across thousands of pages before reusing any one page.

Two invariants follow. First, spatial locality within a page amortizes one translation over many bytes. Second, changing page size multiplies reach: replacing 4 KiB pages with 2 MiB pages multiplies page coverage by 512 for entries that support that page size.

The boundary case matters. Suppose the inner loop touches 900 independent 4 KiB pages. It misses the 64-entry L1 dTLB but can fit in a 1536-entry second-level TLB. Suppose it touches 3000 pages. It cannot fit in that second-level TLB, so repeated sweeps will keep evicting translations unless huge pages or tiling reduce the active page count.

Common Confusions

Watch Out

Cache locality and TLB locality are related but not identical

A loop can have decent cache behavior and still have poor TLB behavior if it touches too many pages. The reverse also occurs: a streaming scan has many cache misses on first touch, but TLB misses are amortized across all cache lines in a page.

Watch Out

Huge pages do not make random access cheap

Huge pages reduce translation misses. They do not reduce the number of cache lines fetched, the number of DRAM row conflicts, or pointer-chasing dependencies. A random lookup table may still bottleneck on cache misses after TLB misses fall.

Watch Out

A TLB miss is not always a page fault

A TLB miss means the translation is absent from the TLB. A page fault means the page-table entry denies the access, is not present, or requires OS action. Most TLB misses are handled without invoking the OS on hardware-walked architectures.

Exercises

Exercise · Core

Problem

For a machine with a 64-entry L1 dTLB and a 1536-entry second-level TLB, compute the reach for 4 KiB and 2 MiB pages. Then decide whether an inner loop that repeatedly touches 1200 distinct pages of the same size should fit in each TLB.

Exercise · Core

Problem

A C matrix double a[4096][4096] is stored row-major. Estimate the lower-bound number of 4 KiB data pages touched by a full row-major scan, then estimate the near-worst-case number of TLB-missing loads for a column-major scan when the second-level TLB has 1536 entries.

Exercise · Advanced

Problem

Given a virtual address 0x00007f123456789a on x86-64 with 4 KiB pages, compute the four 9-bit page-table indices and the 12-bit offset. If the final PTE is 0x0000000123456067, compute the translated physical address.

References

Canonical:

  • Arpaci-Dusseau and Arpaci-Dusseau, Operating Systems: Three Easy Pieces (2023), ch. 13-23, address spaces, paging, TLBs, and page replacement
  • Bryant and O'Hallaron, Computer Systems: A Programmer's Perspective, 3rd ed. (2016), ch. 9, virtual memory and address translation
  • Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 6th ed. (2017), Appendix B, memory hierarchy review including virtual memory and TLBs
  • Patterson and Hennessy, Computer Organization and Design RISC-V Edition (2017), §5.7, virtual memory in the memory hierarchy
  • Intel, Intel 64 and IA-32 Architectures Software Developer's Manual, Vol. 3A, ch. 4, paging structures and translation

Accessible:

  • Ulrich Drepper, What Every Programmer Should Know About Memory (2007), §6, cache and TLB behavior for programmers
  • Brendan Gregg, perf Examples, Linux performance counter examples including TLB events
  • Denning, The Working Set Model for Program Behavior (1968), the original working-set paper
