Virtual Memory: Pages, Page Tables, Faults

Why This Matters

A 64-bit load such as mov (%rdi), %rax does not send %rdi directly to DRAM. The core splits the virtual address into page-table indexes and an offset, the MMU checks permissions, and only then emits a physical address. If the mapping is missing, the instruction traps into the kernel instead of returning data.

This indirection buys four concrete properties. A process cannot read another process's frames without a mapping. The same binary can run at different physical locations. The system can promise more virtual memory than DRAM by paging to storage. fork() can copy a process by sharing its pages until one side writes.

Core Definitions

Definition

Virtual address

A virtual address is the address named by a CPU instruction running in a process. In 4-level x86-64 paging, common user addresses use 48 meaningful bits: 9 bits for each of PGD, PUD, PMD, and PTE indexes, followed by a 12-bit page offset.

Definition

Page and frame

A page is a fixed-size block of virtual memory. The usual base page size is 4096 bytes. A frame is the physical-memory block that can hold one page. Huge pages use larger translation units, commonly 2 MiB or 1 GiB on x86-64.

Definition

Page table entry

A page table entry, or PTE, records a physical frame number plus control bits such as present, writable, user-accessible, accessed, dirty, and no-execute. The MMU consumes these bits on every translation.

Definition

Page fault

A page fault is a synchronous exception raised when a virtual address cannot be translated with the required permission. The faulting instruction has not completed; after the kernel repairs the mapping, the CPU retries the instruction.

Address Spaces, Pages, and Frames

Each process gets a private virtual address space. "Private" means the page tables selected while the process runs describe its view of memory. Two processes may use the same virtual address, for example 0x7fff00001000, while those addresses map to different frames.

A 4096-byte page has a 12-bit offset because $2^{12}=4096$ . For any virtual address $a$ and page size $P=4096$ :

$\text{VPN}=\lfloor a/P \rfloor$

$\text{offset}=a \bmod P$

For a = 0x00007fff1a2b3c48, the offset is 0xc48. The byte addressed is 3144 bytes from the start of its virtual page.

Huge pages reduce translation overhead. A 2 MiB page has a 21-bit offset, and a 1 GiB page has a 30-bit offset. The tradeoff is internal fragmentation and coarser protection. A 2 MiB mapping for a tensor buffer saves 511 lower-level PTEs compared with 512 separate 4 KiB pages, but a single dirty byte pins the larger unit as modified.

Virtual memory also separates allocation from residency. A process can reserve a 1 GiB sparse address range while only touching 12 KiB. Only the touched pages need frames.

#include <sys/mman.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
    size_t n = 1UL << 30;              // 1 GiB virtual range
    uint8_t *p = mmap(NULL, n, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) return 1;

    p[0] = 1;                          // faults one 4 KiB page
    p[4096] = 2;                       // faults a second page
    p[n - 1] = 3;                      // faults a third page
    printf("%u %u %u\n", p[0], p[4096], p[n - 1]);
    munmap(p, n);
}

The address range is 1 GiB. The resident physical memory after these stores can be three 4 KiB frames plus page-table overhead, not 1 GiB.

Four-Level Page Tables on x86-64

In common 4-level x86-64 paging, a 48-bit canonical virtual address is divided as follows:

Bits	Name	Meaning
47..39	PGD index	top-level page-directory entry
38..30	PUD index	second-level directory entry
29..21	PMD index	third-level directory entry
20..12	PTE index	leaf page-table entry
11..0	offset	byte within a 4 KiB page

Linux often names these levels PGD, PUD, PMD, and PTE. Intel manuals use PML4, page-directory-pointer table, page directory, and page table. Five-level paging adds one level above PGD on machines configured for it; the rest of this page stays with 4-level paging.

Worked split for 0x00007fff1a2b3c48:

Field	Value
PGD	`0x0ff` = 255
PUD	`0x1fc` = 508
PMD	`0x0d1` = 209
PTE	`0x0b3` = 179
offset	`0xc48` = 3144

The hardware walk is a chain of memory reads. The CR3 register points to the current top-level table. The MMU indexes entry 255 there, obtains the frame containing the PUD, indexes entry 508, and continues until the leaf PTE gives the target frame.

; User instruction with a virtual address in rdi.
; The load retires only after address translation and permission checks.
mov    (%rdi), %rax

If the final PTE names physical frame number 0x12345, then the physical byte address is:

$0x12345 \times 0x1000 + 0xc48 = 0x12345c48$

Page Table Entries and Permission Bits

A 64-bit x86-64 leaf PTE contains a physical frame number in the middle bits and flags around it. Ignoring bits not used here, the relevant fields are:

Bit	Name	Meaning when set
0	P	mapping is present
1	RW	writes are allowed
2	US	user mode may access
5	A	page has been accessed
6	D	page has been written
63	NX	instruction fetch is disallowed

For a present, writable, user page whose physical frame number is 0x12345, with accessed, dirty, and NX set, the PTE value is:

$0x8000000012345067$

Little-endian bytes in memory:

Address offset	Byte
+0	`67`
+1	`50`
+2	`34`
+3	`12`
+4	`00`
+5	`00`
+6	`00`
+7	`80`

The low byte 0x67 is not part of the frame number. It is flags: present 0x01, writable 0x02, user 0x04, accessed 0x20, and dirty 0x40. The frame address is recovered by masking away the low 12 bits and the high flag bits.

Protection is checked during translation. If user code fetches an instruction from a page with NX set, the CPU raises a page fault with an instruction-fetch cause. If user code writes to a present read-only page, the fault is a protection fault, not a missing-page fault.

Page Faults, Demand Paging, and `mmap`

A page fault transfers control to the kernel with the faulting virtual address and an error code. The kernel looks up the virtual memory area, often called a VMA, that covers the address. That VMA records permissions and backing: anonymous zero-fill memory, a file range, shared memory, or a device mapping.

A minor fault does not need disk I/O. Examples include mapping a fresh zero page for anonymous memory, allocating a private frame on a copy-on-write write fault, or installing a PTE for a file page already in the page cache. A major fault waits for storage because the needed file or swapped page is not resident.

mmap() is the common interface for both file-backed and anonymous mappings.

#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

void demo(size_t n) {
    int fd = open("weights.bin", O_RDONLY);

    float *weights = mmap(NULL, n, PROT_READ, MAP_PRIVATE, fd, 0);
    float *scratch = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    volatile float x = weights[0];     // file-backed page fault possible
    scratch[0] = x;                    // anonymous zero-fill fault possible

    munmap((void *)weights, n);
    munmap(scratch, 4096);
    close(fd);
}

The same fault handler structure covers both accesses. For weights[0], the kernel can read bytes from weights.bin into a frame and install a read-only PTE. For scratch[0], it can allocate a zeroed frame and install a writable anonymous PTE.

`fork()` and Copy-on-Write

A naive fork() would copy every writable page in the parent. Copy-on-write avoids that copy. The kernel duplicates the parent's page tables, but writable private pages are marked read-only in both parent and child. Both PTEs point at the same physical frame, and a per-frame reference count records the sharing.

State transition for one 4 KiB heap page:

Time	Parent PTE	Child PTE	Frame refs
before `fork()`	frame `F`, writable	absent	1
after `fork()`	frame `F`, read-only COW	frame `F`, read-only COW	2
child writes	frame `F`, read-only COW	frame `G`, writable	1 each

The write in the child raises a protection fault because the PTE is present but read-only. The kernel sees the COW mark, allocates frame G, copies 4096 bytes from F to G, changes the child's PTE to writable, and retries the store.

#include <sys/wait.h>
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    int *x = malloc(4096);
    x[0] = 7;

    pid_t pid = fork();
    if (pid == 0) {
        x[0] = 9;                      // COW fault in child
        _exit(x[0]);
    }

    wait(NULL);
    printf("%d\n", x[0]);              // still 7 in parent
}

The parent prints 7. The child wrote a private copy, not the parent's frame.

Working Sets and Replacement

If all resident pages fit in DRAM, page faults after warmup are mostly minor. If the active pages exceed available frames, the OS must evict pages and later fault them back. The working set of a process over a window $\Delta$ is the set of pages referenced in the last $\Delta$ time units or references:

$W(t,\Delta)=\{p \mid p\text{ was referenced during }(t-\Delta,t]\}$

Exact least-recently-used replacement is expensive because it requires a total order of page references. Systems approximate it using hardware accessed bits. The clock algorithm keeps frames in a circular list. On a scan, an accessed page gets a second chance: clear its accessed bit and advance. A page with accessed bit 0 is a victim.

struct frame {
    int page;
    unsigned accessed;
    unsigned dirty;
};

int clock_victim(struct frame *f, int n, int *hand) {
    for (;;) {
        if (f[*hand].accessed == 0) {
            int v = *hand;
            *hand = (*hand + 1) % n;
            return v;
        }
        f[*hand].accessed = 0;
        *hand = (*hand + 1) % n;
    }
}

Example with four frames, hand at frame 0:

Frame	Page	Accessed	Dirty
0	10	1	0
1	11	1	1
2	12	0	0
3	13	1	0

The scan clears frame 0, clears frame 1, then selects frame 2. A dirty victim needs writeback before reuse if it is file-backed shared memory or swapped anonymous memory.

Key Result

The central invariant is per-process translation plus permission checking:

$\text{physical address}=\text{PFN} \times \text{page size}+\text{offset}$

This equation is valid only if the selected PTE is present and permits the access type from the current privilege level. Otherwise the architectural result is a page fault, not a physical address.

A useful cost model separates the steady-state hit path from the fault path. If the translation is cached in the TLB, the data access proceeds without a page-table walk. If the TLB misses but the page is resident, the hardware walk or kernel refill adds memory references. If the page is absent, the page fault cost ranges from a few microseconds for a minor fault to storage latency for a major fault. The next page in this module treats the TLB path directly.

Two invariants explain most behavior seen by systems programmers. First, virtual allocation does not imply physical residency. Second, protection bits are part of the mapping, not properties of the virtual address alone. The same virtual address value can be readable in one process, unmapped in another, and executable in a third.

Common Confusions

Watch Out

A large virtual allocation is not the same as using that much RAM

mmap(NULL, 1UL << 40, ...) can reserve 1 TiB of address space on a 64-bit process. Physical frames are charged as pages become resident. Overcommit policy, address-space limits, and cgroup limits still matter, but the reservation itself is not a 1 TiB DRAM copy.

Watch Out

A page fault is not always an error

A fault on a valid VMA is often the normal demand-paging path. The kernel allocates or reads a frame, installs a PTE, and restarts the instruction. An invalid address, such as a null dereference with no mapped page, becomes a signal such as SIGSEGV.

Watch Out

Copy-on-write does not copy at fork time

The copy happens on the first write to a shared private page. Read-only pages and never-written heap pages can stay shared between parent and child for their whole lifetime.

Exercises

ExerciseCore

Problem

Split the virtual address 0x00007fff1a2b3c48 into 4-level x86-64 indexes and offset. If the leaf PTE has PFN 0x12345, compute the physical address.

ExerciseCore

Problem

A parent has one writable private page in frame 0xabc. It calls fork(). The child writes byte offset 0x20 in that page. Describe the PTE states before and after the write.

ExerciseAdvanced

Problem

Run the clock algorithm on four frames with accessed bits [1, 1, 0, 1], dirty bits [0, 1, 0, 0], and hand at index 0. Which frame is evicted, what accessed bits remain, and does the victim require writeback?

References

Canonical:

Arpaci-Dusseau and Arpaci-Dusseau, Operating Systems: Three Easy Pieces (2023), ch. 13-23, address spaces, paging, TLBs, swapping, and replacement
Bryant and O'Hallaron, Computer Systems: A Programmer's Perspective (3rd ed., 2016), ch. 9, virtual memory and mmap
Silberschatz, Galvin, and Gagne, Operating System Concepts (10th ed., 2018), ch. 9-10, main memory and virtual memory
Tanenbaum and Bos, Modern Operating Systems (4th ed., 2015), ch. 3, memory management and page replacement
Intel, Intel 64 and IA-32 Architectures Software Developer's Manual, vol. 3A, ch. 4, IA-32e paging structures and PTE bits

Accessible:

Ulrich Drepper, What Every Programmer Should Know About Memory
Linux kernel documentation, Memory Management
Pavel Emelyanov and Mike Rapoport, Linux documentation on page tables and virtual memory areas

Next Topics

/computationpath/translation-lookaside-buffer
/computationpath/processes
/computationpath/memory-allocation
/topics/cache-locality

Why This Matters

Core Definitions

Address Spaces, Pages, and Frames

Four-Level Page Tables on x86-64

Page Table Entries and Permission Bits

Page Faults, Demand Paging, and mmap

fork() and Copy-on-Write

Working Sets and Replacement

Key Result

Common Confusions

Exercises

References

Next Topics

Page Faults, Demand Paging, and `mmap`

`fork()` and Copy-on-Write