Skip to main content

Memory · 45 min

Virtual Memory

Each process sees a flat private 64-bit address space. The kernel and MMU map virtual pages to physical frames, faulting pages from disk when needed.

Why This Matters

A 64-bit load such as mov (%rdi), %rax does not send %rdi directly to DRAM. The core splits the virtual address into page-table indexes and an offset, the MMU checks permissions, and only then emits a physical address. If the mapping is missing, the instruction traps into the kernel instead of returning data.

This indirection buys four concrete properties. A process cannot read another process's frames without a mapping. The same binary can run at different physical locations. The system can promise more virtual memory than DRAM by paging to storage. fork() can copy a process by sharing its pages until one side writes.

Core Definitions

Definition

Virtual address

A virtual address is the address named by a CPU instruction running in a process. In 4-level x86-64 paging, common user addresses use 48 meaningful bits: 9 bits for each of PGD, PUD, PMD, and PTE indexes, followed by a 12-bit page offset.

Definition

Page and frame

A page is a fixed-size block of virtual memory. The usual base page size is 4096 bytes. A frame is the physical-memory block that can hold one page. Huge pages use larger translation units, commonly 2 MiB or 1 GiB on x86-64.

Definition

Page table entry

A page table entry, or PTE, records a physical frame number plus control bits such as present, writable, user-accessible, accessed, dirty, and no-execute. The MMU consumes these bits on every translation.

Definition

Page fault

A page fault is a synchronous exception raised when a virtual address cannot be translated with the required permission. The faulting instruction has not completed; after the kernel repairs the mapping, the CPU retries the instruction.

Address Spaces, Pages, and Frames

Each process gets a private virtual address space. "Private" means the page tables selected while the process runs describe its view of memory. Two processes may use the same virtual address, for example 0x7fff00001000, while those addresses map to different frames.

A 4096-byte page has a 12-bit offset because 212=40962^{12}=4096. For any virtual address aa and page size P=4096P=4096:

VPN=a/P\text{VPN}=\lfloor a/P \rfloor

offset=amodP\text{offset}=a \bmod P

For a = 0x00007fff1a2b3c48, the offset is 0xc48. The byte addressed is 3144 bytes from the start of its virtual page.

Huge pages reduce translation overhead. A 2 MiB page has a 21-bit offset, and a 1 GiB page has a 30-bit offset. The tradeoff is internal fragmentation and coarser protection. A 2 MiB mapping for a tensor buffer saves 511 lower-level PTEs compared with 512 separate 4 KiB pages, but a single dirty byte pins the larger unit as modified.

Virtual memory also separates allocation from residency. A process can reserve a 1 GiB sparse address range while only touching 12 KiB. Only the touched pages need frames.

#include <sys/mman.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
    size_t n = 1UL << 30;              // 1 GiB virtual range
    uint8_t *p = mmap(NULL, n, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) return 1;

    p[0] = 1;                          // faults one 4 KiB page
    p[4096] = 2;                       // faults a second page
    p[n - 1] = 3;                      // faults a third page
    printf("%u %u %u\n", p[0], p[4096], p[n - 1]);
    munmap(p, n);
}

The address range is 1 GiB. The resident physical memory after these stores can be three 4 KiB frames plus page-table overhead, not 1 GiB.

Four-Level Page Tables on x86-64

In common 4-level x86-64 paging, a 48-bit canonical virtual address is divided as follows:

BitsNameMeaning
47..39PGD indextop-level page-directory entry
38..30PUD indexsecond-level directory entry
29..21PMD indexthird-level directory entry
20..12PTE indexleaf page-table entry
11..0offsetbyte within a 4 KiB page

Linux often names these levels PGD, PUD, PMD, and PTE. Intel manuals use PML4, page-directory-pointer table, page directory, and page table. Five-level paging adds one level above PGD on machines configured for it; the rest of this page stays with 4-level paging.

Worked split for 0x00007fff1a2b3c48:

FieldValue
PGD0x0ff = 255
PUD0x1fc = 508
PMD0x0d1 = 209
PTE0x0b3 = 179
offset0xc48 = 3144

The hardware walk is a chain of memory reads. The CR3 register points to the current top-level table. The MMU indexes entry 255 there, obtains the frame containing the PUD, indexes entry 508, and continues until the leaf PTE gives the target frame.

; User instruction with a virtual address in rdi.
; The load retires only after address translation and permission checks.
mov    (%rdi), %rax

If the final PTE names physical frame number 0x12345, then the physical byte address is:

0x12345×0x1000+0xc48=0x12345c480x12345 \times 0x1000 + 0xc48 = 0x12345c48

Page Table Entries and Permission Bits

A 64-bit x86-64 leaf PTE contains a physical frame number in the middle bits and flags around it. Ignoring bits not used here, the relevant fields are:

BitNameMeaning when set
0Pmapping is present
1RWwrites are allowed
2USuser mode may access
5Apage has been accessed
6Dpage has been written
63NXinstruction fetch is disallowed

For a present, writable, user page whose physical frame number is 0x12345, with accessed, dirty, and NX set, the PTE value is:

0x80000000123450670x8000000012345067

Little-endian bytes in memory:

Address offsetByte
+067
+150
+234
+312
+400
+500
+600
+780

The low byte 0x67 is not part of the frame number. It is flags: present 0x01, writable 0x02, user 0x04, accessed 0x20, and dirty 0x40. The frame address is recovered by masking away the low 12 bits and the high flag bits.

Protection is checked during translation. If user code fetches an instruction from a page with NX set, the CPU raises a page fault with an instruction-fetch cause. If user code writes to a present read-only page, the fault is a protection fault, not a missing-page fault.

Page Faults, Demand Paging, and mmap

A page fault transfers control to the kernel with the faulting virtual address and an error code. The kernel looks up the virtual memory area, often called a VMA, that covers the address. That VMA records permissions and backing: anonymous zero-fill memory, a file range, shared memory, or a device mapping.

A minor fault does not need disk I/O. Examples include mapping a fresh zero page for anonymous memory, allocating a private frame on a copy-on-write write fault, or installing a PTE for a file page already in the page cache. A major fault waits for storage because the needed file or swapped page is not resident.

mmap() is the common interface for both file-backed and anonymous mappings.

#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

void demo(size_t n) {
    int fd = open("weights.bin", O_RDONLY);

    float *weights = mmap(NULL, n, PROT_READ, MAP_PRIVATE, fd, 0);
    float *scratch = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    volatile float x = weights[0];     // file-backed page fault possible
    scratch[0] = x;                    // anonymous zero-fill fault possible

    munmap((void *)weights, n);
    munmap(scratch, 4096);
    close(fd);
}

The same fault handler structure covers both accesses. For weights[0], the kernel can read bytes from weights.bin into a frame and install a read-only PTE. For scratch[0], it can allocate a zeroed frame and install a writable anonymous PTE.

fork() and Copy-on-Write

A naive fork() would copy every writable page in the parent. Copy-on-write avoids that copy. The kernel duplicates the parent's page tables, but writable private pages are marked read-only in both parent and child. Both PTEs point at the same physical frame, and a per-frame reference count records the sharing.

State transition for one 4 KiB heap page:

TimeParent PTEChild PTEFrame refs
before fork()frame F, writableabsent1
after fork()frame F, read-only COWframe F, read-only COW2
child writesframe F, read-only COWframe G, writable1 each

The write in the child raises a protection fault because the PTE is present but read-only. The kernel sees the COW mark, allocates frame G, copies 4096 bytes from F to G, changes the child's PTE to writable, and retries the store.

#include <sys/wait.h>
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    int *x = malloc(4096);
    x[0] = 7;

    pid_t pid = fork();
    if (pid == 0) {
        x[0] = 9;                      // COW fault in child
        _exit(x[0]);
    }

    wait(NULL);
    printf("%d\n", x[0]);              // still 7 in parent
}

The parent prints 7. The child wrote a private copy, not the parent's frame.

Working Sets and Replacement

If all resident pages fit in DRAM, page faults after warmup are mostly minor. If the active pages exceed available frames, the OS must evict pages and later fault them back. The working set of a process over a window Δ\Delta is the set of pages referenced in the last Δ\Delta time units or references:

W(t,Δ)={pp was referenced during (tΔ,t]}W(t,\Delta)=\{p \mid p\text{ was referenced during }(t-\Delta,t]\}

Exact least-recently-used replacement is expensive because it requires a total order of page references. Systems approximate it using hardware accessed bits. The clock algorithm keeps frames in a circular list. On a scan, an accessed page gets a second chance: clear its accessed bit and advance. A page with accessed bit 0 is a victim.

struct frame {
    int page;
    unsigned accessed;
    unsigned dirty;
};

int clock_victim(struct frame *f, int n, int *hand) {
    for (;;) {
        if (f[*hand].accessed == 0) {
            int v = *hand;
            *hand = (*hand + 1) % n;
            return v;
        }
        f[*hand].accessed = 0;
        *hand = (*hand + 1) % n;
    }
}

Example with four frames, hand at frame 0:

FramePageAccessedDirty
01010
11111
21200
31310

The scan clears frame 0, clears frame 1, then selects frame 2. A dirty victim needs writeback before reuse if it is file-backed shared memory or swapped anonymous memory.

Key Result

The central invariant is per-process translation plus permission checking:

physical address=PFN×page size+offset\text{physical address}=\text{PFN} \times \text{page size}+\text{offset}

This equation is valid only if the selected PTE is present and permits the access type from the current privilege level. Otherwise the architectural result is a page fault, not a physical address.

A useful cost model separates the steady-state hit path from the fault path. If the translation is cached in the TLB, the data access proceeds without a page-table walk. If the TLB misses but the page is resident, the hardware walk or kernel refill adds memory references. If the page is absent, the page fault cost ranges from a few microseconds for a minor fault to storage latency for a major fault. The next page in this module treats the TLB path directly.

Two invariants explain most behavior seen by systems programmers. First, virtual allocation does not imply physical residency. Second, protection bits are part of the mapping, not properties of the virtual address alone. The same virtual address value can be readable in one process, unmapped in another, and executable in a third.

Common Confusions

Watch Out

A large virtual allocation is not the same as using that much RAM

mmap(NULL, 1UL << 40, ...) can reserve 1 TiB of address space on a 64-bit process. Physical frames are charged as pages become resident. Overcommit policy, address-space limits, and cgroup limits still matter, but the reservation itself is not a 1 TiB DRAM copy.

Watch Out

A page fault is not always an error

A fault on a valid VMA is often the normal demand-paging path. The kernel allocates or reads a frame, installs a PTE, and restarts the instruction. An invalid address, such as a null dereference with no mapped page, becomes a signal such as SIGSEGV.

Watch Out

Copy-on-write does not copy at fork time

The copy happens on the first write to a shared private page. Read-only pages and never-written heap pages can stay shared between parent and child for their whole lifetime.

Exercises

ExerciseCore

Problem

Split the virtual address 0x00007fff1a2b3c48 into 4-level x86-64 indexes and offset. If the leaf PTE has PFN 0x12345, compute the physical address.

ExerciseCore

Problem

A parent has one writable private page in frame 0xabc. It calls fork(). The child writes byte offset 0x20 in that page. Describe the PTE states before and after the write.

ExerciseAdvanced

Problem

Run the clock algorithm on four frames with accessed bits [1, 1, 0, 1], dirty bits [0, 1, 0, 0], and hand at index 0. Which frame is evicted, what accessed bits remain, and does the victim require writeback?

References

Canonical:

  • Arpaci-Dusseau and Arpaci-Dusseau, Operating Systems: Three Easy Pieces (2023), ch. 13-23, address spaces, paging, TLBs, swapping, and replacement
  • Bryant and O'Hallaron, Computer Systems: A Programmer's Perspective (3rd ed., 2016), ch. 9, virtual memory and mmap
  • Silberschatz, Galvin, and Gagne, Operating System Concepts (10th ed., 2018), ch. 9-10, main memory and virtual memory
  • Tanenbaum and Bos, Modern Operating Systems (4th ed., 2015), ch. 3, memory management and page replacement
  • Intel, Intel 64 and IA-32 Architectures Software Developer's Manual, vol. 3A, ch. 4, IA-32e paging structures and PTE bits

Accessible:

  • Ulrich Drepper, What Every Programmer Should Know About Memory
  • Linux kernel documentation, Memory Management
  • Pavel Emelyanov and Mike Rapoport, Linux documentation on page tables and virtual memory areas

Next Topics

  • /computationpath/translation-lookaside-buffer
  • /computationpath/processes
  • /computationpath/memory-allocation
  • /topics/cache-locality