Memory · 40 min

Memory-Mapped I/O and Mmap

Two uses of one abstraction: device registers as load/store addresses, and files paged into virtual memory by mmap().

Why This Matters

A 32-bit store to physical address 0x3F20_001C on a Raspberry Pi writes a GPIO set register. A load from a different address can read a network card status register. No syscall is involved. The CPU issues a load or store, but the memory controller routes the transaction to a device, not DRAM.

The same load/store interface is reused by mmap(): a byte at p[4096] may not exist in RAM until the first access traps into the kernel, which reads the file page, installs a page-table entry, and restarts the instruction. This unifies file I/O, shared memory, and large heap mappings with the virtual-memory machinery from paging.

Core Definitions

Definition

Memory-Mapped I/O

Memory-mapped I/O, abbreviated MMIO, maps device registers into the processor physical address space. CPU loads and stores to those physical addresses are converted by the memory system into device transactions rather than DRAM reads and writes.

Definition

Port-Mapped I/O

Port-mapped I/O gives devices a separate I/O address space. On x86, the in and out instructions access this space, so inb $0x60, %al reads one byte from I/O port 0x60, not from memory address 0x60.

Definition

mmap()

mmap() creates a virtual-memory area in a process address space. The area maps either a file, a shared object, or anonymous zero-filled memory. Page-table entries are installed lazily; the first access to an unmapped page raises a page fault.

Definition

Copy-on-Write

Copy-on-write maps the same physical page into more than one address space as read-only. On a write fault, the kernel allocates a new physical page, copies the old bytes, and remaps the faulting process to the private copy.

Device Registers as Physical Addresses

An MMIO device exposes its registers at fixed physical addresses. A register is typically 32 or 64 bits wide and may have side effects on read or write: reading a UART DATA register, for example, consumes the received byte.

A small UART might expose this layout:

physical base: 0x1000_0000

offset  bytes  name      meaning
0x00    4      DATA      read receives byte, write transmits byte
0x04    4      STATUS    bit 0 = rx ready, bit 5 = tx empty
0x08    4      CTRL      bit 0 = enable, bit 1 = interrupt enable
0x0C    4      BAUD      baud-rate divisor

A transmit loop writes only when STATUS[5] is set. If the CPU executes a 32-bit store to 0x1000_0000, the device sees a write to DATA.

#include <stdint.h>

#define UART_BASE   0x10000000u
#define UART_DATA   (*(volatile uint32_t *)(UART_BASE + 0x00))
#define UART_STATUS (*(volatile uint32_t *)(UART_BASE + 0x04))
#define TX_EMPTY    (1u << 5)

void uart_putc(char c) {
    while ((UART_STATUS & TX_EMPTY) == 0) {
        /* spin until hardware reports space */
    }
    UART_DATA = (uint32_t)(unsigned char)c;
}

The volatile qualifier matters because it forces the compiler to emit the load each time through the loop. Without it, an optimizing compiler can load UART_STATUS once, conclude that the value never changes inside the loop, and produce an infinite loop or a write without polling.

volatile does not create a hardware ordering guarantee. It constrains compiler optimization around that access. CPUs may still reorder ordinary memory operations around MMIO unless the architecture says otherwise or the code uses memory attributes and barriers.

A common transmit descriptor sequence has two phases. First write a descriptor in DRAM. Then ring a device doorbell register.

struct desc {
    uint64_t addr;
    uint32_t len;
    uint32_t flags;
};

static inline void compiler_barrier(void) {
    __asm__ __volatile__("" ::: "memory");
}

/* AArch64 data synchronization barrier for store completion. */
static inline void dsb_st(void) {
    __asm__ __volatile__("dsb st" ::: "memory");
}

void submit(struct desc *d, volatile uint32_t *doorbell, uint64_t buf, uint32_t n) {
    d->addr = buf;
    d->len = n;
    d->flags = 1;
    compiler_barrier();
    dsb_st();
    *doorbell = 1;
}

If the doorbell store reaches the device before the descriptor stores are visible, the device can fetch stale length or flags. The correct barrier depends on the CPU architecture and on the page attributes used for the MMIO mapping. Device memory is usually mapped as uncached and strongly ordered or as device-nGnRE on ARM systems, but ordinary DRAM descriptors need explicit ordering before the MMIO doorbell.

Port-Mapped I/O Versus MMIO

Port-mapped I/O separates device addresses from memory addresses. x86 keeps this legacy interface. A keyboard controller read can look like this:

    inb $0x60, %al      # read byte from I/O port 0x60 into AL
    outb %al, $0x80     # write byte to I/O port 0x80

The address 0x60 here is not a virtual or physical memory address. It belongs to an I/O port space. The in and out instructions are privileged in many operating-system configurations, so user processes cannot normally execute them.

MMIO uses normal load and store instructions after the OS maps a physical device range into a kernel virtual address. A kernel driver might do:

volatile uint32_t *regs = map_device_region(0x10000000u, 4096);

uint32_t status = regs[1];   /* offset 0x04 */
regs[0] = 'A';               /* offset 0x00 */

The byte-level difference is in address interpretation. With port I/O, instruction opcode plus immediate port selects a separate bus cycle. With MMIO, the virtual address is translated to a physical address, and the memory system routes that physical range to the device.

mmap() Maps Virtual Pages to Objects

mmap() maps a byte range of a file or anonymous object into virtual memory. The address returned is page-aligned. The file offset must be a multiple of the page size, commonly 4096 bytes.

#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <stdint.h>
#include <stdio.h>

int main(void) {
    int fd = open("tiny.csv", O_RDONLY);
    if (fd < 0) return 1;
    struct stat st;
    if (fstat(fd, &st) != 0) return 1;

    char *p = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (p == MAP_FAILED) return 1;

    long sum = 0;
    long x = 0;
    for (off_t i = 0; i < st.st_size; i++) {
        unsigned char c = (unsigned char)p[i];
        if (c >= '0' && c <= '9') {
            x = 10 * x + (c - '0');
        } else if (c == ',' || c == '\n') {
            sum += x;
            x = 0;
        }
    }
    printf("%ld\n", sum);

    munmap(p, st.st_size);
    close(fd);
}

For a file containing these 16 bytes:

31 32 2C 33 34 0A 35 2C 36 0A 37 38 39 2C 31 0A
 1  2  ,  3  4 \n  5  ,  6 \n  7  8  9  ,  1 \n

The parser computes 12 + 34 + 5 + 6 + 789 + 1 = 847.

Suppose the system has 4096-byte pages and the file has 9000 bytes. The mapping covers three virtual pages:

virtual page 0: file offsets 0..4095
virtual page 1: file offsets 4096..8191
virtual page 2: file offsets 8192..8999, then zero-fill to page end

On the first access to p[5000], the PTE is invalid. The CPU raises a page fault. The kernel finds the VMA, reads or locates the file page containing offset 4096, installs a present PTE, and returns. The faulting load restarts and reads the byte.

If the file is not a multiple of the page size, bytes past end-of-file inside the last mapped page read as zero. Access past the requested mapping length is outside the mapping and can fault with SIGSEGV.

MAP_PRIVATE, MAP_SHARED, and Anonymous Memory

MAP_PRIVATE gives the process a private view. File bytes are used as initial contents, but writes do not update the file. The first write to a clean file-backed page triggers copy-on-write.

file page F:        41 42 43 0A ...     "ABC\n"
process A mapping: points to F, read-only
process B mapping: points to F, read-only

A writes p[1] = 'x'

new page A':        41 78 43 0A ...     "AxC\n"
process A mapping: points to A', writable
process B mapping: still points to F
file page F:        41 42 43 0A ...

MAP_SHARED maps writes back to the underlying object. For a regular file, dirty pages are written later by the kernel or by msync(). For a shared-memory object, other processes mapping the same object see the writes after cache coherence and normal memory ordering rules.

int fd = shm_open("/cp_demo", O_CREAT | O_RDWR, 0600);
ftruncate(fd, 4096);

uint64_t *x = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                   MAP_SHARED, fd, 0);
x[0] = 0x1122334455667788ull;

On a little-endian machine, the first eight bytes in memory are:

88 77 66 55 44 33 22 11

Another process mapping /cp_demo with MAP_SHARED will observe the 64-bit value in x[0] once the write becomes visible to it. For interprocess protocols, use atomics or process-shared pthread synchronization. mmap() shares bytes, not invariants.

MAP_ANONYMOUS maps zero-filled memory not backed by a named file. Large malloc() implementations often request memory from the kernel with anonymous private mappings.

void *p = mmap(NULL, 1 << 20, PROT_READ | PROT_WRITE,
               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

At creation, no physical pages need be allocated. The first write to each page creates a zeroed physical page and installs a PTE.

Trade-Offs Against read()

read() copies bytes from the page cache into a user buffer. mmap() maps page-cache pages into the process and lets loads read them directly. For a 1 GiB file and 4096-byte pages, a full scan touches 2^30 / 2^12 = 262144 pages. If every page faults once, the scan has 262144 minor or major fault events. Readahead reduces disk stalls, but fault handling still costs CPU time.

mmap() avoids a copy into a separate user buffer. It works well for random access, sparse access, sharing a large read-only model file across worker processes, and database buffer managers that keep page identity visible. It can cost more than read() for a single sequential pass if fault overhead and TLB misses dominate.

TLB pressure is concrete. With 4096-byte pages and a 64-entry data TLB, the TLB covers only 64 * 4096 = 262144 bytes. A stride that touches one byte per page across a 1 GiB mapping will churn that TLB. Huge pages or batched sequential access patterns reduce the miss rate, but the mapping itself does not remove translation cost.

Key Result

Two invariants explain most behavior.

First, MMIO correctness needs both compiler and hardware ordering:

\text{descriptor stores visible} \prec \text{doorbell MMIO store}

volatile ensures each register access is actually emitted. Barriers and memory attributes enforce the ordering relation.

Second, mmap() does not read the whole file at mapping time. It creates a virtual-memory promise:

\text{virtual address} = \text{object} + \text{page-aligned offset}

A page fault materializes that promise by filling or locating a physical page and updating the page table. MAP_PRIVATE changes the write path by replacing a shared physical page with a copied page for the faulting process. MAP_SHARED keeps the mapping tied to the shared object.

Common Confusions

Watch Out

volatile is not a device protocol

volatile uint32_t *reg prevents the compiler from deleting or merging the register access. It does not flush descriptor writes from a store buffer, does not make cached DRAM visible to a device, and does not create a multi-register transaction. Use architecture barriers, correct cacheability attributes, and the device-specified register order.

Watch Out

mmap is not always faster than read

mmap() removes one user-space copy for file-backed pages, but it adds page-fault handling and can increase TLB misses. For a single sequential scan, read() with a large buffer can match or beat mmap(). For random indexed access to a large file, mmap() often gives simpler code and avoids manual buffer management.

Watch Out

MAP_PRIVATE does not mean a private initial read

With MAP_PRIVATE, clean pages can be shared physically between processes until someone writes. Two processes reading the same mapped file can use the same page-cache physical page. Privacy begins on the first write fault.

Exercises

Exercise · Core

Problem

A 10,000-byte file is mapped with mmap(NULL, 10000, PROT_READ, MAP_PRIVATE, fd, 0) on a system with 4096-byte pages. How many virtual pages are in the mapping, what file offsets do they cover, and what happens when the program reads byte offset 9999?

Exercise · Core

Problem

A little-endian process writes 0x00000001 to a 32-bit MMIO control register at offset 0x08 from physical base 0x10000000. What byte values are placed on byte lanes for address range 0x10000008..0x1000000B? Why is volatile needed in the C lvalue?

Exercise · Advanced

Problem

Two processes map the same 4096-byte file page. Both mappings start with bytes 41 42 43 0A. Process A uses MAP_PRIVATE; process B uses MAP_SHARED. A executes p[1] = 0x78. B then reads the first four bytes. What can B read, assuming no other writer changes the file?

References

Canonical:

  • Remzi H. Arpaci-Dusseau and Andrea C. Arpaci-Dusseau, Operating Systems: Three Easy Pieces (2018), ch. 13-23, address spaces, paging, TLBs, and VM policy
  • Randal E. Bryant and David R. O'Hallaron, Computer Systems: A Programmer's Perspective, 3rd ed. (2016), ch. 9, virtual memory and memory mapping
  • David A. Patterson and John L. Hennessy, Computer Organization and Design RISC-V Edition, 2nd ed. (2021), §4.9 and §4.10, virtual memory and I/O organization
  • Andrew S. Tanenbaum and Herbert Bos, Modern Operating Systems, 4th ed. (2015), ch. 3 and ch. 5, memory management and input/output
  • Intel Corporation, Intel 64 and IA-32 Architectures Software Developer's Manual, Vol. 1, ch. 19 and Vol. 2, IN/OUT, memory ordering and port I/O

Accessible:

  • Linux man-pages project, mmap(2), msync(2), and shm_open(3)
  • Jonathan Corbet, Alessandro Rubini, and Greg Kroah-Hartman, Linux Device Drivers, 3rd ed., ch. 9, open-access discussion of I/O memory and barriers
  • University of Wisconsin CS 537 notes on virtual memory and memory-mapped files

Next Topics

  • /computationpath/page-tables-and-tlbs
  • /computationpath/virtual-memory-areas
  • /computationpath/device-drivers-and-interrupts
  • /computationpath/filesystems-page-cache
  • /topics/cache-coherence