Processes and Threads in Operating Systems

Why This Matters

A Linux fork() can return in tens of microseconds on a quiet machine even when the parent maps gigabytes of memory, because page tables are copied lazily and data pages are marked copy-on-write. The first write after the fork can cost a page fault, a physical page allocation, and a 4096-byte copy.

Threads remove the address-space switch when computation moves between tasks in the same process. That matters for inference servers, data loaders, RPC runtimes, and training input pipelines. A thread switch still saves registers and scheduler state, but it need not change the page table base register or invalidate address translations for a different address space.

Core Definitions

Definition

Process

A process is an executing program with an isolated virtual address space, a table of open file descriptors, signal dispositions, credentials, resource limits, and at least one thread. The address space maps virtual page numbers to physical frames or to backing objects such as files.

Definition

Thread

A thread is a schedulable execution context inside a process. It has its own program counter, stack pointer, general registers, floating-point and vector registers when live, kernel scheduling state, and usually thread-local storage. Threads in the same process share the address space and open file descriptor table.

Definition

fork

fork() creates a child process whose initial memory image is a logical duplicate of the parent. The call returns 0 in the child and the child's process ID in the parent. Modern Unix systems implement the memory duplicate with copy-on-write page table entries.

Definition

exec

execve() replaces the current process image with a new program. The process ID remains, many process attributes remain, but the old user address space is discarded and replaced by mappings for the executable, loader, libraries, heap, and stack.

Definition

Zombie process

A zombie is an exited child whose exit status still occupies a kernel process-table entry because the parent has not collected it with wait() or waitpid(). It has no running user address space.

Process State: Address Space, File Descriptors, and One Thread

A process is more than just code. It is a container for mappings and kernel objects. A small 64-bit Linux process after dynamic linking often has mappings like this:

virtual range              permissions   backing object
0x0000555555554000-...     r-x           /bin/cat text
0x0000555555754000-...     rw-           /bin/cat data
0x00007ffff7dc0000-...     r-x           libc.so
0x00007ffffffde000-...     rw-           user stack
0xffffffffff600000-...     r-x           vsyscall page

Each mapping is page-granular. With 4096-byte pages, virtual address 0x7ffffffde888 has page offset 0x888 and virtual page number 0x7ffffffde. The page table translates that page number to a physical frame when the CPU touches the address.

The file descriptor table is also per process unless a Linux clone() call asks to share it. A typical descriptor table after shell redirection might look like this:

fd   kernel object                     offset     flags
0    pipe read end                     n/a        close-on-exec off
1    /tmp/out.txt open file desc       8192       close-on-exec off
2    terminal device                   n/a        close-on-exec off
3    socket inode 98211                n/a        close-on-exec on

Descriptors are integers in the process. The entries point to kernel file descriptions. Duplicating descriptor 1 with dup2(1, 5) creates another descriptor that refers to the same open file description, so the file offset is shared.

The initial thread enters at the program entry point, later reaches main, and carries a stack. On little-endian x86-64, a stack area holding a return address and two 64-bit local values can be viewed as bytes:

address          bytes                         meaning
0x7fffffffe0d0   88 e2 ff ff ff 7f 00 00       saved rbp = 0x00007fffffffe288
0x7fffffffe0d8   6a 13 40 00 00 00 00 00       return rip = 0x000000000040136a
0x7fffffffe0e0   2a 00 00 00 00 00 00 00       local long x = 42
0x7fffffffe0e8   00 04 00 00 00 00 00 00       local size = 1024

A second thread in the same process would see the same heap address for malloc data but a different stack range and a different current instruction pointer.

fork, Copy-On-Write, exec, and wait

The fork and exec model separates process creation from program replacement. A shell forks, the child sets up file descriptors, and the child executes a new program. The parent waits or continues.

#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    pid_t pid = fork();

    if (pid < 0) {
        perror("fork");
        exit(1);
    }

    if (pid == 0) {
        // Child: replace this process image with /bin/echo.
        char *argv[] = {"echo", "child ran exec", NULL};
        execv("/bin/echo", argv);
        perror("execv");   // Only reached on error.
        _exit(127);
    }

    // Parent: pid is the child's process ID.
    int status = 0;
    pid_t got = waitpid(pid, &status, 0);
    printf("waitpid returned %ld, raw status=0x%x\n", (long)got, status);
    return 0;
}

If /bin/echo exits with code 0, the low byte of the raw status is used for signal state and the exit code is encoded in higher bits on POSIX-like systems. Portable code uses WIFEXITED(status) and WEXITSTATUS(status).

Copy-on-write makes fork() cheap until either process writes. Suppose a parent maps three writable pages:

virtual page   physical frame   PTE before fork
0x40000        0x91a20          writable, private
0x40001        0x91a21          writable, private
0x40002        0x91a22          writable, private

Immediately after fork(), parent and child page tables can both point to the same frames, with write permission removed and a COW bit recorded by the kernel:

process   virtual page   physical frame   PTE after fork
parent    0x40000        0x91a20          read-only, COW
child     0x40000        0x91a20          read-only, COW
parent    0x40001        0x91a21          read-only, COW
child     0x40001        0x91a21          read-only, COW

If the child stores one byte to virtual address 0x40001020, the CPU raises a protection fault. The kernel allocates a new 4096-byte frame, copies frame 0x91a21 into it, updates only the child's PTE for page 0x40001, and resumes the store.

execve() then discards these user mappings. It does not create a new process ID. File descriptors remain open unless marked close-on-exec. That one flag is why shell pipelines work and why long-running servers set FD_CLOEXEC on listening sockets or secret-bearing descriptors before spawning helpers.

A child that exits before the parent waits becomes a zombie. The kernel must keep the PID, parent link, resource usage, and exit status so the parent can observe them. Calling waitpid(-1, &status, WNOHANG) in a reaper loop is common in supervisors.

pthread_create and Linux clone

POSIX threads present a portable interface. On Linux, NPTL implements pthread_create() using clone() with flags that request sharing of process resources. The exact flags vary by architecture and library version, but the core idea is shared virtual memory plus separate execution context.

#include <pthread.h>
#include <stdio.h>

static int counter = 0;

void *worker(void *arg) {
    int *p = (int *)arg;
    counter += *p;          // Shared global. This has a data race.
    return NULL;
}

int main(void) {
    pthread_t t;
    int x = 7;
    pthread_create(&t, NULL, worker, &x);
    pthread_join(t, NULL);
    printf("counter=%d\n", counter);
}

Both threads can read and write counter, because they share the same address space. The local variable x lives on the main thread's stack, but the worker can still dereference its address while main is alive. That is legal only because the address space is shared; it would not work across fork() without shared memory.

Linux exposes the lower-level mechanism through clone():

#define _GNU_SOURCE
#include <sched.h>
#include <stdlib.h>
#include <stdio.h>
#include <unistd.h>

static int child_fn(void *arg) {
    int *shared = (int *)arg;
    *shared = 123;
    return 0;
}

int main(void) {
    int shared = 0;
    char *stack = malloc(1 << 20);
    char *stack_top = stack + (1 << 20);

    int flags = CLONE_VM | CLONE_FS | CLONE_FILES | CLONE_SIGHAND | SIGCHLD;
    pid_t tid = clone(child_fn, stack_top, flags, &shared);
    if (tid == -1) { perror("clone"); return 1; }

    waitpid(tid, NULL, 0);
    printf("shared=%d\n", shared);
    free(stack);
}

CLONE_VM shares the address space. CLONE_FILES shares the descriptor table. CLONE_FS shares filesystem state such as current working directory. CLONE_SIGHAND shares signal handlers. Real pthread implementations also set up thread-local storage, futex-based joins, signal masks, guard pages, and bookkeeping for cancellation and cleanup handlers.

Without CLONE_VM, the child would get a separate address space like fork(), and the parent's shared would remain 0 after the child wrote its private copy.

Switching Costs: Registers, Page Tables, and TLBs

A context switch saves the outgoing execution context and loads the incoming one. For two threads in one process, the kernel can keep the same address space. For two processes, it usually changes the page table root.

A simplified x86-64 thread switch saves callee-saved registers and stack pointer in a thread control block:

# rdi = &old->ctx, rsi = &new->ctx
switch_threads:
    movq %rsp, 0(%rdi)
    movq %rbx, 8(%rdi)
    movq %rbp, 16(%rdi)
    movq %r12, 24(%rdi)
    movq %r13, 32(%rdi)
    movq %r14, 40(%rdi)
    movq %r15, 48(%rdi)

    movq 0(%rsi), %rsp
    movq 8(%rsi), %rbx
    movq 16(%rsi), %rbp
    movq 24(%rsi), %r12
    movq 32(%rsi), %r13
    movq 40(%rsi), %r14
    movq 48(%rsi), %r15
    ret

This toy context is 56 bytes. Real kernels also handle flags, interrupt state, kernel stack metadata, FPU and SIMD state, debug registers, and accounting. AVX-512 state alone can be kilobytes, though kernels save it lazily or with CPU support where possible.

The address-space part is the expensive distinction. The CPU caches virtual-to-physical translations in the TLB. If the OS switches from process A to process B and writes a new page-table root, many cached translations for A cannot be used for B. With PCIDs or ASIDs, some entries can remain tagged by address space; without them, a broad flush loses locality.

A numeric example shows the shape. Suppose a loop touches 2048 distinct 4 KiB pages, one cache line per page. A 128-entry data TLB can hold translations for only 128 of those pages. If a process switch flushes the TLB, the next pass over 2048 pages has up to 2048 page-walks. If each page-walk costs 100 cycles, translation misses add about 204800 cycles. At 3.2 GHz that is about 64 microseconds. The numbers vary by CPU, but the mechanism is fixed.

Thread switches within one process keep the same page tables, so TLB entries for shared code, heap, and libraries remain valid. The switch still disrupts caches when the new thread works on different data, and lock handoff can add futex syscalls.

User Threads, Kernel Threads, and Hybrid Designs

A user thread package schedules multiple user-level threads on one kernel thread. Switching can be a few register moves and a stack pointer change because the kernel is not entered. The cost is poor interaction with blocking syscalls and multicore parallelism unless the runtime uses nonblocking I/O or multiple kernel threads. Fiber and green-thread libraries fit here, but their APIs are outside this page.

A one-to-one kernel-thread design maps each user-visible thread to a schedulable kernel entity. POSIX threads on modern Linux use this model. Blocking read() blocks one thread, not all threads in the process, and multiple threads can run on different cores.

Hybrid M:N systems multiplex M user threads over N kernel threads. They reduce kernel scheduling overhead for some workloads but add runtime and kernel coordination. Scheduler activations are the classic research design. Most general-purpose systems moved away from exposing M:N threading as the default process model because signal delivery, blocking I/O, debugger state, and language runtime invariants became hard to specify.

Key Result

A practical process and thread model is governed by two invariants and one cost relation.

process identity = address space + kernel resource tables + at least one thread
thread identity  = registers + stack + scheduler state + membership in a process

For a switch between runnable contexts,

T_{\text{process switch}} \approx T_{\text{thread switch}} + T_{\text{address-space switch}} + T_{\text{TLB refill}}

The relation is approximate because caches, PCIDs, interrupt timing, SIMD state, and scheduler policy change the measured value. It is still the right design rule for systems work. Use processes when fault isolation, separate permissions, or independent lifecycle matter. Use threads when shared memory and cheap communication outweigh data-race risk.

A second invariant concerns fork and exec:

\text{fork copies the process abstraction; exec replaces the user program image}

fork() preserves the parent program image in the child until the child changes it. exec() preserves the process identity while replacing the address space. That split is why a shell can set up descriptors after fork() but before exec().

Common Confusions

Watch Out

A process is not the same thing as a program

/bin/grep is an executable file. Running it three times creates three processes, each with a PID, address space, descriptor table, and thread. The same bytes on disk can back text mappings in all three.

Watch Out

fork does not eagerly copy all RAM

After fork(), a 16 GiB parent does not imply a 16 GiB memory copy. Writable private pages are usually shared as read-only COW mappings. The cost appears when parent or child writes those pages.

Watch Out

Threads share memory, not stacks

Threads in one process share the address space, so one thread can write another thread's stack if it has the address. They do not use the same stack pointer or the same stack region during normal execution.

Watch Out

A zombie is not a running process

A zombie has already exited. It consumes a small kernel table entry, not CPU time. The fix is for the parent to call wait() or waitpid(), or for the child to be reparented to a process that reaps it.

Exercises

ExerciseCore

Problem

A parent process maps two writable 4096-byte pages. Page A maps to physical frame 0x1000 and page B maps to frame 0x1001. The parent calls fork(). The child writes one byte to page B. Draw the page table entries for parent and child after the write.

ExerciseCore

Problem

A program opens /tmp/x, then calls fork(). Parent and child both inherit descriptor 3, which points to the same open file description with offset 0. The parent writes 100 bytes through fd 3. The child then writes 50 bytes through fd 3. What offset is stored in the shared open file description after both writes?

ExerciseAdvanced

Problem

Assume a process switch flushes a 256-entry TLB. A workload then touches 1024 different 4 KiB pages once. Each TLB miss costs 80 cycles beyond the cache access. Estimate the added time at 2.5 GHz. Compare it with a thread switch that keeps those translations hot.

References

Canonical:

Remzi H. Arpaci-Dusseau and Andrea C. Arpaci-Dusseau, Operating Systems: Three Easy Pieces (2018), ch. 4-12, 27-32, 36-43, 48, process API, address spaces, concurrency, locks, and distributed file system background.
Abraham Silberschatz, Peter B. Galvin, and Greg Gagne, Operating System Concepts (10th ed., 2018), ch. 3-4, processes, threads, scheduling context, and interprocess communication.
Andrew S. Tanenbaum and Herbert Bos, Modern Operating Systems (4th ed., 2015), ch. 2, processes, threads, IPC, and scheduling.
Maurice J. Bach, The Design of the UNIX Operating System (1986), ch. 6-7, Unix process control, fork, exec, signals, and scheduling.
Michael Kerrisk, The Linux Programming Interface (2010), ch. 24-28, 29-30, 52, process creation, program execution, process termination, POSIX threads, and clone.
David A. Patterson and John L. Hennessy, Computer Organization and Design RISC-V Edition (2nd ed., 2020), §5.7-5.8, virtual memory, TLBs, and address translation costs.

Accessible:

Bryant and O'Hallaron, Computer Systems: A Programmer's Perspective (3rd ed.), ch. 8, Unix process control with concrete C examples.
Linux man-pages project, fork(2), execve(2), waitpid(2), clone(2), and pthread_create(3), API semantics and Linux-specific flags.
OSTEP online chapters, "The Abstraction: The Process", "Interlude: Process API", "Threads Intro", and "Locks", open textbook coverage with runnable examples.

Next Topics

/computationpath/scheduling
/computationpath/locks-and-condition-variables
/computationpath/virtual-memory
/computationpath/system-calls-and-kernel-boundaries