Why This Matters
A load that misses in L2 can cost 30 to 80 cycles; a DRAM miss can cost hundreds. An in-order scalar pipeline that reaches mov rax, [rdi] and needs rax for the next instruction stalls the younger instructions behind it, even if those younger instructions are independent additions, address computations, or loop bookkeeping.
Out-of-order execution changes the contract inside the core. The machine fetches and decodes in program order, issues ready operations when their operands exist, then retires results in program order. The programmer sees precise exceptions and the architectural register file; the core spends most cycles moving tags, checking dependency bits, and deciding which waiting operations can run.
Core Definitions
Architectural register
An architectural register is a register named by the instruction set, such as rax in x86-64 or x5 in RISC-V. Program-visible state is defined in terms of architectural registers, memory, the program counter, and status state.
Physical register
A physical register is a storage slot inside the implementation. Register renaming maps each architectural destination to a fresh physical register so that independent dynamic instructions do not fight over the same architectural name.
Reorder buffer
A reorder buffer, usually abbreviated ROB, is a FIFO-like structure that records in-flight instructions in program order. It permits out-of-order execution while enforcing in-order retirement and precise exceptions.
Reservation station
A reservation station is a scheduler entry holding one decoded operation, its source tags or values, its destination tag, and readiness bits. When all sources are ready and a matching functional unit is free, the instruction can issue.
In-Order Pipelines Stall on Name and Data Hazards
Consider this short sequence, written in simplified x86-64 syntax.
; rdi points to an array, rcx is a loop index
1: mov rax, [rdi] ; load, maybe misses cache
2: add rbx, rbx ; independent
3: lea rcx, [rcx + 1] ; independent
4: add rax, 7 ; depends on instruction 1
5: mov [rdi], rax ; depends on instruction 4
In an in-order issue machine, instruction 4 cannot issue before instruction 1 produces rax. Because issue happens in program order, instructions 2 and 3 sit behind the stalled dependence chain even though they are independent. The machine may have an idle integer ALU, but the issue rule blocks it.
There are three classic hazard classes.
- RAW, read after write. Instruction 4 reads the value produced by instruction 1.
- WAW, write after write. Two instructions write the same architectural name.
- WAR, write after read. A younger write could overwrite a name before an older read consumes it.
Only RAW is a true data dependence. WAW and WAR are name dependences caused by reuse of a small architectural register set. Register renaming removes those two.
A compact example shows the difference.
1: add r1, r2, r3 ; r1 = old r2 + old r3
2: mul r4, r1, r5 ; true RAW on r1
3: add r1, r6, r7 ; WAW with instruction 1
4: sub r8, r1, r9 ; reads the new r1 from instruction 3
Instruction 3 need not wait for instruction 2 merely because both mention r1. After renaming, the dynamic stream could be recorded this way.
| Instruction | Architectural destination | Physical destination | Source physical registers |
|---|---|---|---|
| 1 | r1 | p40 | p2, p3 |
| 2 | r4 | p41 | p40, p5 |
| 3 | r1 | p42 | p6, p7 |
| 4 | r8 | p43 | p42, p9 |
The WAW on r1 has become two distinct writes to p40 and p42. Instruction 2 still depends on instruction 1 because it reads p40.
Register Renaming and the Rename Table
A rename table maps architectural registers to the latest physical register allocated for each one. A free list supplies unused physical registers. On decode or rename, a destination allocates a new physical register, and the old mapping is saved so it can be released after retirement.
Example initial state:
| Architectural | Current physical |
|---|---|
| r1 | p1 |
| r2 | p2 |
| r3 | p3 |
| r4 | p4 |
| r5 | p5 |
Rename this sequence.
A: add r1, r2, r3
B: sub r2, r1, r5
C: xor r1, r4, r5
Step-by-step state:
| Step | Instruction | Source tags | New mapping | Old mapping kept in ROB |
|---|---|---|---|---|
| 0 | before | r1->p1, r2->p2 | | |
| 1 | A | p2, p3 | r1->p10 | old r1=p1 |
| 2 | B | p10, p5 | r2->p11 | old r2=p2 |
| 3 | C | p4, p5 | r1->p12 | old r1=p10 |
The table after C maps r1 to p12. Instruction B still reads p10, so it gets the value produced by A. The old physical register p1 cannot be reused until A retires, because an exception before A commits must restore the older architectural state.
A typical ROB entry needs fields close to the following layout. Exact bit counts vary, but the dependency logic follows this shape.
| Byte offset | Field | Example |
|---|---|---|
| 0 | opcode class | integer add |
| 1 | destination architectural register | r1 |
| 2 | destination physical register | p10 |
| 3 | old physical register | p1 |
| 4 to 11 | program counter | 0x400610 |
| 12 | exception flags | none |
| 13 | done bit | 0 then 1 |
| 14 to 15 | age or queue index | ROB[37] |
The rename table is speculative state. A branch misprediction discards younger mappings and restores a checkpoint, or it walks the ROB backward to undo them.
Reservation Stations and Tag Broadcast
After renaming, instructions enter scheduling structures. Each source operand is either a value or a tag naming the producer. A scheduler entry might look like this:
| Entry | Operation | Source 0 | Source 1 | Destination | Ready? |
|---|---|---|---|---|---|
| 12 | add | value 8 | value 9 | p20 | yes |
| 13 | imul | tag p20 | value 5 | p21 | no |
| 14 | add | value 3 | value 4 | p22 | yes |
| 15 | sub | tag p21 | tag p22 | p23 | no |
If entry 12 issues and later broadcasts p20 = 17, every waiting entry compares its source tags with p20. Entry 13 changes from tag to value.
| Cycle | Broadcast | Effect |
|---|---|---|
| 100 | p20=17 | entry 13 source 0 becomes ready |
| 101 | p22=7 | entry 15 source 1 becomes ready |
| 104 | p21=85 | entry 15 source 0 becomes ready |
This is the central trick from Tomasulo-style machines, though modern cores use varied scheduler organizations. The expensive operation is not the addition itself. The expensive operation is waking up many waiting entries, selecting a subset that matches available ports, then broadcasting results fast enough to feed the next cycle.
A four-wide issue machine cannot issue more than four micro-operations in one cycle, even if the dependency graph has 20 ready nodes. A six-wide retire machine cannot retire more than six completed micro-operations per cycle. Peak IPC is bounded by front-end bandwidth, issue width, execution ports, memory ports, and retirement bandwidth.
For a loop with 1000 decoded micro-operations, a 4-wide retire limit gives a best-case lower bound of 1000 / 4 = 250 cycles. If 300 of those micro-operations need a single load port that accepts one per cycle, the load port alone gives a lower bound of 300 cycles. The real time is at least the maximum of such resource bounds and the critical path length.
Reorder Buffer and Precise Retirement
Out-of-order execution would be unusable for ordinary programs if exceptions could expose half-finished state. The ROB solves this by separating completion from retirement. An instruction completes when its execution result exists. It retires when it reaches the head of the ROB, has completed, and no older instruction must trap.
Consider this instruction stream.
1: mov rax, [rdi] ; page fault
2: add rbx, rcx ; completes
3: imul r8, r9 ; completes
4: mov [rsi], r8 ; address computed
Instructions 2 and 3 can execute before instruction 1 receives its fault response. They cannot retire before instruction 1. The store at instruction 4 cannot update the coherent memory system as an architectural store until it is allowed to commit. Many machines let stores compute address and data early, then hold them in a store queue until retirement or a closely related commit point.
A simple retirement trace:
| Cycle | ROB head | Completed? | Action |
|---|---|---|---|
| 30 | instr 1 | no | retire none |
| 31 | instr 1 | no, page walk pending | retire none |
| 80 | instr 1 | fault | flush younger entries |
| 81 | trap handler PC | fetched | architectural state matches before instr 1 |
Precise exception means the trap is reported as if instructions before the faulting one completed and instructions after it never executed. That property is required by operating systems, debuggers, and language runtimes.
Speculation Window, Branches, and Cache Miss Absorption
Branch prediction lets the front end continue fetching after unresolved branches. The out-of-order window holds speculative work behind those branches. If the prediction is correct, the machine has already renamed and maybe executed many younger instructions. If it is wrong, their results are discarded by flushing the speculative portion of the ROB and restoring the rename state.
Here is a small branch example.
int f(int *a, int n, int x) {
    int s = 0;
    for (int i = 0; i < n; i++) {
        if (a[i] != x) s += a[i];
    }
    return s;
}
A compiler may produce a load, compare, conditional branch, and add. If the branch predictor learns that a[i] != x is usually true, it predicts the add path. The core can execute the load for a[i+1] before the branch for a[i] retires, as long as dependence and memory-order rules allow it.
The size of the window matters. Suppose the ROB holds 224 micro-operations, the loop body decodes to 7 micro-operations, and an L3 hit takes 40 cycles. The window can hold about 224 / 7 = 32 iterations. If the core can place enough independent cache misses and arithmetic inside those 32 iterations, it hides part of the latency. If every iteration depends on the previous loaded value, the window fills and issue stalls.
Speculation is not architectural commitment. A speculative load can bring a cache line into L1, update predictor tables, or occupy a fill buffer. If the branch later mispredicts, architectural registers are restored, but those microarchitectural traces need not be erased. Spectre, Meltdown, and Microarchitectural Data Sampling attacks exploit such traces. This page only names the boundary; the security details belong in a separate topic on transient execution side channels.
Memory Disambiguation and the Load-Store Queue
Registers are easy to rename because register names are explicit. Memory is harder because two addresses may alias, and a store address may not be known when a younger load is ready.
Modern out-of-order cores use a load-store queue, often split into a load queue and a store queue. It tracks in-flight memory operations in program order, stores computed addresses and data, and checks ordering violations.
Example:
1: mov [rdi], rax ; store address ready late
2: mov rcx, [rsi] ; load address ready early
3: add rdx, rcx
If rdi and rsi are different, instruction 2 can execute before instruction 1 has committed. If they are equal, instruction 2 must read the store data from instruction 1 or wait. If instruction 2 speculates early and later discovers that instruction 1 stored to the same address, the core replays the load and dependent instructions.
A byte-level example makes the alias concrete.
| Operation | Address | Bytes |
|---|---|---|
| older store | 0x1000 | aa bb cc dd |
| younger load | 0x1002 | reads 4 bytes |
The younger load overlaps bytes at 0x1002 and 0x1003. It cannot read only from cache as if the store did not exist. It needs forwarding or a replay. If the store writes aa bb cc dd to 0x1000..0x1003 and memory already has 11 22 33 44 55 66 at 0x1000..0x1005, a 4-byte load from 0x1002 should see cc dd 55 66 on a little-endian byte-addressed machine when no other older store covers the last two bytes.
The memory-ordering hardware also enforces the ISA memory model. x86-64 has stronger ordering than many RISC designs for ordinary loads and stores, but even x86 cores execute loads and stores with internal speculation as long as the architectural ordering rules are preserved.
Key Result
For a fixed dynamic instruction stream, an out-of-order core is bounded by both dependence height and machine widths.
cycles >= max( N / W_retire, N / W_issue, max over resources r of N_r / C_r, L_chain )

Here N is the number of retired micro-operations, W_retire is retirement width, W_issue is issue width, N_r is the count of operations needing resource r, C_r is that resource's per-cycle capacity, and L_chain is the dependency-chain latency.
A numeric pass:
| Quantity | Value |
|---|---|
| retired micro-operations | 1200 |
| retire width | 4 |
| issued micro-operations | 1200 |
| issue width | 6 |
| load micro-operations | 420 |
| load ports | 2 per cycle |
| critical path | 260 cycles |
Bounds:
- retirement: 1200 / 4 = 300 cycles
- issue: 1200 / 6 = 200 cycles
- load ports: 420 / 2 = 210 cycles
- critical path: 260 cycles

The lower bound is max(300, 200, 210, 260) = 300 cycles. Even perfect prediction and unlimited cache bandwidth would not exceed 4 IPC at retirement here. Real processors add front-end misses, cache misses, branch misses, scheduler conflicts, and finite physical registers.
Common Confusions
Out-of-order execution changes single-thread semantics
It does not change architectural semantics for correctly executed instructions. The ISA defines in-order effects. The implementation may execute instruction 20 before instruction 10, but retirement and exception rules make the committed result match program order. The caveat is microarchitectural state: caches, predictors, and queues can retain traces from wrong-path execution.
Register renaming removes all dependences
Renaming removes WAW and WAR name hazards. It cannot remove RAW dependences. If instruction B needs the value produced by instruction A, B waits for A's physical destination tag to become ready.
A larger ROB always gives higher IPC
A larger window helps only when the program has independent work beyond the current stalls. Pointer chasing such as p = p->next has a long RAW chain. A 512-entry ROB cannot issue the next load address until the previous load returns.
Exercises
Problem
Rename the following instruction sequence. Initial mappings are r1->p1, r2->p2, r3->p3, r4->p4, r5->p5. The free list begins p10, p11, p12, p13. Give the source physical registers and destination physical register for each instruction.
1: add r1, r2, r3
2: add r2, r1, r4
3: sub r1, r5, r2
4: xor r4, r1, r3
Problem
A loop body decodes to 8 micro-operations. A core has a 192-entry ROB, 4-wide retirement, 6-wide issue, and 2 load ports. The loop runs for 100 iterations and contains 3 load micro-operations per iteration. Ignore branch misses. Compute three lower bounds on total cycles: retirement, issue, and load-port capacity.
Problem
An older store writes 4 bytes 10 20 30 40 starting at address 0x2001. A younger 4-byte load reads starting at 0x2003. Existing memory bytes from 0x2000 to 0x2006 are aa bb cc dd ee ff 99. What byte sequence should the load receive if it executes after the store in program order and no other stores overlap?
References
Canonical:
- Hennessy and Patterson, Computer Architecture: A Quantitative Approach (6th ed, 2017), §3.3, covers dynamic scheduling and Tomasulo-style execution
- Hennessy and Patterson, Computer Architecture: A Quantitative Approach (6th ed, 2017), §3.4, covers speculation, reorder buffers, and multiple issue
- Hennessy and Patterson, Computer Architecture: A Quantitative Approach (6th ed, 2017), ch. 2, covers memory hierarchy costs that motivate latency hiding
- Bryant and O'Hallaron, Computer Systems: A Programmer's Perspective (3rd ed, 2016), §4.5, covers pipelined processor control and hazards
- Bryant and O'Hallaron, Computer Systems: A Programmer's Perspective (3rd ed, 2016), ch. 5, covers program performance and processor limits
- Shen and Lipasti, Modern Processor Design (2005), ch. 5, covers dynamic scheduling, register renaming, and reorder buffers
Accessible:
- Onur Mutlu, Computer Architecture Lecture Notes, lectures on out-of-order execution and memory dependence handling
- Daniel J. Sorin, Mark D. Hill, and David A. Wood, A Primer on Memory Consistency and Cache Coherence
- Paul Kocher et al., “Spectre Attacks: Exploiting Speculative Execution” (2019)
Next Topics
- /computationpath/branch-prediction
- /computationpath/cache-hierarchy
- /computationpath/memory-consistency
- /computationpath/transient-execution-side-channels