Why This Matters
A 20-stage pipeline that resolves a branch late can throw away roughly 20 cycles of in-flight work on a wrong guess. If branches are 20% of dynamic instructions and each wrong guess adds 20 lost cycles, then a 20% misprediction rate adds 0.2 × 0.2 × 20 = 0.8 cycles per instruction. If the base CPI is 1, that is a 1.8 CPI machine before cache misses enter.
The same pipeline with 95% prediction accuracy pays only 0.2 × 0.05 × 20 = 0.2 extra CPI. This is why modern cores spend silicon on branch history tables, target buffers, return predictors, and multi-table history predictors. For ML systems code, the slow path is often not arithmetic. It is data-dependent control flow in token filters, sparse kernels, parsers, hash probes, and dispatch loops.
Core Definitions
Control hazard
A control hazard occurs when the next program counter is unknown because an in-flight branch, jump, call, or return has not yet been resolved. The front end must either wait or predict the next fetch address.
Branch predictor
A branch predictor is hardware that predicts both direction, taken or not taken, and often target address. Direction prediction answers whether the branch changes control flow. Target prediction answers which address to fetch if it does.
Branch History Table
A Branch History Table, abbreviated BHT, is an array of small predictor states indexed by bits of the branch PC and sometimes by branch history. A simple BHT entry is a 1-bit or 2-bit state.
Branch Target Buffer
A Branch Target Buffer, abbreviated BTB, is a cache-like structure that maps branch PCs to predicted target PCs. It is required for taken branches to redirect fetch early, before decode or execute computes the target.
Misprediction penalty
The misprediction penalty is the number of lost cycles after the core detects that the predicted direction or target was wrong. It includes flushed speculative work and the time to restart fetch at the correct PC.
Pipeline Cost Model
Use this simplified machine. It issues one instruction per cycle, has base CPI 1, resolves conditional branches 20 cycles after fetch, and has dynamic branch frequency f = 0.2.
The branch cost is
extra CPI = f × m × p
where m is the misprediction probability and p is the penalty in cycles. With f = 0.2 and p = 20, a predictor that misses 20% of branches adds
0.2 × 0.20 × 20 = 0.8
extra CPI. A predictor that misses 5% adds
0.2 × 0.05 × 20 = 0.2
extra CPI. The difference is 0.6 cycles per instruction. On a 3 GHz core, one hardware thread with a 1-instruction-per-cycle front end loses about 0.8 billion instruction slots per second (3 GHz / 1.2 CPI ≈ 2.5 billion instructions per second against 3 GHz / 1.8 CPI ≈ 1.67 billion).
A branch consumes front-end bandwidth even when predicted correctly. On a wrong path, fetched bytes, decoded micro-ops, rename entries, issue queue entries, and cache bandwidth can all be spent on instructions that never retire. The architectural state rolls back; the microarchitectural work already happened.
Static Prediction
Static prediction uses only the instruction, its address, or a compiler hint. A common rule is backward taken and forward not taken. A backward conditional branch often closes a loop, so predicting taken repeats the loop body. A forward conditional branch often skips a rare block, so predicting not taken continues the fall-through path.
Consider this C loop:
int sum_positive(const int *a, int n) {
int s = 0;
for (int i = 0; i < n; i++) {
if (a[i] > 0) s += a[i];
}
return s;
}
A typical loop branch at the bottom jumps backward while i < n. For n = 1000, the loop branch is taken 999 times and not taken once. Backward-taken gets 999 correct and 1 wrong.
The inner if depends on data. If signs are random, a static forward-not-taken rule reaches about 50% on that branch. With a 20-cycle penalty, the branch can dominate the cost of the integer add.
A stylized x86-64 loop branch looks like this:
.Lloop:
mov eax, DWORD PTR [rdi + rcx*4]
test eax, eax
jle .Lskip # forward branch, static rule often predicts not taken
add edx, eax
.Lskip:
inc rcx
cmp rcx, rsi
jl .Lloop # backward branch, static rule often predicts taken
Static rules are cheap, but they cannot learn that a specific forward branch is usually taken or that a loop branch exits every 8th iteration.
One-Bit and Two-Bit Local Predictors
A 1-bit predictor stores the last observed direction for each BHT entry. Its state is one bit.
| State bit | Prediction | After actual not taken | After actual taken |
|---|---|---|---|
| 0 | not taken | 0 | 1 |
| 1 | taken | 0 | 1 |
For a loop branch with directions T T T T N, repeated, the 1-bit predictor does poorly at loop boundaries. Assume it starts in taken state.
| Branch outcome | Prediction | Correct? | Next state |
|---|---|---|---|
| T | T | yes | T |
| T | T | yes | T |
| T | T | yes | T |
| T | T | yes | T |
| N | T | no | N |
| T | N | no | T |
It misses on the exit and again on the first iteration of the next loop. The single bit flips too quickly.
A 2-bit saturating counter fixes this common case. The four states are usually named strongly not taken, weakly not taken, weakly taken, and strongly taken. Prediction uses the high bit. Actual taken increments the counter up to 3. Actual not taken decrements it down to 0.
| Counter | Bits | Prediction |
|---|---|---|
| 0 | 00 | not taken |
| 1 | 01 | not taken |
| 2 | 10 | taken |
| 3 | 11 | taken |
For the same repeated T T T T N pattern, starting at 11, the exit changes the state to 10 but the next loop entry is still predicted taken.
#include <stdint.h>
static uint8_t update2(uint8_t c, int taken) {
if (taken) return c == 3 ? 3 : (uint8_t)(c + 1);
return c == 0 ? 0 : (uint8_t)(c - 1);
}
static int predict2(uint8_t c) {
return c >= 2;
}
A 4096-entry table of 2-bit counters needs 8192 bits, or 1024 bytes, excluding tags and control. If indexed by low PC bits, entry index might be (pc >> 2) & 0xfff for 4-byte aligned instructions. That shift discards the two zero alignment bits.
Example indexes:
| Branch PC | pc >> 2 | 12-bit BHT index |
|---|---|---|
| 0x400640 | 0x100190 | 0x190 |
| 0x401640 | 0x100590 | 0x590 |
| 0x500640 | 0x140190 | 0x190 |
The first and third PCs alias in the BHT. If one branch is usually taken and the other usually not taken, they fight over the same counter. This aliasing is destructive when unrelated branches map to one entry.
Two-Level Adaptive Prediction
Yeh-Patt two-level adaptive prediction adds history. A global history register, GHR, records the last branch outcomes as bits. A pattern history table stores counters indexed by a combination of PC bits and the GHR.
Let T = 1 and N = 0. With a 3-bit GHR, the recent sequence T N T can be stored as binary 101. On each resolved branch,
ghr = ((ghr << 1) | taken) & 0x7;
A simple gshare-style index xors PC bits with GHR:
uint32_t index(uint64_t pc, uint32_t ghr) {
uint32_t pc_index = (uint32_t)(pc >> 2) & 0x7;
return pc_index ^ ghr;
}
Worked trace with an 8-entry pattern table, all counters initialized to weakly taken 10, and branch PC index 011:
| Step | GHR before | Index 011 xor GHR | Prediction | Actual | Counter after | GHR after |
|---|---|---|---|---|---|---|
| 1 | 000 | 011 | T | T | 11 | 001 |
| 2 | 001 | 010 | T | N | 01 | 010 |
| 3 | 010 | 001 | T | T | 11 | 101 |
| 4 | 101 | 110 | T | N | 01 | 010 |
| 5 | 010 | 001 | T | T | 11 | 101 |
The same static branch can use different counters under different histories. This captures patterns such as "this branch is taken if the previous comparison failed" or alternating paths in small decision trees.
Two-level predictors still have limits. The table is finite. Different (PC, history) pairs collide. Long histories cost storage and update energy, and speculative history must be repaired after a wrong path.
TAGE Predictors
TAGE stands for tagged geometric history length. It is a family of predictors built from several predictor tables, each indexed using a different history length. The history lengths grow roughly geometrically, such as 0, 4, 8, 16, 32, 64, though real designs choose sizes and folding functions with care.
Each tagged table entry stores a prediction counter, a partial tag, and metadata that tracks usefulness. A lookup consults multiple tables. A table with a matching tag and the longest history usually supplies the prediction. A shorter matching table can act as an alternate prediction when the long-history entry is weak or untrusted.
A compact conceptual entry layout might be:
| Field | Bits | Meaning |
|---|---|---|
| prediction counter | 3 | signed or saturating direction confidence |
| tag | 10 | partial match to reduce aliasing |
| usefulness | 2 | replacement preference |
| total | 15 | per entry before array overhead |
The point is not that an entry must be exactly 15 bits. Hardware packs arrays for area and timing. What matters is that TAGE spends bits on tags so that a long-history prediction is not blindly shared by many unrelated branches.
Example lookup:
| Table | History length | Index | Stored tag | Computed tag | Match? | Counter |
|---|---|---|---|---|---|---|
| base | 0 | 0x190 | none | none | yes | weak T |
| T1 | 4 | 0x033 | 0x12a | 0x12a | yes | strong N |
| T2 | 16 | 0x2c1 | 0x071 | 0x070 | no | weak T |
| T3 | 64 | 0x091 | 0x155 | 0x155 | yes | weak T |
The provider is T3 because it is the longest matching table. If T3 is weak and the alternate T1 says strong not taken, update rules may train the provider, allocate another entry on a miss, or prefer the alternate for a while. Full TAGE designs contain details beyond this page, but the operating picture is simple. Many history lengths compete, tags reduce aliasing, and replacement tries to keep entries that have paid rent.
BTB, Indirect Branches, and Returns
Direction prediction is not enough. A taken branch needs a target address early enough to keep fetch busy. The BTB maps a branch PC to a predicted target. A direct conditional branch has an immediate offset in the instruction, but waiting until decode may still be too late for a wide, deep front end.
A BTB entry conceptually contains:
| Field | Example |
|---|---|
| branch PC tag | high bits of 0x400640 |
| predicted target | 0x4005f0 |
| branch type | conditional direct |
| valid bit | 1 |
Indirect branches are harder because the target comes from a register or memory. C++ virtual calls, jump tables, interpreter dispatch, and function pointers compile to indirect control transfers. The same branch PC can jump to many targets.
struct Op {
virtual int apply(int x) const = 0;
};
int run(const Op **ops, int n, int x) {
for (int i = 0; i < n; i++) {
x = ops[i]->apply(x); // often an indirect call
}
return x;
}
A direct call has a target encoded by the instruction. An indirect call resembles:
mov rax, QWORD PTR [rdi + rcx*8]
mov rax, QWORD PTR [rax]
call QWORD PTR [rax + 16] # target depends on object type
Returns use a special return address predictor, often a small return-address stack. A call pushes the next PC. A ret predicts the top. Deep recursion, exceptions, coroutines, and mismatched call-return behavior can disturb this structure.
Branchy Code, cmov, and If-Conversion
A branch is fast when it is predictable and slow when the data distribution defeats the predictor. The classic example is summing positive values from an array with random signs.
Branch version:
int sum_pos_branch(const int *a, int n) {
int s = 0;
for (int i = 0; i < n; i++) {
if (a[i] > 0) s += a[i];
}
return s;
}
Branchless version:
int sum_pos_mask(const int *a, int n) {
int s = 0;
for (int i = 0; i < n; i++) {
int x = a[i];
s += x & -(x > 0);
}
return s;
}
The mask expression makes (x > 0) produce 0 or 1, negates it to 0x00000000 or 0xffffffff, and masks out negative values. For x = 5, the mask is 0xffffffff, so the contribution is 5. For x = -7, the mask is 0, so the contribution is 0.
A compiler may also use cmov on x86-64:
mov eax, DWORD PTR [rdi + rcx*4]
test eax, eax
mov r8d, 0
cmovg r8d, eax
add edx, r8d
cmovg avoids a control-flow change, so there is no direction misprediction. It still creates a data dependency, and both source computations may need to be available. A predictable branch can be better than cmov when one side is expensive or rarely executed. A branchless form can be better when the branch outcome is close to random and both sides are cheap.
Spectre v1 exploits speculative execution after a mispredicted bounds check. The predictor trains the CPU to enter a path before the bounds check resolves, and transient instructions can touch cache lines based on out-of-bounds data. The architectural state rolls back, but cache timing can reveal information. See /computationpath/spectre-v1 for the security model.
Key Result
For first-order performance estimates, use this invariant.
With a 20-stage pipeline, a 20-cycle penalty, and 20% branch density, the branch term is easy to compute. Since f × p = 0.2 × 20 = 4, each absolute percentage point of misprediction rate costs 0.04 CPI.
| Accuracy | Misprediction rate | Extra CPI |
|---|---|---|
| 80% | 20% | 0.80 |
| 90% | 10% | 0.40 |
| 95% | 5% | 0.20 |
| 98% | 2% | 0.08 |
The practical invariant is that direction and target must both be available before the front end runs out of useful fetch work. A correct direction with a missing BTB target still stalls taken-branch fetch. A correct target with a wrong direction still flushes the path.
Common Confusions
A 2-bit counter does not remember the last two outcomes
A 2-bit saturating counter is not a two-event shift register. State 10 means weakly taken, not "the last two outcomes were taken then not taken." It is a small confidence state. Starting from 11, the sequence T T T N ends at 10; starting from 00, the sequence N T T T ends at 11. The counter summarizes confidence, not the literal recent outcome string.
A high branch prediction rate does not make branchy code free
A branch that misses 5% of the time can still dominate a tight loop if the loop body is only a few instructions. With a 20-cycle penalty and 5% misses, the expected cost is 1 cycle per dynamic branch. That is already larger than an integer add on many cores.
The BTB is not the same structure as the direction predictor
The direction predictor says taken or not taken. The BTB supplies the fetch address for taken control flow. A conditional branch can have a correct direction prediction and still lose cycles if its target is absent from the BTB.
Exercises
Problem
A loop branch has outcome pattern T T T T T T T N, repeated. Compare a 1-bit predictor and a 2-bit saturating counter. Both start in taken state, with the 2-bit counter at 11. Count misses over two full repetitions.
Problem
A machine has base CPI 0.9, branch frequency 18%, and misprediction penalty 17 cycles. Compute CPI at 92% and 97% prediction accuracy.
Problem
A BHT has 1024 entries of 2-bit counters indexed by (pc >> 2) & 0x3ff. Compute the indexes for branch PCs 0x400640, 0x401640, and 0x400644. State which aliases occur.
References
Canonical:
- John L. Hennessy and David A. Patterson, Computer Architecture: A Quantitative Approach, 6th ed. (2017), §1.5 and §3.3, quantitative CPI modeling and control hazards
- Randal E. Bryant and David R. O'Hallaron, Computer Systems: A Programmer's Perspective, 3rd ed. (2016), ch. 4 and §5.7, pipelined processors and branch effects in optimized code
- David A. Patterson and John L. Hennessy, Computer Organization and Design RISC-V Edition, 2nd ed. (2020), §4.8, control hazards and branch prediction
- John Paul Shen and Mikko H. Lipasti, Modern Processor Design (2005), ch. 5, branch prediction and speculative execution
- Tse-Yu Yeh and Yale N. Patt, "Two-Level Adaptive Training Branch Prediction," MICRO-24 (1991), original two-level adaptive predictor paper
- André Seznec and Pierre Michaud, "A Case for Partially Tagged Geometric History Length Branch Prediction," Journal of Instruction-Level Parallelism 8 (2006), TAGE predictor family
Accessible:
- Agner Fog, The microarchitecture of Intel, AMD and VIA CPUs, sections on branch prediction and branch target buffers
- Dan Luu, "Branch prediction," an accessible worked explanation with 1-bit and 2-bit predictor examples
- MIT 6.823 Computer System Architecture lecture notes, branch prediction lectures
Next Topics
- /computationpath/pipelining
- /computationpath/out-of-order-execution
- /computationpath/cache-hierarchy
- /computationpath/spectre-v1
- /topics/markov-chains