Why This Matters
A 20-stage pipeline that resolves a branch late can throw away roughly 20 cycles of in-flight work on a wrong guess. If branches are 20% of dynamic instructions and each wrong guess adds 20 lost cycles, then a 20% misprediction rate adds 0.2 × 0.2 × 20 = 0.8 cycles per instruction. If the base CPI is 1, that is a 1.8 CPI machine before cache misses enter.
The same pipeline with 95% prediction accuracy pays only 0.2 × 0.05 × 20 = 0.2 extra CPI. This is why modern cores spend silicon on branch history tables, target buffers, return predictors, and multi-table history predictors. For ML systems code, the slow path is often not arithmetic. It is data-dependent control flow in token filters, sparse kernels, parsers, hash probes, and dispatch loops.
Core Definitions
Control hazard
A control hazard occurs when the next program counter is unknown because an in-flight branch, jump, call, or return has not yet been resolved. The front end must either wait or predict the next fetch address.
Branch predictor
A branch predictor is hardware that predicts both direction, taken or not taken, and often target address. Direction prediction answers whether the branch changes control flow. Target prediction answers which address to fetch if it does.
Branch History Table
A Branch History Table, abbreviated BHT, is an array of small predictor states indexed by bits of the branch PC and sometimes by branch history. A simple BHT entry is a 1-bit or 2-bit state.
Branch Target Buffer
A Branch Target Buffer, abbreviated BTB, is a cache-like structure that maps branch PCs to predicted target PCs. It is required for taken branches to redirect fetch early, before decode or execute computes the target.
Misprediction penalty
The misprediction penalty is the number of lost cycles after the core detects that the predicted direction or target was wrong. It includes flushed speculative work and the time to restart fetch at the correct PC.
Pipeline Cost Model
Use this simplified machine. It issues one instruction per cycle, has base CPI 1, resolves conditional branches 20 cycles after fetch, and has dynamic branch frequency f = 0.2.
The branch cost is
extra CPI = f × m × p
where m is the misprediction probability and p is the penalty in cycles. With f = 0.2 and p = 20, a predictor that misses 20% of branches adds
0.2 × 0.20 × 20 = 0.8
extra CPI. A predictor that misses 5% adds
0.2 × 0.05 × 20 = 0.2
extra CPI. The difference is 0.6 cycles per instruction. On a 3 GHz core, one hardware thread with a 1-instruction-per-cycle front end loses about 0.8 billion instruction slots per second (3 GHz / 1.2 CPI ≈ 2.5 billion instructions per second against 3 GHz / 1.8 CPI ≈ 1.67 billion).
A branch consumes front-end bandwidth even when predicted correctly. On a wrong path, fetched bytes, decoded micro-ops, rename entries, issue queue entries, and cache bandwidth can all be spent on instructions that never retire. The architectural state rolls back; the microarchitectural work already happened.
Static Prediction
Static prediction uses only the instruction, its address, or a compiler hint. A common rule is backward taken and forward not taken. A backward conditional branch often closes a loop, so predicting taken repeats the loop body. A forward conditional branch often skips a rare block, so predicting not taken continues the fall-through path.
Consider this C loop:
int sum_positive(const int *a, int n) {
int s = 0;
for (int i = 0; i < n; i++) {
if (a[i] > 0) s += a[i];
}
return s;
}
A typical loop branch at the bottom jumps backward while i < n. For n = 1000, the loop branch is taken 999 times and not taken once. Backward-taken gets 999 correct and 1 wrong.
The inner if depends on data. If signs are random, a static forward-not-taken rule reaches about 50% on that branch. With a 20-cycle penalty, the branch can dominate the cost of the integer add.
A stylized x86-64 loop branch looks like this:
.Lloop:
mov eax, DWORD PTR [rdi + rcx*4]
test eax, eax
jle .Lskip # forward branch, static rule often predicts not taken
add edx, eax
.Lskip:
inc rcx
cmp rcx, rsi
jl .Lloop # backward branch, static rule often predicts taken
Static rules are cheap, but they cannot learn that a specific forward branch is usually taken or that a loop branch exits every 8th iteration.
One-Bit and Two-Bit Local Predictors
A 1-bit predictor stores the last observed direction for each BHT entry. Its state is one bit.
| State bit | Prediction | After actual not taken | After actual taken |
|---|---|---|---|
| 0 | not taken | 0 | 1 |
| 1 | taken | 0 | 1 |
For a loop branch with directions T T T T N, repeated, the 1-bit predictor does poorly at loop boundaries. Assume it starts in taken state.
| Branch outcome | Prediction | Correct? | Next state |
|---|---|---|---|
| T | T | yes | T |
| T | T | yes | T |
| T | T | yes | T |
| T | T | yes | T |
| N | T | no | N |
| T | N | no | T |
It misses on the exit and again on the first iteration of the next loop. The single bit flips too quickly.
A 2-bit saturating counter fixes this common case. The four states are usually named strongly not taken, weakly not taken, weakly taken, and strongly taken. Prediction uses the high bit. Actual taken increments the counter up to 3. Actual not taken decrements it down to 0.
| Counter | Bits | Prediction |
|---|---|---|
| 0 | 00 | not taken |
| 1 | 01 | not taken |
| 2 | 10 | taken |
| 3 | 11 | taken |
For the same repeated T T T T N pattern, starting at 11, the exit changes the state to 10 but the next loop entry is still predicted taken.
#include <stdint.h>
static uint8_t update2(uint8_t c, int taken) {
if (taken) return c == 3 ? 3 : (uint8_t)(c + 1);
return c == 0 ? 0 : (uint8_t)(c - 1);
}
static int predict2(uint8_t c) {
return c >= 2;
}
A 4096-entry table of 2-bit counters needs 8192 bits, or 1024 bytes, excluding tags and control. If indexed by low PC bits, entry index might be (pc >> 2) & 0xfff for 4-byte aligned instructions. That shift discards the two zero alignment bits.
Example indexes:
| Branch PC | pc >> 2 | 12-bit BHT index |
|---|---|---|
| 0x400640 | 0x100190 | 0x190 |
| 0x401640 | 0x100590 | 0x590 |
| 0x500640 | 0x140190 | 0x190 |
The first and third PCs alias in the BHT. If one branch is usually taken and the other usually not taken, they fight over the same counter. This aliasing is destructive when unrelated branches map to one entry.
Two-Level Adaptive Prediction
Yeh-Patt two-level adaptive prediction adds history. A global history register, GHR, records the last branch outcomes as bits. A pattern history table stores counters indexed by a combination of PC bits and the GHR.
Let T = 1 and N = 0. With a 3-bit GHR, the recent sequence T N T can be stored as binary 101. On each resolved branch,
ghr = ((ghr << 1) | taken) & 0x7;
A simple gshare-style index xors PC bits with GHR:
uint32_t index(uint64_t pc, uint32_t ghr) {
uint32_t pc_index = (uint32_t)(pc >> 2) & 0x7;
return pc_index ^ ghr;
}
Worked trace with an 8-entry pattern table, all counters initialized to weakly taken 10, and branch PC index 011:
| Step | GHR before | Index 011 xor GHR | Prediction | Actual | Counter after | GHR after |
|---|---|---|---|---|---|---|
| 1 | 000 | 011 | T | T | 11 | 001 |
| 2 | 001 | 010 | T | N | 01 | 010 |
| 3 | 010 | 001 | T | T | 11 | 101 |
| 4 | 101 | 110 | T | N | 01 | 010 |
| 5 | 010 | 001 | T | T | 11 | 101 |
The same static branch can use different counters under different histories. This captures patterns such as "this branch is taken if the previous comparison failed" or alternating paths in small decision trees.
Two-level predictors still have limits. The table is finite. Different (PC, history) pairs collide. Long histories cost storage and update energy, and speculative history must be repaired after a wrong path.
TAGE Predictors
TAGE stands for tagged geometric history length. It is a family of predictors built from several predictor tables, each indexed using a different history length. The history lengths grow roughly geometrically, such as 0, 4, 8, 16, 32, 64, though real designs choose sizes and folding functions with care.
Each tagged table entry stores a prediction counter, a partial tag, and metadata that tracks usefulness. A lookup consults multiple tables. A table with a matching tag and the longest history usually supplies the prediction. A shorter matching table can act as an alternate prediction when the long-history entry is weak or untrusted.
A compact conceptual entry layout might be:
| Field | Bits | Meaning |
|---|---|---|
| prediction counter | 3 | signed or saturating direction confidence |
| tag | 10 | partial match to reduce aliasing |
| usefulness | 2 | replacement preference |
| total | 15 | per entry before array overhead |
The point is not that an entry must be exactly 15 bits. Hardware packs arrays for area and timing. What matters is that TAGE spends bits on tags so that a long-history prediction is not blindly shared by many unrelated branches.
Example lookup:
| Table | History length | Index | Stored tag | Computed tag | Match? | Counter |
|---|---|---|---|---|---|---|
| base | 0 | 0x190 | none | none | yes | weak T |
| T1 | 4 | 0x033 | 0x12a | 0x12a | yes | strong N |
| T2 | 16 | 0x2c1 | 0x071 | 0x070 | no | weak T |
| T3 | 64 | 0x091 | 0x155 | 0x155 | yes | weak T |
The provider is T3 because it is the longest matching table. If T3 is weak and the alternate T1 says strong not taken, update rules may train the provider, allocate another entry on a miss, or prefer the alternate for a while. Full TAGE designs contain details beyond this page, but the operating picture is simple. Many history lengths compete, tags reduce aliasing, and replacement tries to keep entries that have paid rent.
BTB, Indirect Branches, and Returns
Direction prediction is not enough. A taken branch needs a target address early enough to keep fetch busy. The BTB maps a branch PC to a predicted target. A direct conditional branch has an immediate offset in the instruction, but waiting until decode may still be too late for a wide, deep front end.
A BTB entry conceptually contains:
| Field | Example |
|---|---|
| branch PC tag | high bits of 0x400640 |
| predicted target | 0x4005f0 |
| branch type | conditional direct |
| valid bit | 1 |
Indirect branches are harder because the target comes from a register or memory. C++ virtual calls, jump tables, interpreter dispatch, and function pointers compile to indirect control transfers. The same branch PC can jump to many targets.
struct Op {
virtual int apply(int x) const = 0;
};
int run(const Op **ops, int n, int x) {
for (int i = 0; i < n; i++) {
x = ops[i]->apply(x); // often an indirect call
}
return x;
}
A direct call has a target encoded by the instruction. An indirect call resembles:
mov rax, QWORD PTR [rdi + rcx*8]
mov rax, QWORD PTR [rax]
call QWORD PTR [rax + 16] # target depends on object type
Returns use a special return address predictor, often a small return-address stack. A call pushes the next PC. A ret predicts the top. Deep recursion, exceptions, coroutines, and mismatched call-return behavior can disturb this structure.
Branchy Code, cmov, and If-Conversion
A branch is fast when it is predictable and slow when the data distribution defeats the predictor. The classic example is summing positive values from an array with random signs.
Branch version:
int sum_pos_branch(const int *a, int n) {
int s = 0;
for (int i = 0; i < n; i++) {
if (a[i] > 0) s += a[i];
}
return s;
}
Branchless version:
int sum_pos_mask(const int *a, int n) {
int s = 0;
for (int i = 0; i < n; i++) {
int x = a[i];
s += x & -(x > 0);
}
return s;
}
The mask expression makes (x > 0) produce 0 or 1, negates it to 0x00000000 or 0xffffffff, and masks out negative values. For x = 5, the mask is 0xffffffff, so the contribution is 5. For x = -7, the mask is 0, so the contribution is 0.
A compiler may also use cmov on x86-64:
mov eax, DWORD PTR [rdi + rcx*4]
test eax, eax
mov r8d, 0
cmovg r8d, eax
add edx, r8d
cmovg avoids a control-flow change, so there is no direction misprediction. It still creates a data dependency, and both source computations may need to be available. A predictable branch can be better than cmov when one side is expensive or rarely executed. A branchless form can be better when the branch outcome is close to random and both sides are cheap.
Spectre v1 exploits speculative execution after a mispredicted bounds check. The predictor trains the CPU to enter a path before the bounds check resolves, and transient instructions can touch cache lines based on out-of-bounds data. The architectural state rolls back, but cache timing can reveal information. See /computationpath/spectre-v1 for the security model.
Key Result
For first-order performance estimates, use this invariant.
With a 20-stage pipeline, a 20-cycle penalty, and 20% branch density, the branch term is easy to compute. Since f × p = 0.2 × 20 = 4, each absolute percentage point of misprediction rate costs 0.04 CPI.
| Accuracy | Misprediction rate | Extra CPI |
|---|---|---|
| 80% | 20% | 0.80 |
| 90% | 10% | 0.40 |
| 95% | 5% | 0.20 |
| 98% | 2% | 0.08 |
The practical invariant is that direction and target must both be available before the front end runs out of useful fetch work. A correct direction with a missing BTB target still stalls taken-branch fetch. A correct target with a wrong direction still flushes the path.
Common Confusions
A 2-bit counter does not remember the last two outcomes
A 2-bit saturating counter is not a two-event shift register. State 10 means weakly taken, not "the last two outcomes were taken then not taken." It is a small confidence state. Starting from 11, the sequence T T T N ends at 10; starting from 00, the sequence N T T T ends at 11. The counter summarizes confidence, not the literal recent outcome string.
A high branch prediction rate does not make branchy code free
A branch that misses 5% of the time can still dominate a tight loop if the loop body is only a few instructions. With a 20-cycle penalty and 5% misses, the expected cost is 1 cycle per dynamic branch. That is already larger than an integer add on many cores.
The BTB is not the same structure as the direction predictor
The direction predictor says taken or not taken. The BTB supplies the fetch address for taken control flow. A conditional branch can have a correct direction prediction and still lose cycles if its target is absent from the BTB.
Exercises
Problem
A loop branch has outcome pattern T T T T T T T N, repeated. Compare a 1-bit predictor and a 2-bit saturating counter. Both start in taken state, with the 2-bit counter at 11. Count misses over two full repetitions.
Problem
A machine has base CPI 0.9, branch frequency 18%, and misprediction penalty 17 cycles. Compute CPI at 92% and 97% prediction accuracy.
Problem
A BHT has 1024 entries of 2-bit counters indexed by (pc >> 2) & 0x3ff. Compute the indexes for branch PCs 0x400640, 0x401640, and 0x400644. State which aliases occur.
References
Canonical:
- John L. Hennessy and David A. Patterson, Computer Architecture: A Quantitative Approach, 6th ed. (2017), §1.5 and §3.3, quantitative CPI modeling and control hazards
- Randal E. Bryant and David R. O'Hallaron, Computer Systems: A Programmer's Perspective, 3rd ed. (2016), ch. 4 and §5.7, pipelined processors and branch effects in optimized code
- David A. Patterson and John L. Hennessy, Computer Organization and Design RISC-V Edition, 2nd ed. (2020), §4.8, control hazards and branch prediction
- John Paul Shen and Mikko H. Lipasti, Modern Processor Design (2005), ch. 5, branch prediction and speculative execution
- Tse-Yu Yeh and Yale N. Patt, "Two-Level Adaptive Training Branch Prediction," MICRO-24 (1991), original two-level adaptive predictor paper
- André Seznec and Pierre Michaud, "A Case for Partially Tagged Geometric History Length Branch Prediction," Journal of Instruction-Level Parallelism 8 (2006), TAGE predictor family
Accessible:
- Agner Fog, The microarchitecture of Intel, AMD and VIA CPUs, sections on branch prediction and branch target buffers
- Dan Luu, "Branch prediction," an accessible worked explanation with 1-bit and 2-bit predictor examples
- MIT 6.823 Computer System Architecture lecture notes, branch prediction lectures
Next Topics
- /computationpath/pipelining
- /computationpath/out-of-order-execution
- /computationpath/cache-hierarchy
- /computationpath/spectre-v1
- /topics/markov-chains