
Distributed Consensus

How do distributed nodes agree on a value? The FLP impossibility result, Paxos, Raft, Byzantine fault tolerance, and why consensus underpins replicated state machines from databases to distributed ML training.


Why This Matters

Distributed systems must make decisions: which transaction committed first, what the current state of the database is, which parameter server update to apply. Consensus is the problem of getting multiple nodes to agree on a single value, even when some nodes crash or messages are delayed. This problem is harder than it appears. The FLP impossibility theorem proves that no deterministic algorithm can guarantee consensus in an asynchronous network if even one node can crash. Every real consensus protocol (Paxos, Raft, PBFT) works around this impossibility by making additional assumptions. These protocols underpin every replicated database, distributed lock service, and coordination layer in modern infrastructure. In ML, consensus-like coordination appears in parameter servers, AllReduce synchronization barriers, and federated learning aggregation.

The Consensus Problem

Definition

Consensus

The consensus problem: n processes each propose a value. They must agree on a single value from among the proposals. Three properties must hold:

  1. Agreement: no two correct processes decide different values.
  2. Validity: the decided value must have been proposed by some process.
  3. Termination: every correct process eventually decides a value.

Agreement and validity are safety properties (nothing bad happens). Termination is a liveness property (something good eventually happens).
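The three properties can be phrased as mechanical checks over a finished execution. A minimal sketch (the function name and trace encoding are mine, chosen for illustration):

```python
def check_consensus(proposals, decisions):
    """Check agreement, validity, and termination for one execution.

    proposals: dict mapping process id -> proposed value
    decisions: dict mapping process id -> decided value, or None if the
               process never decided (only correct processes appear here)
    """
    decided = [v for v in decisions.values() if v is not None]
    # Agreement: no two correct processes decide different values.
    agreement = len(set(decided)) <= 1
    # Validity: every decided value was proposed by some process.
    validity = all(v in proposals.values() for v in decided)
    # Termination: every correct process eventually decides.
    termination = all(v is not None for v in decisions.values())
    return agreement, validity, termination

# All three processes decide 1, which p2 and p3 proposed: a valid execution.
print(check_consensus({"p1": 0, "p2": 1, "p3": 1},
                      {"p1": 1, "p2": 1, "p3": 1}))  # (True, True, True)
```

Note that agreement and validity can be checked on any finite prefix (safety), while termination is only refutable by an infinite execution (liveness); the checker above can only confirm it for a completed run.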

Definition

System Model: Crash Failures

In the crash failure model, a faulty process simply stops executing. It does not send corrupted messages or behave maliciously. A process is either correct (follows the protocol forever) or faulty (crashes at some point and never recovers). Messages between correct processes are eventually delivered but with unbounded delay (asynchronous model) or within a known bound (synchronous model).

Definition

System Model: Byzantine Failures

In the Byzantine failure model, a faulty process can behave arbitrarily: send conflicting messages to different processes, delay responses, or actively try to prevent agreement. This models adversarial behavior, software bugs, or hardware corruption. Byzantine faults are strictly harder to tolerate than crash faults.

The FLP Impossibility Theorem

Theorem

FLP Impossibility

Statement

There is no deterministic consensus protocol that satisfies agreement, validity, and termination in an asynchronous system if even one process may crash.

Formally: for any deterministic protocol P with n ≥ 2 processes and at most f = 1 crash failure, there exists an admissible execution of P in which no process ever decides.

Intuition

The core difficulty is that a slow process and a crashed process are indistinguishable in an asynchronous network. If you wait for a slow process, you may wait forever (violating termination). If you proceed without it, you risk a split decision if the "slow" process was actually alive and about to send a different proposal. The FLP proof constructs an adversarial scheduler that keeps the system in an indeterminate state forever by carefully delaying messages.

Proof Sketch

The proof proceeds in two parts.

Part 1: Bivalent initial configurations exist. A configuration is bivalent if both 0 and 1 are still possible decisions depending on future events. A configuration is univalent (0-valent or 1-valent) if only one decision is reachable. By a counting argument, at least one initial configuration must be bivalent. Consider initial configurations that differ in one process's input. Adjacent configurations (differing in one process) that lead to different decisions must have a bivalent configuration between them, because if the differing process crashes before acting, the remaining processes cannot distinguish the two configurations.

Part 2: From any bivalent configuration, another bivalent configuration is reachable. Consider a pending message e = (p, m) (message m to process p). Let C be the set of configurations reachable from the current bivalent configuration without delivering e, and let D be the configurations obtained by then delivering e. If D contains only univalent configurations, it must contain both 0-valent and 1-valent ones (otherwise the original configuration would be univalent). By examining configurations that differ only in whether e is delivered before or after some other message, and considering what happens if process p crashes, we derive a contradiction. Therefore D must contain a bivalent configuration.

By repeatedly applying Part 2, the adversary constructs an infinite execution that never reaches a decision.

Why It Matters

FLP is one of the most important impossibility results in computer science. It explains why every practical consensus protocol must weaken one of the three assumptions: use synchrony assumptions (timeout-based failure detectors), use randomization (probabilistic termination), or sacrifice termination in some executions. Paxos and Raft use timeouts (partial synchrony). Ben-Or's protocol uses randomization. Understanding FLP is necessary to understand why consensus protocols are designed the way they are.

Failure Mode

FLP applies only to deterministic protocols in purely asynchronous systems. It does not say consensus is impossible in practice. Practical systems are not purely asynchronous: messages usually arrive within bounded time, and timeout mechanisms detect likely failures. Randomized protocols (which are not deterministic) can achieve consensus with probability 1 even in asynchronous systems. FLP is a theoretical lower bound, not a practical prohibition.

Working Around FLP: Partial Synchrony

Definition

Partial Synchrony

The partial synchrony model (Dwork, Lynch, Stockmeyer 1988) assumes that there exists an unknown bound Δ on message delivery time and an unknown time T after which the bound holds. Before time T, the system may behave asynchronously. After T, messages arrive within Δ. This model is realistic: networks are usually fast but occasionally experience delays. Consensus is solvable in this model with n ≥ 2f + 1 processes for f crash faults.
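The practical tool built on partial synchrony is a timeout-based failure detector: suspect a process that misses a heartbeat, and grow the timeout on every false suspicion so that, after the unknown stabilization time T, the timeout eventually exceeds the unknown bound Δ. A minimal sketch (the class and its interface are mine, not from any particular library):

```python
class EventualFailureDetector:
    """Adaptive timeout-based failure detector for partial synchrony (sketch).

    A process is suspected if no heartbeat arrives within the current
    timeout. A heartbeat from a suspected process is a false suspicion,
    so the timeout doubles; eventually it exceeds the true (unknown)
    message delay bound and suspicions become accurate.
    """

    def __init__(self, initial_timeout=1.0):
        self.timeout = initial_timeout
        self.last_heartbeat = {}   # process id -> time of last heartbeat
        self.suspected = set()

    def on_heartbeat(self, process, now):
        if process in self.suspected:
            # False suspicion: the process was merely slow. Adapt.
            self.suspected.discard(process)
            self.timeout *= 2
        self.last_heartbeat[process] = now

    def tick(self, now):
        # Suspect every process whose last heartbeat is too old.
        for p, t in self.last_heartbeat.items():
            if now - t > self.timeout:
                self.suspected.add(p)
        return self.suspected
```

This is the "eventually perfect" style of detector: it may suspect live processes during unstable periods, but after the network stabilizes it stops making mistakes, which is exactly what leader-based protocols such as Paxos and Raft need for liveness.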

Paxos

Definition

Paxos Protocol

Paxos (Lamport 1998) solves consensus under crash failures with the following roles:

  • Proposers: suggest values
  • Acceptors: vote on proposals (a majority quorum suffices)
  • Learners: learn the decided value

The protocol has two phases:

Phase 1 (Prepare):

  1. Proposer selects a unique proposal number n and sends Prepare(n) to a majority of acceptors.
  2. Each acceptor responds with a promise not to accept proposals numbered less than n, along with the highest-numbered proposal it has already accepted (if any).

Phase 2 (Accept):

  1. If the proposer receives promises from a majority, it sends Accept(n, v) where v is the value from the highest-numbered previously accepted proposal (or the proposer's own value if no prior proposals exist).
  2. Each acceptor accepts the proposal if it has not promised to a higher-numbered proposal.

A value is chosen when a majority of acceptors have accepted it.
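The two phases can be sketched as a single-decree acceptor state machine plus the proposer's value-selection rule. This is an illustrative model under the crash-failure assumption (names and message encodings are mine, and real implementations must persist acceptor state to disk):

```python
class Acceptor:
    """Single-decree Paxos acceptor (sketch, in-memory state only)."""

    def __init__(self):
        self.promised = -1       # highest proposal number promised
        self.accepted_n = -1     # highest proposal number accepted
        self.accepted_v = None   # value of that accepted proposal

    def prepare(self, n):
        # Phase 1b: promise to ignore proposals numbered below n,
        # and report the highest-numbered proposal already accepted.
        if n > self.promised:
            self.promised = n
            return ("promise", self.accepted_n, self.accepted_v)
        return ("reject",)

    def accept(self, n, v):
        # Phase 2b: accept unless promised to a higher-numbered proposal.
        if n >= self.promised:
            self.promised = n
            self.accepted_n, self.accepted_v = n, v
            return ("accepted", n, v)
        return ("reject",)

def choose_value(promises, own_value):
    # Phase 2a rule: re-propose the value of the highest-numbered
    # accepted proposal reported in Phase 1, else use our own value.
    accepted = [(n, v) for (_, n, v) in promises if v is not None]
    return max(accepted)[1] if accepted else own_value

acceptors = [Acceptor() for _ in range(3)]
quorum = acceptors[:2]                          # any majority of 3
promises = [a.prepare(1) for a in quorum]
v = choose_value(promises, own_value="X")       # no prior proposals -> "X"
for a in quorum:
    a.accept(1, v)                              # "X" is now chosen

# A later proposer's Phase 1 quorum overlaps the accepting majority,
# so it discovers and re-proposes "X" instead of its own value "Y".
promises2 = [a.prepare(2) for a in quorum]
print(choose_value(promises2, own_value="Y"))   # X
```

The last two lines demonstrate the safety mechanism from the theorem below the protocol description: quorum intersection forces every later proposer to inherit the chosen value.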

Theorem

Paxos Safety

Statement

Paxos satisfies agreement and validity: if a value v is chosen (accepted by a majority), then no different value v' ≠ v can be chosen.

Intuition

The two-phase structure ensures that once a majority accepts v, any future proposer will discover v in Phase 1 (because its quorum must overlap with the accepting majority in at least one acceptor) and will be forced to re-propose v rather than a new value. The majority quorum intersection is the key invariant: any two majorities share at least one member.

Proof Sketch

By induction on proposal number. Suppose value v is chosen with proposal number n. For any proposal n' > n: in Phase 1 of proposal n', the proposer contacts a majority. This majority overlaps with the majority that accepted (n, v). The overlapping acceptor reports having accepted (n, v). The proposer must use the value from the highest-numbered accepted proposal among the Phase 1 responses. By the inductive hypothesis, all accepted proposals between n and n' also have value v. Therefore the proposer proposes v.

Why It Matters

Paxos is the foundational consensus algorithm. Google's Chubby lock service, Apache ZooKeeper (which uses Zab, an atomic broadcast protocol closely related to Paxos), and many distributed databases use Paxos or a direct descendant. Understanding Paxos clarifies the design of every modern consensus system.

Failure Mode

Paxos does not guarantee termination (liveness) without additional assumptions. If two proposers repeatedly preempt each other with increasing proposal numbers (a livelock scenario), no value is ever chosen. Practical implementations use a leader election mechanism: a single distinguished proposer avoids conflicts. Multi-Paxos extends single-decree Paxos to a sequence of decisions (a replicated log) by reusing the leader across many consensus instances.

Raft

Definition

Raft Protocol

Raft (Ongaro and Ousterhout 2014) is a consensus algorithm designed for understandability, equivalent in power to Multi-Paxos but with a clearer decomposition into subproblems:

  1. Leader election: nodes are in one of three states: follower, candidate, or leader. A leader sends periodic heartbeats. If a follower receives no heartbeat within a randomized timeout, it becomes a candidate and requests votes. A candidate receiving votes from a majority becomes the leader for that term (a monotonically increasing epoch number).

  2. Log replication: the leader receives client requests, appends them to its log, and replicates log entries to followers. An entry is committed once the leader has replicated it to a majority. Committed entries are durable and will not be lost.

  3. Safety: a node grants its vote only if the candidate's log is at least as up-to-date as its own (the election restriction). Because winning requires votes from a majority, the elected leader's log is at least as up-to-date as that of a majority of nodes, which guarantees the leader holds all committed entries.

Raft's key simplification over Paxos: the leader is the single point of coordination. All client requests go through the leader, and log entries flow in one direction (leader to followers). This eliminates the complex multi-proposer scenarios in Paxos and makes the protocol easier to implement correctly.
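The election restriction reduces to a small comparison rule: logs are ranked first by the term of their last entry, then by length. A sketch of the voter-side check (the function name and argument order are mine; the comparison itself follows the Raft paper's definition of "up-to-date"):

```python
def log_up_to_date(candidate_last_term, candidate_last_index,
                   voter_last_term, voter_last_index):
    """Raft election restriction (sketch): a voter grants its vote only if
    the candidate's log is at least as up-to-date as the voter's own log.
    Compare the term of the last entry first; break ties by log length."""
    if candidate_last_term != voter_last_term:
        return candidate_last_term > voter_last_term
    return candidate_last_index >= voter_last_index

# A candidate whose last entry has a higher term wins even with a shorter log:
print(log_up_to_date(3, 5, 2, 10))  # True
# With equal last terms, the longer log is more up-to-date:
print(log_up_to_date(2, 4, 2, 7))   # False
```

Comparing by term before length is what prevents a node that was partitioned away (accumulating entries in a stale term) from winning an election and overwriting committed entries.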

Byzantine Fault Tolerance

Definition

Byzantine Generals Problem

The Byzantine Generals Problem (Lamport, Shostak, Pease 1982): n generals must agree on a battle plan (attack or retreat). Some generals may be traitors who send conflicting messages. The loyal generals must agree on the same plan, and if all loyal generals propose the same plan, that plan must be chosen. This models arbitrary (Byzantine) failures in distributed systems.

Theorem

Byzantine Fault Tolerance Lower Bound

Statement

Consensus is impossible with n ≤ 3f processes if f processes are Byzantine. Equivalently, Byzantine consensus requires n ≥ 3f + 1 processes.

Intuition

With 3 generals and 1 traitor, the traitor can tell one loyal general "attack" and the other "retreat." Each loyal general sees one vote for attack and one for retreat, plus its own proposal. Without knowing who the traitor is, neither can determine the correct plan. The 3f + 1 bound generalizes this: you need enough honest nodes to form a quorum that outweighs the Byzantine nodes.

Proof Sketch

The proof uses a simulation argument. Suppose a protocol works with n = 3f processes and f Byzantine faults. Partition the processes into three groups A, B, C of size f. Construct three scenarios:

  1. C is Byzantine, processes in A propose 0, processes in B propose 1. The protocol must decide 0 or 1.
  2. A is Byzantine. Honest B and C must agree.
  3. B is Byzantine. Honest A and C must agree.

In each scenario, the Byzantine group simulates the behavior of honest processes from another scenario. The honest processes cannot distinguish scenarios 1 and 2 (or 1 and 3), leading to contradictory decisions. The contradiction proves n = 3f is insufficient.

Why It Matters

The 3f + 1 bound sets the fundamental cost of Byzantine tolerance: you need more than three times as many total nodes as faulty ones. PBFT (Practical BFT) by Castro and Liskov (1999) achieves this bound with O(n^2) message complexity. Modern blockchain consensus protocols (Tendermint, HotStuff) build on this foundation.

Failure Mode

The 3f + 1 bound assumes no digital signatures (authenticated channels only). With digital signatures, the bound relaxes to n ≥ 2f + 1 for synchronous systems because traitors cannot forge messages. However, the FLP impossibility still applies in asynchronous settings even with signatures, so practical BFT protocols require partial synchrony assumptions for liveness.

PBFT (Practical Byzantine Fault Tolerance)

PBFT (Castro and Liskov 1999) is the first practical BFT protocol for asynchronous systems (with a synchrony assumption for liveness). It tolerates f Byzantine faults among n = 3f + 1 nodes.

The protocol operates in three phases for each client request:

  1. Pre-prepare: the leader (primary) assigns a sequence number to the request and broadcasts a pre-prepare message.
  2. Prepare: each replica broadcasts a prepare message. A replica is "prepared" when it receives 2f matching prepare messages (plus its own).
  3. Commit: each prepared replica broadcasts a commit message. A replica commits when it receives 2f + 1 matching commit messages.

The total message complexity per decision is O(n^2), which limits scalability. HotStuff (2019) reduces this to O(n) using a linear communication pattern with threshold signatures.
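The sizing arithmetic above can be captured in a few lines. A sketch (the function name is mine) that, for a cluster of n = 3f + 1 replicas, derives the tolerated fault count and the two phase thresholds:

```python
def pbft_params(n):
    """PBFT sizing for a cluster of n = 3f + 1 replicas (sketch).

    Returns (f, prepare_threshold, commit_threshold): the protocol
    tolerates f Byzantine replicas; a replica is "prepared" after 2f
    matching prepare messages from others (plus its own), and commits
    after 2f + 1 matching commit messages.
    """
    f = (n - 1) // 3          # largest f with n >= 3f + 1
    return f, 2 * f, 2 * f + 1

print(pbft_params(4))   # (1, 2, 3)
print(pbft_params(7))   # (2, 4, 5)
```

The 2f + 1 commit threshold is a quorum chosen so that any two quorums of size 2f + 1 out of 3f + 1 replicas intersect in at least f + 1 replicas, of which at least one is honest.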

Replicated State Machines

Definition

Replicated State Machine (RSM)

A replicated state machine maintains identical copies of a deterministic state machine on multiple nodes. Consensus ensures all replicas process the same sequence of commands in the same order:

  1. Clients submit commands to the consensus layer.
  2. Consensus assigns a total order to commands (the replicated log).
  3. Each replica applies commands in order, maintaining identical state.

If the state machine is deterministic and all replicas start from the same initial state, they will always be in the same state. This is the standard architecture for fault-tolerant services: etcd (Raft), ZooKeeper (Zab), CockroachDB (Raft), Spanner (Paxos).
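Determinism plus identical ordered logs is the whole trick. A minimal sketch of the replica-side apply loop (the key-value machine and command encoding are mine, standing in for whatever state machine a real service replicates):

```python
class KVStateMachine:
    """Deterministic key-value state machine (sketch): given the same
    ordered log, every replica reaches the same state."""

    def __init__(self):
        self.state = {}

    def apply(self, command):
        op, key, *rest = command
        if op == "set":
            self.state[key] = rest[0]
        elif op == "delete":
            self.state.pop(key, None)

# The consensus layer's job is producing this total order; once it
# exists, each replica just applies the entries in sequence.
log = [("set", "x", 1), ("set", "y", 2), ("delete", "x")]

replicas = [KVStateMachine() for _ in range(3)]
for r in replicas:
    for cmd in log:
        r.apply(cmd)

print(replicas[0].state)                              # {'y': 2}
print(all(r.state == {"y": 2} for r in replicas))     # True
```

Any nondeterminism in apply (wall-clock reads, random numbers, iteration over unordered sets) breaks the equivalence, which is why RSM commands must carry such inputs explicitly in the log entry.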

Relevance to Distributed ML

Consensus protocols do not appear directly in most ML training loops (the performance overhead would be unacceptable). But consensus-adjacent coordination problems arise:

Parameter servers: a centralized server aggregates gradients from workers. This is not consensus (single point of coordination), but fault tolerance of the parameter server requires replication, which requires consensus.

AllReduce: workers collectively sum gradients using ring or tree AllReduce. This is a collective communication primitive, not consensus. But it requires a synchronization barrier: all workers must contribute before proceeding. If a worker crashes, the system must either restart the step or replace the worker, both of which involve coordination.

Federated learning: multiple clients send model updates to a central aggregator. If the aggregator must be fault-tolerant, it needs consensus-based replication. If clients are untrusted (potentially adversarial), the aggregation problem resembles Byzantine agreement.

Checkpoint coordination: during distributed training, a consistent checkpoint requires all workers to pause at the same logical step. This is a variant of the barrier synchronization problem, related to consensus.

Common Confusions

Watch Out

FLP does not mean consensus is impossible in practice

FLP proves impossibility for deterministic protocols in purely asynchronous systems. Real systems are partially synchronous (messages usually arrive on time), and practical protocols use timeouts and leader election to ensure progress. FLP is a theoretical boundary, not a practical prohibition. Paxos and Raft work reliably in production despite FLP.

Watch Out

Crash fault tolerance vs Byzantine fault tolerance

Crash-tolerant protocols (Paxos, Raft) need n ≥ 2f + 1 nodes and are simpler and faster. Byzantine-tolerant protocols need n ≥ 3f + 1 nodes and have higher message complexity. Most datacenter systems use crash-tolerant protocols because nodes are trusted. Byzantine tolerance is needed only when nodes may be malicious (blockchains, cross-organizational systems).
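The two bounds invert into simple capacity formulas, which make the cost difference concrete (function names are mine):

```python
def crash_faults_tolerated(n):
    # Majority-quorum protocols (Paxos, Raft) require n >= 2f + 1.
    return (n - 1) // 2

def byzantine_faults_tolerated(n):
    # BFT protocols (PBFT, HotStuff) require n >= 3f + 1.
    return (n - 1) // 3

for n in (3, 5, 7):
    print(n, crash_faults_tolerated(n), byzantine_faults_tolerated(n))
# 3 1 0
# 5 2 1
# 7 3 2
```

The table shows why 3- and 5-node clusters are the common datacenter deployments: a 3-node crash-tolerant cluster survives one failure, but Byzantine tolerance of even a single fault already demands 4 nodes.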

Watch Out

Consensus is not the same as leader election

Leader election is a subproblem used by many consensus protocols, but consensus and leader election are distinct problems. Consensus requires agreement on a value. Leader election requires agreement on a coordinator. In Raft, leader election is a mechanism to simplify consensus, not the consensus itself.

Exercises

ExerciseCore

Problem

A Paxos system has 5 acceptors. What is the minimum quorum size? If 2 acceptors crash, can the system still make progress? What about 3 crashes?

ExerciseCore

Problem

Explain why the FLP impossibility result does not apply to Raft in practice. What assumption does Raft make that FLP does not allow?

ExerciseAdvanced

Problem

A PBFT system with n = 7 nodes can tolerate how many Byzantine faults? During the prepare phase, how many matching prepare messages must a replica collect (excluding its own) before it is considered "prepared"?

ExerciseAdvanced

Problem

Prove that any two majority quorums of nn processes overlap by at least one process. Then explain why this property is necessary for Paxos safety.

References

Canonical:

  • Fischer, Lynch & Paterson, "Impossibility of Distributed Consensus with One Faulty Process" (JACM 1985)
  • Lamport, "The Part-Time Parliament" (ACM TOCS 1998), Paxos protocol
  • Lamport, Shostak & Pease, "The Byzantine Generals Problem" (ACM TOPLAS 1982)

Current:

  • Ongaro & Ousterhout, "In Search of an Understandable Consensus Algorithm" (USENIX ATC 2014), Raft
  • Castro & Liskov, "Practical Byzantine Fault Tolerance" (OSDI 1999), PBFT
  • Yin et al., "HotStuff: BFT Consensus with Linearity and Responsiveness" (PODC 2019)
  • Cachin, Guerraoui & Rodrigues, Introduction to Reliable and Secure Distributed Programming (2nd ed., 2011), Chapters 5-6

Next Topics

  • CAP theorem: the fundamental tradeoff between consistency, availability, and partition tolerance in distributed systems

Last reviewed: April 2026
