Distributed Systems · Algorithms · Computer Science
📅 Квітень 2026 ⏱ ≈ 14 хв читання 🎯 Intermediate–Advanced

Raft Consensus Algorithm — Leader Election and Log Replication

How does a cluster of servers agree on the same sequence of operations when some servers may crash or messages may be delayed? This is the fundamental problem of distributed consensus — and it is surprisingly hard. Raft, designed by Diego Ongaro and John Ousterhout in 2014 as an alternative to Paxos, achieves strong consistency in a way that is explicitly designed to be understandable. It powers etcd (Kubernetes), CockroachDB, TiKV, and many other production distributed systems.

1. The Consensus Problem

A distributed system is consistent if all non-faulty nodes agree on the same sequence of committed values (a replicated log). Clients submit commands; the system must ensure:

The FLP impossibility theorem (Fischer, Lynch, Paterson, 1985) proves that in a fully asynchronous system, no deterministic algorithm can guarantee both safety and liveness when even one node may fail. Practical systems sidestep this by using timeouts (weak synchrony assumption) — Raft is no exception.

Quorum (majority) requirement: with N servers, Raft can tolerate ⌊(N-1)/2⌋ simultaneous failures. A cluster of 3 tolerates 1 failure; 5 servers tolerate 2. Progress requires a majority (N/2 + 1) of votes or acknowledgements. This prevents "split-brain" — two groups simultaneously electing different leaders.

2. Server Roles and Terms

At any time, each Raft server is in one of three states:

Time is divided into terms — monotonically increasing integers. Each term begins with an election. If a candidate wins, it serves as leader for the rest of the term. If no candidate wins (e.g., split vote), the term ends with no leader and a new term begins.

Terms serve as logical clocks: - Every RPC includes the sender's current term. - If a server receives a message with term T > its own term → update term to T, convert to follower if needed. - If a server receives a message with term T < its own term → reject the message. This ensures stale leaders immediately step down when a new term is detected.

3. Leader Election

When a follower's election timeout expires (no heartbeat received), it transitions to candidate, increments its current term, votes for itself, and sends RequestVote RPCs to all other servers:

// RequestVote RPC arguments
type RequestVoteArgs struct {
  Term         int    // candidate's term
  CandidateId  int    // candidate requesting vote
  LastLogIndex int    // index of candidate's last log entry
  LastLogTerm  int    // term of candidate's last log entry
}

// Voter grants vote if:
// 1. haven't voted in this term yet (or voted for this candidate)
// 2. candidate's log is at least as up-to-date as receiver's log
func handleRequestVote(args RequestVoteArgs) bool {
  if args.Term < currentTerm { return false }
  logUpToDate := args.LastLogTerm > lastLogTerm ||
    (args.LastLogTerm == lastLogTerm && args.LastLogIndex >= lastLogIndex)
  if (votedFor == -1 || votedFor == args.CandidateId) && logUpToDate {
    votedFor = args.CandidateId
    return true
  }
  return false
}

If the candidate receives votes from a majority of servers, it becomes the new leader and immediately sends heartbeat AppendEntries RPCs to suppress further elections.

The election timeout is randomized (typically 150–300 ms) to avoid split votes where many candidates start simultaneously. If a split vote occurs, all candidates time out and start a new election.

4. Log Replication

Once elected, the leader accepts client commands, appends them to its log, and replicates them to followers via AppendEntries RPCs:

  1. Client sends command to leader.
  2. Leader appends entry to local log with current term number.
  3. Leader sends AppendEntries to all followers in parallel (also serves as heartbeat when empty).
  4. Once a majority of servers have written the entry to their logs, the leader commits the entry — applies it to its state machine and responds to the client.
  5. Leader includes commitIndex in subsequent AppendEntries; followers commit up to that index.
AppendEntries consistency check: Each AppendEntries RPC includes: - prevLogIndex: index of log entry immediately preceding new entries - prevLogTerm: term of that entry Follower rejects if its log doesn't have an entry at prevLogIndex with prevLogTerm. → Leader retries with decremented prevLogIndex until a match is found. → Once consistent prefix found, follower overwrites any conflicting entries. This guarantees: if two logs have an entry with the same (index, term), all entries with earlier indices are identical — "Log Matching Property."

5. Safety Guarantees

Raft's correctness rests on the Leader Completeness Property: if a log entry is committed in a given term, that entry will be present in the logs of all leaders for all higher-numbered terms.

This is ensured by the election restriction: a candidate cannot win an election unless its log is at least as up-to-date as any other server in the majority that votes for it. "Up-to-date" is defined by comparing (lastLogTerm, lastLogIndex) — higher term wins; if equal, longer log wins.

Why committed entries are safe across leader changes: A committed entry has been written to a majority of servers. Any new leader must receive votes from a majority to win. The two majorities overlap by at least one server — that server has the committed entry and will only vote for a candidate whose log contains it. Therefore the new leader's log already contains all committed entries.

Raft also prevents a subtle bug: a leader cannot commit entries from previous terms by counting replicas — it can only commit them by committing a new entry from its own term (which causes the older entries to be committed implicitly via the Log Matching Property).

6. Cluster Membership Changes

Adding or removing servers requires care — a naive instantaneous switch could create two majorities simultaneously. Raft originally proposed a joint consensus approach transitioning through a combined old+new configuration. Modern implementations typically use the simpler single-server membership change protocol:

Newly added servers start as non-voting learners that replicate the log without participating in elections or counting toward quorum — preventing them from delaying elections while catching up on potentially petabytes of historical log.

7. Comparison with Paxos

Property Paxos (multi-Paxos) Raft ────────────────────────────────────────────────────── Understandability Complex, many variants Explicitly designed to be clear Leader election Implicit (highest ballot) Explicit election phase Log structure Order-independent slots Strictly ordered log Leader transfers Not specified Built-in LeaderTransfer RPC Snapshots Not specified Built-in InstallSnapshot RPC Industrial use Chubby, Spanner, ZAB* etcd, CockroachDB, TiKV, consul * ZAB is Paxos-inspired, not pure Paxos

Raft and multi-Paxos solve the same problem and have similar performance characteristics. Raft's main advantage is specification clarity — the original Ongaro & Ousterhout paper includes a formal proof of correctness and a usability study showing students understood Raft better than Paxos after equivalent study time. This understandability translates to fewer bugs in production implementations.

The etcd key-value store uses Raft and forms the coordination heart of Kubernetes — every time a pod is scheduled, a node is drained, or a ConfigMap is updated, Raft consensus ensures the change is safely persisted across all etcd replicas before the API server acknowledges success.

🔗 Explore Distributed Systems →