Raft Consensus Algorithm — Leader Election and Log Replication
How does a cluster of servers agree on the same sequence of operations when some servers may crash or messages may be delayed? This is the fundamental problem of distributed consensus — and it is surprisingly hard. Raft, designed by Diego Ongaro and John Ousterhout in 2014 as an alternative to Paxos, achieves strong consistency in a way that is explicitly designed to be understandable. It powers etcd (Kubernetes), CockroachDB, TiKV, and many other production distributed systems.
1. The Consensus Problem
A distributed system is consistent if all non-faulty nodes agree on the same sequence of committed values (a replicated log). Clients submit commands; the system must ensure:
- Safety: committed entries are never lost or changed, and all nodes that commit apply the same entry at each log index.
- Liveness: the system continues to make progress (commit new entries) as long as a majority of nodes are reachable and operational.
The FLP impossibility theorem (Fischer, Lynch, Paterson, 1985) proves that in a fully asynchronous system, no deterministic algorithm can guarantee both safety and liveness when even one node may fail. Practical systems sidestep this by using timeouts (weak synchrony assumption) — Raft is no exception.
2. Server Roles and Terms
At any time, each Raft server is in one of three states:
- Leader: accepts client requests, replicates log entries to followers, sends heartbeats. At most one leader per term.
- Follower: passive; responds to RPCs from the leader and candidates. If no heartbeat received within election timeout, converts to candidate.
- Candidate: trying to become leader; sends RequestVote RPCs to all other servers.
Time is divided into terms — monotonically increasing integers. Each term begins with an election. If a candidate wins, it serves as leader for the rest of the term. If no candidate wins (e.g., split vote), the term ends with no leader and a new term begins.
3. Leader Election
When a follower's election timeout expires (no
heartbeat received), it transitions to candidate, increments its
current term, votes for itself, and sends
RequestVote RPCs to all other servers:
// RequestVote RPC arguments
type RequestVoteArgs struct {
Term int // candidate's term
CandidateId int // candidate requesting vote
LastLogIndex int // index of candidate's last log entry
LastLogTerm int // term of candidate's last log entry
}
// Voter grants vote if:
// 1. haven't voted in this term yet (or voted for this candidate)
// 2. candidate's log is at least as up-to-date as receiver's log
func handleRequestVote(args RequestVoteArgs) bool {
if args.Term < currentTerm { return false }
logUpToDate := args.LastLogTerm > lastLogTerm ||
(args.LastLogTerm == lastLogTerm && args.LastLogIndex >= lastLogIndex)
if (votedFor == -1 || votedFor == args.CandidateId) && logUpToDate {
votedFor = args.CandidateId
return true
}
return false
}
If the candidate receives votes from a majority of servers, it becomes
the new leader and immediately sends heartbeat
AppendEntries RPCs to suppress further elections.
The election timeout is randomized (typically 150–300 ms) to avoid split votes where many candidates start simultaneously. If a split vote occurs, all candidates time out and start a new election.
4. Log Replication
Once elected, the leader accepts client commands, appends them to its
log, and replicates them to followers via
AppendEntries RPCs:
- Client sends command to leader.
- Leader appends entry to local log with current term number.
-
Leader sends
AppendEntriesto all followers in parallel (also serves as heartbeat when empty). - Once a majority of servers have written the entry to their logs, the leader commits the entry — applies it to its state machine and responds to the client.
-
Leader includes
commitIndexin subsequentAppendEntries; followers commit up to that index.
5. Safety Guarantees
Raft's correctness rests on the Leader Completeness Property: if a log entry is committed in a given term, that entry will be present in the logs of all leaders for all higher-numbered terms.
This is ensured by the election restriction: a candidate cannot win an election unless its log is at least as up-to-date as any other server in the majority that votes for it. "Up-to-date" is defined by comparing (lastLogTerm, lastLogIndex) — higher term wins; if equal, longer log wins.
Raft also prevents a subtle bug: a leader cannot commit entries from previous terms by counting replicas — it can only commit them by committing a new entry from its own term (which causes the older entries to be committed implicitly via the Log Matching Property).
6. Cluster Membership Changes
Adding or removing servers requires care — a naive instantaneous switch could create two majorities simultaneously. Raft originally proposed a joint consensus approach transitioning through a combined old+new configuration. Modern implementations typically use the simpler single-server membership change protocol:
- One server at a time is added or removed.
- Because the old and new configurations overlap in majority, split-brain is impossible.
- The membership change is appended as a special log entry; the new config takes effect when committed.
Newly added servers start as non-voting learners that replicate the log without participating in elections or counting toward quorum — preventing them from delaying elections while catching up on potentially petabytes of historical log.
7. Comparison with Paxos
Raft and multi-Paxos solve the same problem and have similar performance characteristics. Raft's main advantage is specification clarity — the original Ongaro & Ousterhout paper includes a formal proof of correctness and a usability study showing students understood Raft better than Paxos after equivalent study time. This understandability translates to fewer bugs in production implementations.
The etcd key-value store uses Raft and forms the coordination heart of Kubernetes — every time a pod is scheduled, a node is drained, or a ConfigMap is updated, Raft consensus ensures the change is safely persisted across all etcd replicas before the API server acknowledges success.