Notes from Distributed Systems. Split into two sections: Kleppmann lecture notes covering the core theory, and paper reading notes covering the seminal systems papers.
Kleppmann Notes
1.1 Introduction to Distributed Systems
No shared address space. Multiple devices communicate over network for one objective.
Why distribute?
- Inherently distributed apps (e.g. messaging across the world)
- Better reliability — survive node failures
- Better perf — locality; solve bigger problems
1.2 Network Abstractions
Model: node i → message m → node j. Real nets use different physical media.
- Latency (seconds) & Bandwidth (bits/s) are the key properties
- Observe at message level (Charles) or packet level (Wireshark)
1.3 Remote Procedure Calls (RPC)
RPC = local function call that runs on a remote machine. Goal: location transparency.
Flow
- Client calls stub → stub marshals args into msg m1 → sent to server
- Server unmarshals → runs real fn → marshals result → sends m2 back
- Client stub unmarshals m2 → returns value as if local
In practice
- Client–server: browser fetching web page
- Microservices / SOA: large app split into services on many nodes
- IDL (Interface Def. Lang) = language-agnostic type signatures → stubs for any lang
- RPC enables interoperability across languages
2.1 Two Generals Problem
Two armies coordinate attack via messengers that may be captured. No guaranteed agreement over unreliable channel.
- Online shops work because credit-card charges are revocable (escape hatch), not because problem is solved
2.2 Byzantine Generals
n generals, up to f traitors. Traitors send contradictory messages. Messages reliable; senders may lie.
- Honest generals don't know who is a traitor; traitors know each other
- Theorem: need ≥ 3f + 1 generals total to tolerate f traitors
2.3 System Models
Network
- Reliable — every message sent is eventually delivered, possibly reordered (not realistic)
- Fair-loss — may be lost, retry eventually gets through
- Arbitrary — loss, duplicates, reorder, corruption; partitions possible
Node failure
- Crash-stop — crashes, stays silent forever
- Crash-recovery — crashes but may restart; in-memory state lost
- Byzantine — any deviation from algorithm (hardest)
Timing
- Synchronous — known bounds on delay & processing
- Partially synchronous — bounds exist but unknown, or sync after GST
- Asynchronous — no timing guarantees
Note: Pure synchrony unrealistic. Most systems assume partial synchrony.
2.4 Fault Tolerance
- Availability = uptime = fraction of time service runs correctly
- SLO = availability goal; SLA = contract specifying SLO
- Fault = one component misbehaves; Failure = whole system broken
Failure detectors
- Perfect FD — never wrong. Only in synchronous systems
- Via timeout — can't distinguish: crashed / slow / lost msg / delayed
- Eventually perfect FD — may be wrong briefly, eventually correct. Works in partial sync
3.1 Physical Time & 3.2 Clock Sync
- Physical clock = counts real seconds (drifts without correction)
- Logical clock = counts events (not real seconds)
- NTP — periodic sync over the network (millisecond-level accuracy); PTP — hardware timestamping, much higher precision (often GPS-disciplined)
- Monotone clock — only moves forward; safe for elapsed time
- Time-of-day clock — can jump backward; not safe for intervals
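The monotonic vs time-of-day distinction shows up directly in Python's stdlib; a minimal sketch:

```python
import time

# Time-of-day clock (time.time) can jump backward after an NTP step,
# so differences of it are not safe durations.
# Monotonic clock (time.monotonic) only moves forward: safe for elapsed time.
t0 = time.monotonic()
time.sleep(0.01)                      # simulate some work
elapsed = time.monotonic() - t0
assert elapsed >= 0.01                # monotonic difference is a valid duration
```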
3.3 / 4.1 Ordering & Logical Time
'Happens before' (→) lets us order events without perfect clocks.
Lamport clocks
- Increment counter L on every event
- Attach L to every message sent
- On receive: L = max(own L, received L) + 1
- If a→b then L(a) < L(b). Converse NOT guaranteed.
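The three rules above can be sketched as a tiny class (names are illustrative, not from the lectures):

```python
class LamportClock:
    """Minimal Lamport clock sketch."""
    def __init__(self):
        self.t = 0

    def local_event(self):
        self.t += 1                   # increment on every event
        return self.t

    def send(self):
        self.t += 1                   # sending is itself an event
        return self.t                 # timestamp attached to the message

    def receive(self, msg_t):
        self.t = max(self.t, msg_t) + 1   # take the max, then increment
        return self.t

a, b = LamportClock(), LamportClock()
m = a.send()                          # a: t = 1
t = b.receive(m)                      # b: t = max(0, 1) + 1 = 2
assert m < t                          # a→b  ⇒  L(a) < L(b)
```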
Vector clocks
- Each node keeps vector VC[n] (one entry per node)
- On send: increment own entry, attach whole vector
- On receive: element-wise max, then increment own entry
- VC(a) < VC(b) iff a causally happened before b — captures causality exactly (incomparable vectors ⇒ concurrent events; it is a partial order)
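The merge and comparison rules can be sketched in a few lines (helper names are illustrative):

```python
def vc_merge(own, received, i):
    """On receive at node i: element-wise max, then bump own entry."""
    merged = [max(a, b) for a, b in zip(own, received)]
    merged[i] += 1
    return merged

def vc_leq(a, b):
    """a causally precedes-or-equals b iff every entry of a is <= b's."""
    return all(x <= y for x, y in zip(a, b))

va = [2, 0, 0]                        # node 0 did two events
vb = [0, 1, 0]                        # node 1 did one event, saw nothing of node 0
assert not vc_leq(va, vb) and not vc_leq(vb, va)   # incomparable ⇒ concurrent
assert vc_merge(vb, va, 1) == [2, 2, 0]            # receive at node 1
```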
4.2 Broadcast Protocols
Broadcast = message to a group. Receiving ≠ delivering (hold until ordering satisfied).
Hierarchy (weakest → strongest)
- Best-effort — send, no retry on failure
- Reliable — correct nodes eventually deliver if sender correct
- FIFO — same-sender messages in send order
- Causal — if A→B causally, all nodes deliver A before B
- Total order — all nodes deliver all msgs in same order
- FIFO-total order — FIFO + total order. Strongest; used for SMR
4.3 Algorithms
- Eager reliable broadcast — rebroadcast on receipt → O(n²) msgs
- Gossip — forward to random subset. Probabilistic, efficient
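A single gossip round can be sketched as follows (fanout and set-based state are illustrative assumptions):

```python
import random

def gossip_round(nodes, infected, fanout=2, rng=random):
    """One round: each node holding the message forwards it to
    `fanout` random peers. Spread is probabilistic but fast."""
    newly = set()
    for n in infected:
        peers = [p for p in nodes if p != n]
        for peer in rng.sample(peers, fanout):
            newly.add(peer)
    return infected | newly

nodes = list(range(10))
state = {0}                           # node 0 broadcasts
for _ in range(4):
    state = gossip_round(nodes, state)
# after a few rounds, most nodes have the message with high probability
```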
5.1 Replication
Store same data on multiple nodes for availability & fault tolerance.
Idempotence
- f(f(x)) = f(x) — safe to retry. But f(g(f(x))) ≠ g(f(x)) in general
Retry semantics
- At-most-once — no retry (may lose op)
- At-least-once — retry until ACK (may apply twice)
- Exactly-once — retry + server deduplication (hardest)
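Exactly-once as "at-least-once retries + server-side dedup" can be sketched like this (request IDs and the in-memory dedup table are illustrative assumptions):

```python
class Server:
    """Sketch: deduplicate retried requests so each op applies once."""
    def __init__(self):
        self.balance = 0
        self.seen = {}                       # request_id -> cached reply

    def deposit(self, request_id, amount):
        if request_id in self.seen:          # duplicate retry: replay reply
            return self.seen[request_id]
        self.balance += amount               # apply the operation once
        self.seen[request_id] = self.balance
        return self.balance

s = Server()
s.deposit("req-1", 100)
s.deposit("req-1", 100)                      # client retried after a lost ACK
assert s.balance == 100                      # applied exactly once
```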
Concurrent writes
- Timestamps + tombstones — deleted items leave tombstone
- Last Writer Wins (LWW) — highest timestamp survives
- Multi-value register — keep all versions, app resolves conflict
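An LWW merge is a one-liner once each write carries a (timestamp, node_id, value) triple; the node id breaks timestamp ties deterministically (this representation is an assumption, not from the lectures):

```python
def lww_merge(a, b):
    """Last Writer Wins: keep the write with the highest
    (timestamp, node_id) — ties broken by node id."""
    return max(a, b, key=lambda w: (w[0], w[1]))

v1 = (5, "nodeA", "hello")
v2 = (7, "nodeB", "world")
assert lww_merge(v1, v2)[2] == "world"   # higher timestamp survives
```

Note the cost: the losing concurrent write is silently discarded, which is why multi-value registers exist.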
5.2 Quorums
Quorum = min nodes that must respond.
- Write quorum W: must write to ≥ W replicas
- Read quorum R: must read from ≥ R replicas
- R + W > N → overlap → read-after-write consistency
- Read repair: client finds stale data → writes fresh data back to stale replicas
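The overlap argument can be checked by brute force: with R + W > N, every read quorum intersects every write quorum, so some replica in the read set has the latest write.

```python
from itertools import combinations

def quorums_always_overlap(n, r, w):
    """True iff every r-subset of n nodes intersects every w-subset."""
    nodes = range(n)
    return all(set(rq) & set(wq)
               for rq in combinations(nodes, r)
               for wq in combinations(nodes, w))

assert quorums_always_overlap(3, 2, 2)        # R+W=4 > N=3: always overlap
assert not quorums_always_overlap(3, 1, 2)    # R+W=3 = N: disjoint quorums exist
```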
5.3 State Machine Replication
Same initial state + same ops in same order → same state everywhere.
- Use FIFO-total order broadcast to deliver every update to all replicas
- Leader-based: single leader sequences & broadcasts writes to followers
Weaker broadcast tradeoffs
- Causal OK if concurrent updates commute + deterministic
- Reliable OK if all updates commute + deterministic
- Best-effort OK if also idempotent + tolerate message loss
6.1 Consensus
Several nodes agree on one value. Decision is permanent once made.
Why not just single leader?
- Easy for ordering, but leader crash requires consensus to safely pick new leader
- Split-brain: two nodes both think they're leader → data corruption
Common algorithms
- Paxos — single-value consensus
- Multi-Paxos — generalization to replicated log (TOB)
- Raft — same power as Multi-Paxos, designed for understandability
- ZAB, Viewstamped Replication — FIFO-TOB, used in ZooKeeper
- System model: crash-recovery + partial synchrony. Impossible in async (FLP)
6.2 Raft — Leader Election
Node states
- Follower — default; waits for heartbeats
- Candidate — didn't hear from leader, starts election
- Leader — elected; handles all writes, replicates log
Election flow
- Follower timeout → Candidate, increments term, requests votes
- Each node votes at most once per term
- Quorum of votes → Leader. Split vote → new election with higher term
- Each log entry = (msg, term). Multiple leaders OK only if from different terms
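The "at most one vote per term" rule can be sketched as a vote handler; this omits the log up-to-dateness check from the full protocol, and the dict-based state is an illustrative assumption:

```python
def handle_request_vote(state, candidate_term, candidate_id):
    """Grant a vote iff the candidate's term is current and we have not
    already voted for someone else in that term."""
    if candidate_term < state["current_term"]:
        return False                          # stale candidate
    if candidate_term > state["current_term"]:
        state["current_term"] = candidate_term
        state["voted_for"] = None             # new term: vote not yet cast
    if state["voted_for"] in (None, candidate_id):
        state["voted_for"] = candidate_id
        return True
    return False

s = {"current_term": 1, "voted_for": None}
assert handle_request_vote(s, 2, "A")         # vote granted in new term 2
assert not handle_request_vote(s, 2, "B")     # already voted for A in term 2
```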
7.1 Two-Phase Commit (2PC)
Atomic commit across nodes: all commit or all abort. Any node can veto (unlike consensus).
Protocol
- Phase 1 (Prepare): coord sends Prepare(T) → each node votes Yes (locks) or No (aborts)
- Phase 2 (Commit): all Yes → Commit(T); any No → Abort(T)
Note: Fault-tolerant 2PC via consensus is Kleppmann's own variant. 3PC / Tx Commit are different.
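The two phases (and the any-node-can-veto property) can be sketched as follows; the prepare/commit/abort method names are illustrative:

```python
class Participant:
    def __init__(self, vote):
        self.vote, self.state = vote, "init"
    def prepare(self): return self.vote       # Yes: would lock resources here
    def commit(self):  self.state = "committed"
    def abort(self):   self.state = "aborted"

def two_phase_commit(participants):
    """Phase 1: collect votes. Phase 2: commit iff unanimous Yes."""
    votes = [p.prepare() for p in participants]
    if all(votes):
        for p in participants: p.commit()
        return "commit"
    for p in participants: p.abort()          # a single No vetoes
    return "abort"

assert two_phase_commit([Participant(True), Participant(True)]) == "commit"
assert two_phase_commit([Participant(True), Participant(False)]) == "abort"
```

The blocking problem is invisible in this sketch: if the coordinator crashes between the phases, real participants sit holding their locks.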
7.2 Linearizability
Operations behave as if executed atomically at single point in time on single copy of data.
- Every read sees the most recent write
- Quorum reads + writes alone do NOT guarantee linearizability
ABD algorithm
- After reading from quorum, write value back to quorum before returning ('read repair')
- This extra write step makes quorum ops linearizable
- Linearizable CAS in distributed system requires total-order broadcast
7.3 Eventual Consistency
Linearizability is costly. Eventual consistency trades guarantees for perf.
Eventual consistency
- All non-faulty replicas eventually deliver same updates (no time guarantee)
Strong Eventual Consistency (SEC)
- Eventual delivery + convergence: same updates → same state even if in different orders
- No coordination needed; uses causal broadcast or weaker
- Concurrent updates must have a merge rule
CAP Theorem
- Strongly consistent (linearizable) OR available during network partition — not both
- Choose availability → must eventually reconcile diverged state
Summary: Consistency Models
| Problem | Wait for | Synchrony |
|---|---|---|
| Atomic Commit | All participating nodes | Partial sync |
| Consensus / TOB / Lin. CAS | Quorum | Partial sync |
| Linearizable get/set | Quorum | Async |
| Eventual / Causal / FIFO | None (local state) | Async |
Paper Notes
RPC — Birrell & Nelson 1984
Call a function on another machine as if local (location transparency). Stack: user → user-stub (marshals args) → RPCRuntime → server-stub (unmarshals) → server. Reply travels back in reverse. Key choices: binding via Grapevine name service; at-most-once semantics via dedup table; no shared address space. Perf: ≈1100µs round-trip vs 9µs local — network dominates, not stub overhead. Trap: remote call can silently fail; must handle crash/timeout explicitly, unlike local calls.
Raft — Ongaro & Ousterhout 2014
Consensus / replicated log designed for understandability; equivalent power to Multi-Paxos. 3 roles: Leader (1 per term, all writes), Follower (replicates), Candidate (election). Election: follower times out → increments term, sends RequestVote; wins with majority if log is up-to-date. Randomised timeouts break splits. Log replication: leader appends, sends AppendEntries, commits once majority ACK. Entries flow leader → followers only. Safety: candidate must hold all committed entries to win ⇒ committed entries survive leader change. Invariant: same (index, term) ⇒ identical logs; committed entries never overwritten.
FLP — Fischer, Lynch, Paterson 1985
Theorem: No deterministic consensus protocol can guarantee termination in a fully asynchronous system with even one crash-stop failure. Why: Can't distinguish dead from very slow. Every protocol has a "bivalent" (undecided) config; adversary delays one message to keep it there forever. Does NOT say consensus is impossible in practice — only that guaranteed termination requires timing assumptions. Escapes: partial synchrony + timeouts; randomisation (prob-1 termination); failure detectors.
Partial Synchrony — Dwork, Lynch, Stockmeyer 1988
Timing bounds exist but are unknown (or only hold after Global Stabilisation Time, GST). Result: consensus is solvable under partial synchrony. Safety never requires timing; liveness needs timeouts that fire correctly after GST. Paxos/Raft live here: always safe, just possibly slow during async periods.
Paper Trail + Sergey Notes on FLP
Paper Trail: every initial config is bivalent; every bivalent config has a bivalent successor reachable by delaying one process ⇒ adversary keeps system undecided indefinitely. Sergey: FLP attacks progress not correctness. Protocol may be stuck but never wrong. Escape via clocks/timeouts (partial synchrony).
Failure Detectors — Chandra & Toueg 1996
Oracle that suspects crashed processes (allowed to be wrong). Completeness: every crashed process eventually suspected. Accuracy: degree to which live processes avoid false suspicion. Classes: ◇W (eventually weak) is the minimal power needed to solve consensus in async model. Paxos/Raft timeouts implement ◇P (eventually perfect).
Chandy-Lamport Snapshot — 1985
Take a consistent global checkpoint of a live system without stopping it. Algorithm: initiator records its state and sends a marker on all outgoing channels. On first marker received (say on channel c): record local state, record c as empty, send markers on all outgoing channels; keep recording messages arriving on the other incoming channels until their markers arrive. Result: consistent cut — no recorded receipt predates its recorded send. Uses: deadlock detection, checkpointing, GC.
State Machine Replication — Schneider 1990
Same initial state + same commands in same order ⇒ identical replicas. Requirements: (1) same requests to all; (2) same total order (= total-order broadcast); (3) deterministic execution. Fail-stop needs f+1 replicas; Byzantine needs 3f+1. SMR ≡ consensus ≡ TOB (inter-reducible).
Chain Replication — van Renesse & Schneider 2004
Servers in a chain. Head takes writes; each node forwards; Tail commits and serves reads. Linearizability: reads from Tail always see latest committed write. Failure: separate master removes failed node, links predecessor to successor. Benefit: write/read load separated; simpler linearizability than Paxos for primary-backup.
Part-Time Parliament (Paxos) — Lamport 1998
Phase 1 (Prepare/Promise): proposer picks ballot b, asks acceptors to promise not to accept b′ < b and to report any already-accepted value. Phase 2 (Accept/Accepted): proposer sends value (highest already-accepted if any); committed once majority accepts. Multi-Paxos: stable leader skips Phase 1 for subsequent slots; gaps filled with no-ops. Safety: two majorities always overlap, preventing two decisions. Liveness: needs stable leader; dueling proposers can stall.
Paxos Made Moderately Complex (PMMC)
Multi-Paxos roles: Replicas (hold log), Leaders (scout → Phase1, commander → Phase2), Acceptors (vote). Scout → p1a → quorum p1b → adopted. Commander → p2a → quorum p2b → decision. Higher ballot preempts → restart scout.
ZooKeeper / ZAB
ZooKeeper: coordination service — znodes (tree), watches, ephemeral nodes. Linearizable writes; FIFO client-order reads. ZAB: FIFO-total-order broadcast. Leader proposes with epoch+zxid; followers ACK; leader commits. Unlike Paxos: FIFO per-sender ordering is a hard guarantee; epoch = term. Recovery: new leader re-proposes all ACKed-but-uncommitted ops before taking new requests.
Byzantine Generals — Lamport, Shostak, Pease 1982
n nodes, ≤ f Byzantine (arbitrary/lying). Loyal nodes must agree. Oral msgs: solvable iff n ≥ 3f+1. Signed msgs: solvable for any n > f. Intuition: without signatures, one traitor can relay conflicting stories to two loyals who can't detect forgery.
PBFT — Castro & Liskov 1999
First practical BFT for async networks; n ≥ 3f+1. 3 phases: Pre-prepare (primary assigns seq#), Prepare (wait 2f matching), Commit (wait 2f+1, execute). View change: replicas timeout primary → elect next, recover uncommitted ops. MACs in normal path (not public-key). Only 3% slower than unreplicated NFS.
Consistent Hashing — Karger et al. 1997
Problem: standard modular hashing remaps ≈all keys when a server joins/leaves. Fix: hash servers and keys onto a ring; key → first server clockwise. Adding/removing 1 server moves only O(K/n) keys. Virtual nodes balance load. Used in: Akamai CDN, Dynamo, Cassandra.
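The ring with clockwise lookup fits in a few lines; md5 as the hash and the lack of virtual nodes are simplifying assumptions:

```python
import bisect, hashlib

def h(key):
    """Hash a string to a point on the ring."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class Ring:
    """Consistent-hash ring sketch: key -> first server clockwise."""
    def __init__(self, servers):
        self.points = sorted((h(s), s) for s in servers)

    def lookup(self, key):
        ks = [p for p, _ in self.points]
        i = bisect.bisect(ks, h(key)) % len(self.points)  # wrap around the ring
        return self.points[i][1]

ring = Ring(["s1", "s2", "s3"])
assert ring.lookup("user:42") in {"s1", "s2", "s3"}
```

Adding a server inserts one point, so only the keys between it and its predecessor move — the O(K/n) property.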
Chord — Stoica et al. 2001
P2P lookup: key → responsible node in O(log N) hops via finger table (O(log N) pointers at exponentially increasing ring distances). Successor pointer alone guarantees correctness (slow); fingers give speed. Join/leave: O(log² N) msgs.
Bayou — Terry et al. 1995
Weakly-connected replicated storage; writes accepted offline. Tentative writes committed in global order by primary. Conflicts resolved per-write via dependency check + merge procedure. On sync: undo tentative log, apply committed, replay tentative. App-specific conflict resolution; eventual consistency with causal ordering.
Dynamo — DeCandia et al. 2007
Highly available KV store; AP (trades consistency for availability). Consistent hashing with virtual nodes. Quorums: R+W > N; typical N=3, R=W=2. Conflicts: vector clocks track causality; divergent versions sent to client for app-level merge. Gossip for membership/failure detection. Merkle trees for anti-entropy sync.
Session Guarantees — Terry et al. 1994
4 guarantees weaker than linearizability, stronger than nothing, for eventual-consistency stores: Read Your Writes; Monotonic Reads (no going back in time); Writes Follow Reads (causal order); Monotonic Writes (session order). Implemented via per-session version vectors; replica must be at least as fresh as session's last-seen state.
Linearizability — Herlihy & Wing 1990
Strongest correctness condition for concurrent objects. Every op appears to take effect atomically at a single point (linearisation point) between its invocation and response. Local: system is linearizable iff each object is linearizable independently. Not serializability: linearizability is per-op/per-object; serializable is per-transaction. Requires global coordination; CAP-incompatible with availability.
ABD / Shared Memory in Message Passing — Attiya et al. 1995
Simulates a linearizable register with quorums. Write: broadcast (ts, val) to all; wait for majority ACK. Read: query majority → find highest (ts, val) → write it back to majority (read-repair) → return val. Write-back step is what gives linearizability; without it, stale reads are possible.
2PC — Gray 1981
All nodes commit or all abort (atomic commit). Phase 1: coordinator sends Prepare; participants vote Yes (lock) or No. Phase 2: all Yes → Commit; any No → Abort. Blocking: coordinator crash after Yes votes but before Commit ⇒ participants hold locks indefinitely. Not fault-tolerant.
3PC — Skeen 1983
Breaks 2PC blocking by adding Pre-commit phase: Prepare → Pre-commit → Commit. Pre-commit signals "all voted Yes"; a node receiving it can commit or abort safely even if coordinator crashes. Caveat: fails under network partitions (split-brain). Rarely used; Paxos-based 2PC preferred.
Consensus on Transaction Commit — Lamport & Gray 2006
Replace 2PC coordinator with a Paxos group. Each participant's Yes/No vote is logged via Paxos; any node can reconstruct the outcome. Non-blocking atomic commit tolerating f failures with 2f+1 Paxos acceptors.
Percolator — Peng & Dabek 2010
Incremental processing on Bigtable; powers Google web indexing. Transactions: snapshot isolation via MVCC timestamps. Reads use start-ts snapshot; commit assigns commit-ts. 2PC over Bigtable: locks are Bigtable cells; primary lock committed first, then secondaries. Observers: trigger on column change, replace MapReduce for incremental work. Result: document age in index dropped ∼50%.
Coda — Satyanarayanan et al.
DFS for mobile/disconnected clients (CMU). Disconnected op: Venus cache on client; reads/writes hit cache offline; reintegration replays writes on reconnect. Version vectors detect conflicts; flagged to user. Hoarding: pre-fetch predicted-needed files before disconnect.
NFS — Sun 1985
First widely-deployed DFS. Stateless server (each RPC self-contained with file handle + offset) ⇒ crash recovery = client retry. Close-to-open consistency: flush on close, check freshness on open. Not linearizable. Lesson: statelessness and simplicity beat strong consistency for adoption.
MapReduce — Dean & Ghemawat 2004
Parallel batch processing model: map(k,v)→[(k',v')] then reduce(k',[v'])→[v'']. Framework handles partitioning, scheduling, re-execution. Straggler mitigation via backup tasks. Fault tolerance: re-run map on worker failure (local disk output); reduce output goes to GFS.
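The programming model (minus the distribution) fits in a few lines; here the shuffle is just an in-memory dict, which stands in for the framework's partitioning step:

```python
from collections import defaultdict

def map_fn(doc):                      # map(k, v) -> [(k', v')]
    return [(word, 1) for word in doc.split()]

def reduce_fn(word, counts):          # reduce(k', [v']) -> v''
    return sum(counts)

def mapreduce(docs):
    groups = defaultdict(list)        # stands in for shuffle/partition
    for doc in docs:
        for k, v in map_fn(doc):
            groups[k].append(v)
    return {k: reduce_fn(k, vs) for k, vs in groups.items()}

out = mapreduce(["a b a", "b c"])
assert out == {"a": 2, "b": 2, "c": 1}
```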
GFS — Ghemawat et al. 2003
Distributed FS for large sequential reads and append-mostly workloads. Single Master (metadata only; in-memory, WAL) + Chunkservers (64 MB chunks, 3× replicated). Client contacts master once for chunk locations, then writes directly to chunkservers (master off data path). Consistency: relaxed — concurrent appends may produce duplicates/padding; apps must be tolerant.
Bigtable — Chang et al. 2006
Sparse, distributed, persistent sorted map: (row, column family:qualifier, timestamp) → bytes. Rows lexicographically sorted; row ranges = tablets assigned to Tablet Servers by Master. Chubby (Paxos lock service) for master election and root tablet location. Write path: Memtable (in-RAM) + WAL → compacted to SSTable files on GFS.
Spanner — Corbett et al. 2012
Google's globally-distributed, strongly consistent relational DB. TrueTime: TT.now() returns interval [earliest, latest] bounding true wall-clock time (GPS + atomic clocks per datacenter, drift-bounded). External consistency: T1 commits before T2 starts ⇒ ts(T1) < ts(T2). Enforced by commit-wait: delay commit until TT.after(commit_ts). Enables lock-free read-only transactions and globally consistent snapshots. Paxos groups per shard; 2PC across shards.