Notes from Distributed Systems. Split into two sections: Kleppmann lecture notes covering the core theory, and paper reading notes covering the seminal systems papers.
Kleppmann Notes
1.1 Introduction to Distributed Systems
No shared address space. Multiple devices communicate over network for one objective.
Why distribute?
- Inherently distributed apps (e.g. messaging across the world)
- Better reliability — survive node failures
- Better perf — locality; solve bigger problems
1.2 Network Abstractions
Model: node i → message m → node j. Real nets use different physical media.
- Latency (seconds) & Bandwidth (bits/s) are the key properties
- Observe at message level (Charles) or packet level (Wireshark)
1.3 Remote Procedure Calls (RPC)
RPC = local function call that runs on a remote machine. Goal: location transparency.
Flow
- Client calls stub → stub marshals args into msg m1 → sent to server
- Server unmarshals → runs real fn → marshals result → sends m2 back
- Client stub unmarshals m2 → returns value as if local
In practice
- Client–server: browser fetching web page
- Microservices / SOA: large app split into services on many nodes
- IDL (Interface Def. Lang) = language-agnostic type signatures → stubs for any lang
- RPC enables interoperability across languages
2.1 Two Generals Problem
Two armies coordinate attack via messengers that may be captured. No guaranteed agreement over unreliable channel.
- Online shops work because credit-card charges are revocable (escape hatch), not because problem is solved
2.2 Byzantine Generals
n generals, up to f traitors. Traitors send contradictory messages. Messages reliable; senders may lie.
- Honest generals don't know who is a traitor; traitors know each other
- Theorem: need ≥ 3f + 1 generals total to tolerate f traitors
2.3 System Models
Network
- Reliable — every message sent is eventually delivered, possibly reordered (not realistic)
- Fair-loss — may be lost, retry eventually gets through
- Arbitrary — loss, duplicates, reorder, corruption; partitions possible
Node failure
- Crash-stop — crashes, stays silent forever
- Crash-recovery — crashes but may restart; in-memory state lost
- Byzantine — any deviation from algorithm (hardest)
Timing
- Synchronous — known bounds on delay & processing
- Partially synchronous — bounds exist but unknown, or sync after GST
- Asynchronous — no timing guarantees
Note: Pure synchrony unrealistic. Most systems assume partial synchrony.
2.4 Fault Tolerance
- Availability = uptime = fraction of time service runs correctly
- SLO = availability goal; SLA = contract specifying SLO
- Fault = one component misbehaves; Failure = whole system broken
Failure detectors
- Perfect FD — never wrong. Only in synchronous systems
- Via timeout — can't distinguish: crashed / slow / lost msg / delayed
- Eventually perfect FD — may be wrong briefly, eventually correct. Works in partial sync
3.1 Physical Time & 3.2 Clock Sync
- Physical clock = counts real seconds (drifts without correction)
- Logical clock = counts events (not real seconds)
- NTP — periodic sync over the network (millisecond-level accuracy); PTP — hardware timestamping, much higher precision (often GPS-disciplined)
- Monotone clock — only moves forward; safe for elapsed time
- Time-of-day clock — can jump backward; not safe for intervals
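The monotonic vs time-of-day distinction shows up directly in Python's stdlib; a minimal sketch:

```python
import time

# Time-of-day clock (time.time) can jump backward after an NTP step,
# so differences of it are not safe durations.
# Monotonic clock (time.monotonic) only moves forward: safe for elapsed time.
t0 = time.monotonic()
time.sleep(0.01)                      # simulate some work
elapsed = time.monotonic() - t0
assert elapsed >= 0.01                # monotonic difference is a valid duration
```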
3.3 / 4.1 Ordering & Logical Time
'Happens before' (→) lets us order events without perfect clocks.
Lamport clocks
- Increment counter L on every event
- Attach L to every message sent
- On receive: L = max(own L, received L) + 1
- If a→b then L(a) < L(b). Converse NOT guaranteed.
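The three rules above can be sketched as a tiny class (names are illustrative, not from the lectures):

```python
class LamportClock:
    """Minimal Lamport clock sketch."""
    def __init__(self):
        self.t = 0

    def local_event(self):
        self.t += 1                   # increment on every event
        return self.t

    def send(self):
        self.t += 1                   # sending is itself an event
        return self.t                 # timestamp attached to the message

    def receive(self, msg_t):
        self.t = max(self.t, msg_t) + 1   # take the max, then increment
        return self.t

a, b = LamportClock(), LamportClock()
m = a.send()                          # a: t = 1
t = b.receive(m)                      # b: t = max(0, 1) + 1 = 2
assert m < t                          # a→b  ⇒  L(a) < L(b)
```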
Vector clocks
- Each node keeps vector VC[n] (one entry per node)
- On send: increment own entry, attach whole vector
- On receive: element-wise max, then increment own entry
- VC(a) < VC(b) iff a causally happened before b — captures causality exactly (incomparable vectors ⇒ concurrent events; it is a partial order)
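The merge and comparison rules can be sketched in a few lines (helper names are illustrative):

```python
def vc_merge(own, received, i):
    """On receive at node i: element-wise max, then bump own entry."""
    merged = [max(a, b) for a, b in zip(own, received)]
    merged[i] += 1
    return merged

def vc_leq(a, b):
    """a causally precedes-or-equals b iff every entry of a is <= b's."""
    return all(x <= y for x, y in zip(a, b))

va = [2, 0, 0]                        # node 0 did two events
vb = [0, 1, 0]                        # node 1 did one event, saw nothing of node 0
assert not vc_leq(va, vb) and not vc_leq(vb, va)   # incomparable ⇒ concurrent
assert vc_merge(vb, va, 1) == [2, 2, 0]            # receive at node 1
```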
4.2 Broadcast Protocols
Broadcast = message to a group. Receiving ≠ delivering (hold until ordering satisfied).
Hierarchy (weakest → strongest)
- Best-effort — send, no retry on failure
- Reliable — correct nodes eventually deliver if sender correct
- FIFO — same-sender messages in send order
- Causal — if A→B causally, all nodes deliver A before B
- Total order — all nodes deliver all msgs in same order
- FIFO-total order — FIFO + total order. Strongest; used for SMR
4.3 Algorithms
- Eager reliable broadcast — rebroadcast on receipt → O(n²) msgs
- Gossip — forward to random subset. Probabilistic, efficient
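A single gossip round can be sketched as follows (fanout and set-based state are illustrative assumptions):

```python
import random

def gossip_round(nodes, infected, fanout=2, rng=random):
    """One round: each node holding the message forwards it to
    `fanout` random peers. Spread is probabilistic but fast."""
    newly = set()
    for n in infected:
        peers = [p for p in nodes if p != n]
        for peer in rng.sample(peers, fanout):
            newly.add(peer)
    return infected | newly

nodes = list(range(10))
state = {0}                           # node 0 broadcasts
for _ in range(4):
    state = gossip_round(nodes, state)
# after a few rounds, most nodes have the message with high probability
```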
5.1 Replication
Store same data on multiple nodes for availability & fault tolerance.
Idempotence
- f(f(x)) = f(x) — safe to retry. But f(g(f(x))) ≠ g(f(x)) in general
Retry semantics
- At-most-once — no retry (may lose op)
- At-least-once — retry until ACK (may apply twice)
- Exactly-once — retry + server deduplication (hardest)
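Exactly-once as "at-least-once retries + server-side dedup" can be sketched like this (request IDs and the in-memory dedup table are illustrative assumptions):

```python
class Server:
    """Sketch: deduplicate retried requests so each op applies once."""
    def __init__(self):
        self.balance = 0
        self.seen = {}                       # request_id -> cached reply

    def deposit(self, request_id, amount):
        if request_id in self.seen:          # duplicate retry: replay reply
            return self.seen[request_id]
        self.balance += amount               # apply the operation once
        self.seen[request_id] = self.balance
        return self.balance

s = Server()
s.deposit("req-1", 100)
s.deposit("req-1", 100)                      # client retried after a lost ACK
assert s.balance == 100                      # applied exactly once
```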
Concurrent writes
- Timestamps + tombstones — deleted items leave tombstone
- Last Writer Wins (LWW) — highest timestamp survives
- Multi-value register — keep all versions, app resolves conflict
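An LWW merge is a one-liner once each write carries a (timestamp, node_id, value) triple; the node id breaks timestamp ties deterministically (this representation is an assumption, not from the lectures):

```python
def lww_merge(a, b):
    """Last Writer Wins: keep the write with the highest
    (timestamp, node_id) — ties broken by node id."""
    return max(a, b, key=lambda w: (w[0], w[1]))

v1 = (5, "nodeA", "hello")
v2 = (7, "nodeB", "world")
assert lww_merge(v1, v2)[2] == "world"   # higher timestamp survives
```

Note the cost: the losing concurrent write is silently discarded, which is why multi-value registers exist.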
5.2 Quorums
Quorum = min nodes that must respond.
- Write quorum W: must write to ≥ W replicas
- Read quorum R: must read from ≥ R replicas
- R + W > N → overlap → read-after-write consistency
- Read repair: client finds stale data → writes fresh data back to stale replicas
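The overlap argument can be checked by brute force: with R + W > N, every read quorum intersects every write quorum, so some replica in the read set has the latest write.

```python
from itertools import combinations

def quorums_always_overlap(n, r, w):
    """True iff every r-subset of n nodes intersects every w-subset."""
    nodes = range(n)
    return all(set(rq) & set(wq)
               for rq in combinations(nodes, r)
               for wq in combinations(nodes, w))

assert quorums_always_overlap(3, 2, 2)        # R+W=4 > N=3: always overlap
assert not quorums_always_overlap(3, 1, 2)    # R+W=3 = N: disjoint quorums exist
```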
5.3 State Machine Replication
Same initial state + same ops in same order → same state everywhere.
- Use FIFO-total order broadcast to deliver every update to all replicas
- Leader-based: single leader sequences & broadcasts writes to followers
Weaker broadcast tradeoffs
- Causal OK if concurrent updates commute + deterministic
- Reliable OK if all updates commute + deterministic
- Best-effort OK if also idempotent + tolerate message loss
6.1 Consensus
Several nodes agree on one value. Decision is permanent once made.
Why not just single leader?
- Easy for ordering, but leader crash requires consensus to safely pick new leader
- Split-brain: two nodes both think they're leader → data corruption
Common algorithms
- Paxos — single-value consensus
- Multi-Paxos — generalization to replicated log (TOB)
- Raft — same power as Multi-Paxos, designed for understandability
- ZAB, Viewstamped Replication — FIFO-TOB, used in ZooKeeper
- System model: crash-recovery + partial synchrony. Impossible in async (FLP)
6.2 Raft — Leader Election
Node states
- Follower — default; waits for heartbeats
- Candidate — didn't hear from leader, starts election
- Leader — elected; handles all writes, replicates log
Election flow
- Follower timeout → Candidate, increments term, requests votes
- Each node votes at most once per term
- Quorum of votes → Leader. Split vote → new election with higher term
- Each log entry = (msg, term). Multiple leaders OK only if from different terms
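The "at most one vote per term" rule can be sketched as a vote handler; this omits the log up-to-dateness check from the full protocol, and the dict-based state is an illustrative assumption:

```python
def handle_request_vote(state, candidate_term, candidate_id):
    """Grant a vote iff the candidate's term is current and we have not
    already voted for someone else in that term."""
    if candidate_term < state["current_term"]:
        return False                          # stale candidate
    if candidate_term > state["current_term"]:
        state["current_term"] = candidate_term
        state["voted_for"] = None             # new term: vote not yet cast
    if state["voted_for"] in (None, candidate_id):
        state["voted_for"] = candidate_id
        return True
    return False

s = {"current_term": 1, "voted_for": None}
assert handle_request_vote(s, 2, "A")         # vote granted in new term 2
assert not handle_request_vote(s, 2, "B")     # already voted for A in term 2
```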
7.1 Two-Phase Commit (2PC)
Atomic commit across nodes: all commit or all abort. Any node can veto (unlike consensus).
Protocol
- Phase 1 (Prepare): coord sends Prepare(T) → each node votes Yes (locks) or No (aborts)
- Phase 2 (Commit): all Yes → Commit(T); any No → Abort(T)
Note: Fault-tolerant 2PC via consensus is Kleppmann's own variant. 3PC / Tx Commit are different.
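The two phases (and the any-node-can-veto property) can be sketched as follows; the prepare/commit/abort method names are illustrative:

```python
class Participant:
    def __init__(self, vote):
        self.vote, self.state = vote, "init"
    def prepare(self): return self.vote       # Yes: would lock resources here
    def commit(self):  self.state = "committed"
    def abort(self):   self.state = "aborted"

def two_phase_commit(participants):
    """Phase 1: collect votes. Phase 2: commit iff unanimous Yes."""
    votes = [p.prepare() for p in participants]
    if all(votes):
        for p in participants: p.commit()
        return "commit"
    for p in participants: p.abort()          # a single No vetoes
    return "abort"

assert two_phase_commit([Participant(True), Participant(True)]) == "commit"
assert two_phase_commit([Participant(True), Participant(False)]) == "abort"
```

The blocking problem is invisible in this sketch: if the coordinator crashes between the phases, real participants sit holding their locks.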
7.2 Linearizability
Operations behave as if executed atomically at single point in time on single copy of data.
- Every read sees the most recent write
- Quorum reads + writes alone do NOT guarantee linearizability
ABD algorithm
- After reading from quorum, write value back to quorum before returning ('read repair')
- This extra write step makes quorum ops linearizable
- Linearizable CAS in distributed system requires total-order broadcast
7.3 Eventual Consistency
Linearizability is costly. Eventual consistency trades guarantees for perf.
Eventual consistency
- All non-faulty replicas eventually deliver same updates (no time guarantee)
Strong Eventual Consistency (SEC)
- Eventual delivery + convergence: same updates → same state even if in different orders
- No coordination needed; uses causal broadcast or weaker
- Concurrent updates must have a merge rule
CAP Theorem
- Strongly consistent (linearizable) OR available during network partition — not both
- Choose availability → must eventually reconcile diverged state
Summary: Consistency Models
| Problem | Wait for | Synchrony |
|---|---|---|
| Atomic Commit | All participating nodes | Partial sync |
| Consensus / TOB / Lin. CAS | Quorum | Partial sync |
| Linearizable get/set | Quorum | Async |
| Eventual / Causal / FIFO | None (local state) | Async |
Paper Notes
RPC — Birrell & Nelson 1984
Call a function on another machine as if local (location transparency). Stack: user → user-stub (marshals args) → RPCRuntime → server-stub (unmarshals) → server. Reply travels back in reverse. Key choices: binding via Grapevine name service; at-most-once semantics via dedup table; no shared address space. Perf: ≈1100µs round-trip vs 9µs local — network dominates, not stub overhead. Trap: remote call can silently fail; must handle crash/timeout explicitly, unlike local calls.
Raft — Ongaro & Ousterhout 2014
Consensus / replicated log designed for understandability; equivalent power to Multi-Paxos. 3 roles: Leader (1 per term, all writes), Follower (replicates), Candidate (election). Election: follower times out → increments term, sends RequestVote; wins with majority if log is up-to-date. Randomised timeouts break splits. Log replication: leader appends, sends AppendEntries, commits once majority ACK. Entries flow leader → followers only. Safety: candidate must hold all committed entries to win ⇒ committed entries survive leader change. Invariant: same (index, term) ⇒ identical logs; committed entries never overwritten.
FLP — Fischer, Lynch, Paterson 1985
Theorem: No deterministic consensus protocol can guarantee termination in a fully asynchronous system with even one crash-stop failure. Why: Can't distinguish dead from very slow. Every protocol has a "bivalent" (undecided) config; adversary delays one message to keep it there forever. Does NOT say consensus is impossible in practice — only that guaranteed termination requires timing assumptions. Escapes: partial synchrony + timeouts; randomisation (prob-1 termination); failure detectors.
Partial Synchrony — Dwork, Lynch, Stockmeyer 1988
Timing bounds exist but are unknown (or only hold after Global Stabilisation Time, GST). Result: consensus is solvable under partial synchrony. Safety never requires timing; liveness needs timeouts that fire correctly after GST. Paxos/Raft live here: always safe, just possibly slow during async periods.
Paper Trail + Sergey Notes on FLP
Paper Trail: every initial config is bivalent; every bivalent config has a bivalent successor reachable by delaying one process ⇒ adversary keeps system undecided indefinitely. Sergey: FLP attacks progress not correctness. Protocol may be stuck but never wrong. Escape via clocks/timeouts (partial synchrony).
Failure Detectors — Chandra & Toueg 1996
Oracle that suspects crashed processes (allowed to be wrong). Completeness: every crashed process eventually suspected. Accuracy: degree to which live processes avoid false suspicion. Classes: ◇W (eventually weak) is the minimal power needed to solve consensus in async model. Paxos/Raft timeouts implement ◇P (eventually perfect).
Chandy-Lamport Snapshot — 1985
Take a consistent global checkpoint of a live system without stopping it. Algorithm: initiator records its state and sends a marker on all outgoing channels. On first marker received (say on channel c): record local state, record c as empty, send markers on all outgoing channels; keep recording messages arriving on the other incoming channels until their markers arrive. Result: consistent cut — no recorded receipt predates its recorded send. Uses: deadlock detection, checkpointing, GC.
State Machine Replication — Schneider 1990
Same initial state + same commands in same order ⇒ identical replicas. Requirements: (1) same requests to all; (2) same total order (= total-order broadcast); (3) deterministic execution. Fail-stop needs f+1 replicas; Byzantine needs 3f+1. SMR ≡ consensus ≡ TOB (inter-reducible).
Chain Replication — van Renesse & Schneider 2004
Servers in a chain. Head takes writes; each node forwards; Tail commits and serves reads. Linearizability: reads from Tail always see latest committed write. Failure: separate master removes failed node, links predecessor to successor. Benefit: write/read load separated; simpler linearizability than Paxos for primary-backup.
Part-Time Parliament (Paxos) — Lamport 1998
Phase 1 (Prepare/Promise): proposer picks ballot b, asks acceptors to promise not to accept b′ < b and to report any already-accepted value. Phase 2 (Accept/Accepted): proposer sends value (highest already-accepted if any); committed once majority accepts. Multi-Paxos: stable leader skips Phase 1 for subsequent slots; gaps filled with no-ops. Safety: two majorities always overlap, preventing two decisions. Liveness: needs stable leader; dueling proposers can stall.
Paxos Made Moderately Complex (PMMC)
Multi-Paxos roles: Replicas (hold log), Leaders (scout → Phase1, commander → Phase2), Acceptors (vote). Scout → p1a → quorum p1b → adopted. Commander → p2a → quorum p2b → decision. Higher ballot preempts → restart scout.
ZooKeeper / ZAB
ZooKeeper: coordination service — znodes (tree), watches, ephemeral nodes. Linearizable writes; FIFO client-order reads. ZAB: FIFO-total-order broadcast. Leader proposes with epoch+zxid; followers ACK; leader commits. Unlike Paxos: FIFO per-sender ordering is a hard guarantee; epoch = term. Recovery: new leader re-proposes all ACKed-but-uncommitted ops before taking new requests.
Byzantine Generals — Lamport, Shostak, Pease 1982
n nodes, ≤ f Byzantine (arbitrary/lying). Loyal nodes must agree. Oral msgs: solvable iff n ≥ 3f+1. Signed msgs: solvable for any n > f. Intuition: without signatures, one traitor can relay conflicting stories to two loyals who can't detect forgery.
PBFT — Castro & Liskov 1999
First practical BFT for async networks; n ≥ 3f+1. 3 phases: Pre-prepare (primary assigns seq#), Prepare (wait 2f matching), Commit (wait 2f+1, execute). View change: replicas timeout primary → elect next, recover uncommitted ops. MACs in normal path (not public-key). Only 3% slower than unreplicated NFS.
Consistent Hashing — Karger et al. 1997
Problem: standard modular hashing remaps ≈all keys when a server joins/leaves. Fix: hash servers and keys onto a ring; key → first server clockwise. Adding/removing 1 server moves only O(K/n) keys. Virtual nodes balance load. Used in: Akamai CDN, Dynamo, Cassandra.
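The ring with clockwise lookup fits in a few lines; md5 as the hash and the lack of virtual nodes are simplifying assumptions:

```python
import bisect, hashlib

def h(key):
    """Hash a string to a point on the ring."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class Ring:
    """Consistent-hash ring sketch: key -> first server clockwise."""
    def __init__(self, servers):
        self.points = sorted((h(s), s) for s in servers)

    def lookup(self, key):
        ks = [p for p, _ in self.points]
        i = bisect.bisect(ks, h(key)) % len(self.points)  # wrap around the ring
        return self.points[i][1]

ring = Ring(["s1", "s2", "s3"])
assert ring.lookup("user:42") in {"s1", "s2", "s3"}
```

Adding a server inserts one point, so only the keys between it and its predecessor move — the O(K/n) property.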
Chord — Stoica et al. 2001
P2P lookup: key → responsible node in O(log N) hops via finger table (O(log N) pointers at exponentially increasing ring distances). Successor pointer alone guarantees correctness (slow); fingers give speed. Join/leave: O(log² N) msgs.
Bayou — Terry et al. 1995
Weakly-connected replicated storage; writes accepted offline. Tentative writes committed in global order by primary. Conflicts resolved per-write via dependency check + merge procedure. On sync: undo tentative log, apply committed, replay tentative. App-specific conflict resolution; eventual consistency with causal ordering.
Dynamo — DeCandia et al. 2007
Highly available KV store; AP (trades consistency for availability). Consistent hashing with virtual nodes. Quorums: R+W > N; typical N=3, R=W=2. Conflicts: vector clocks track causality; divergent versions sent to client for app-level merge. Gossip for membership/failure detection. Merkle trees for anti-entropy sync.
Session Guarantees — Terry et al. 1994
4 guarantees weaker than linearizability, stronger than nothing, for eventual-consistency stores: Read Your Writes; Monotonic Reads (no going back in time); Writes Follow Reads (causal order); Monotonic Writes (session order). Implemented via per-session version vectors; replica must be at least as fresh as session's last-seen state.
Linearizability — Herlihy & Wing 1990
Strongest correctness condition for concurrent objects. Every op appears to take effect atomically at a single point (linearisation point) between its invocation and response. Local: system is linearizable iff each object is linearizable independently. Not serializability: linearizability is per-op/per-object; serializable is per-transaction. Requires global coordination; CAP-incompatible with availability.
ABD / Shared Memory in Message Passing — Attiya et al. 1995
Simulates a linearizable register with quorums. Write: broadcast (ts, val) to all; wait for majority ACK. Read: query majority → find highest (ts, val) → write it back to majority (read-repair) → return val. Write-back step is what gives linearizability; without it, stale reads are possible.
2PC — Gray 1981
All nodes commit or all abort (atomic commit). Phase 1: coordinator sends Prepare; participants vote Yes (lock) or No. Phase 2: all Yes → Commit; any No → Abort. Blocking: coordinator crash after Yes votes but before Commit ⇒ participants hold locks indefinitely. Not fault-tolerant.
3PC — Skeen 1983
Breaks 2PC blocking by adding Pre-commit phase: Prepare → Pre-commit → Commit. Pre-commit signals "all voted Yes"; a node receiving it can commit or abort safely even if coordinator crashes. Caveat: fails under network partitions (split-brain). Rarely used; Paxos-based 2PC preferred.
Consensus on Transaction Commit — Lamport & Gray 2006
Replace 2PC coordinator with a Paxos group. Each participant's Yes/No vote is logged via Paxos; any node can reconstruct the outcome. Non-blocking atomic commit tolerating f failures with 2f+1 Paxos acceptors.
Percolator — Peng & Dabek 2010
Incremental processing on Bigtable; powers Google web indexing. Transactions: snapshot isolation via MVCC timestamps. Reads use start-ts snapshot; commit assigns commit-ts. 2PC over Bigtable: locks are Bigtable cells; primary lock committed first, then secondaries. Observers: trigger on column change, replace MapReduce for incremental work. Result: document age in index dropped ∼50%.
Coda — Satyanarayanan et al.
DFS for mobile/disconnected clients (CMU). Disconnected op: Venus cache on client; reads/writes hit cache offline; reintegration replays writes on reconnect. Version vectors detect conflicts; flagged to user. Hoarding: pre-fetch predicted-needed files before disconnect.
NFS — Sun 1985
First widely-deployed DFS. Stateless server (each RPC self-contained with file handle + offset) ⇒ crash recovery = client retry. Close-to-open consistency: flush on close, check freshness on open. Not linearizable. Lesson: statelessness and simplicity beat strong consistency for adoption.
MapReduce — Dean & Ghemawat 2004
Parallel batch processing model: map(k,v)→[(k',v')] then reduce(k',[v'])→[v'']. Framework handles partitioning, scheduling, re-execution. Straggler mitigation via backup tasks. Fault tolerance: re-run map on worker failure (local disk output); reduce output goes to GFS.
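The programming model (minus the distribution) fits in a few lines; here the shuffle is just an in-memory dict, which stands in for the framework's partitioning step:

```python
from collections import defaultdict

def map_fn(doc):                      # map(k, v) -> [(k', v')]
    return [(word, 1) for word in doc.split()]

def reduce_fn(word, counts):          # reduce(k', [v']) -> v''
    return sum(counts)

def mapreduce(docs):
    groups = defaultdict(list)        # stands in for shuffle/partition
    for doc in docs:
        for k, v in map_fn(doc):
            groups[k].append(v)
    return {k: reduce_fn(k, vs) for k, vs in groups.items()}

out = mapreduce(["a b a", "b c"])
assert out == {"a": 2, "b": 2, "c": 1}
```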
GFS — Ghemawat et al. 2003
Distributed FS for large sequential reads and append-mostly workloads. Single Master (metadata only; in-memory, WAL) + Chunkservers (64 MB chunks, 3× replicated). Client contacts master once for chunk locations, then writes directly to chunkservers (master off data path). Consistency: relaxed — concurrent appends may produce duplicates/padding; apps must be tolerant.
Bigtable — Chang et al. 2006
Sparse, distributed, persistent sorted map: (row, column family:qualifier, timestamp) → bytes. Rows lexicographically sorted; row ranges = tablets assigned to Tablet Servers by Master. Chubby (Paxos lock service) for master election and root tablet location. Write path: Memtable (in-RAM) + WAL → compacted to SSTable files on GFS.
Spanner — Corbett et al. 2012
Google's globally-distributed, strongly consistent relational DB. TrueTime: TT.now() returns interval [earliest, latest] bounding true wall-clock time (GPS + atomic clocks per datacenter, drift-bounded). External consistency: T1 commits before T2 starts ⇒ ts(T1) < ts(T2). Enforced by commit-wait: delay commit until TT.after(commit_ts). Enables lock-free read-only transactions and globally consistent snapshots. Paxos groups per shard; 2PC across shards.