System Design

Go from building CRUD backends to designing Twitter/Uber/Dropbox-scale systems and defending every tradeoff under load, failure, and consistency constraints — and then actually building the hard distributed primitives yourself.

System design is the skill that separates a senior engineer who ships features from a staff/principal engineer who shapes architecture.

The roadmap

Stage 1 — Fundamentals: latency, throughput, consistency, and back-of-envelope · 3-4 weeks
Build the quantitative and vocabulary foundation: reason numerically about latency/throughput, understand the CAP/PACELC tradeoff space, and estimate capacity before drawing a single box.

Concepts, resources and problems

Concepts — Latency vs throughput; tail latency (p50/p95/p99) and why averages lie; the tail-at-scale effect (one slow component dominates fan-out requests) · Latency numbers every engineer should know (memory vs SSD vs network vs cross-datacenter) · Back-of-envelope estimation: QPS, storage, bandwidth, memory, server count · Little's Law and basic queueing intuition (why a system falls off a cliff near saturation; the M/M/1 latency-vs-utilization curve) · CAP theorem precisely (it's about partitions), and PACELC as the better framing · Consistency models: linearizability, sequential, causal, eventual, read-your-writes — and the cost asymmetry between them · Availability math: nines, MTBF/MTTR, why redundancy multiplies availability, and why dependencies multiply downtime · Vertical vs horizontal scaling; statelessness; the 8 fallacies of distributed computing

Read — The System Design Primer (read intro + Performance vs scalability, Latency vs throughput, CAP, Consistency/Availability patterns) · Designing Data-Intensive Applications, Ch.1 (Reliable, Scalable, Maintainable) · Latency Numbers Every Programmer Should Know (interactive, by year) · The Tail at Scale (Dean & Barroso, CACM 2013) · Fallacies of Distributed Computing — and Marc Brooker's blog (AWS Principal Engineer) on why they bite

Watch — Gaurav Sen — System Design playlist (start: scalability, latency, throughput basics) · ByteByteGo — back-of-the-envelope estimation & latency fundamentals

Problems

medium Estimate the infra for a tweet-like service: 300M DAU, avg 2 tweets/day, 100:1 read/write — derive QPS, storage/yr, bandwidth, cache size, server count, AND the fan-out write amplification for a 10M-follower account — Forces every fundamental together: read/write skew, fan-out, hot keys, storage growth. Do it on paper, then defend each number out loud.
hard Derive the maximum sustainable QPS of a single service given a fixed thread pool and a measured p99 downstream call, using Little's Law — then plot latency vs utilization and show exactly where it collapses, and what changes if the downstream p99 doubles — This is the Jane-Street-flavored quantitative trap: most engineers can't connect concurrency limit, latency, and throughput. Getting the cliff right — and the nonlinearity near saturation — is the point.
hard Given a 4-node cluster with independent 99.9% node availability and quorum reads/writes, compute end-to-end availability and the read/write availability asymmetry; then redo it for N=5 with R=2,W=4 and explain why availability is not symmetric in R and W — Quorum availability is non-obvious and asymmetric; deriving it by hand is the gateway to understanding replication tradeoffs in Stage 2.
hard Using The Tail at Scale: a request fans out to 100 leaf servers and waits for all; each leaf has a 1% chance of a >1s response. Compute the probability the overall request exceeds 1s, and show how hedged requests cut the tail — The 1 - 0.99^100 ≈ 63% result is the single most counterintuitive fundamentals fact at scale, and reasoning about hedging/tied requests is staff-level latency thinking.

Done when — You can take any 'design X for N users' prompt and, in under 10 minutes on paper, produce defensible QPS / storage / bandwidth / server-count estimates, derive the tail-latency probability for a fan-out request, and state precisely which consistency and availability guarantees you're targeting and why.

Stage 2 — Building blocks: load balancers, caching, queues, CDNs, databases, replication, sharding · 5-7 weeks
Master the components every large system is assembled from, and the tradeoffs of each: when to cache, how to shard, what replication buys and costs, SQL vs NoSQL for real — and implement consistent hashing and an LSM/B-tree yourself.

Concepts, resources and problems

Concepts — Load balancing: L4 vs L7, algorithms (round-robin, least-conn, consistent hashing, power-of-two-choices), health checks · Caching: cache-aside vs write-through vs write-back, TTL/eviction (LRU/LFU/W-TinyLFU), cache stampede, thundering herd, request coalescing · CDNs and edge caching; push vs pull; cache invalidation (the hard problem) · Message queues & log-based brokers: at-least-once vs at-most-once, backpressure, dead-letter queues, Kafka's partitioned log model, consumer-group rebalancing · SQL vs NoSQL: when relational/ACID is right, when document/wide-column/KV wins; the real cost of joins and distributed transactions at scale · Indexing & storage engines: B-trees vs LSM-trees, write/read/space amplification, the RUM conjecture (read/update/memory tradeoff triangle), compaction · Replication: leader-follower, multi-leader, leaderless (Dynamo-style quorums); sync vs async; replication lag and its anomalies (read-your-writes, monotonic reads) · Partitioning/sharding: by key range vs hash; consistent hashing with virtual nodes; rebalancing; hot shards and the celebrity/whale problem

Read — DDIA Part II — Ch.3 Storage & Retrieval, Ch.5 Replication, Ch.6 Partitioning, Ch.7 Transactions · Dynamo: Amazon's Highly Available Key-value Store (SOSP 2007) · Kafka: a Distributed Messaging System for Log Processing (LinkedIn, NetDB 2011) · Jay Kreps — The Log: What every software engineer should know about real-time data's unifying abstraction

Watch — ByteByteGo — Consistent Hashing, Caching strategies, SQL vs NoSQL, Kafka explained · Hussein Nasser — Backend Engineering (databases, indexes, connection pooling, protocols) · CMU Intro to Database Systems (Andy Pavlo, 15-445) — storage, indexes, B-trees, LSM

Problems

hard Implement a consistent-hashing ring with virtual nodes; empirically measure load imbalance (max/mean bucket) as a function of vnodes-per-node, then prove the O(1/sqrt(V)) imbalance bound matches your data — Consistent hashing is deceptively subtle; proving the rebalancing and load-distribution properties (not just hand-waving) with real measured numbers is the difference between knowing the buzzword and understanding it.
hard Implement a Bitcask-style log-structured KV store (append-only log + in-memory hash index + compaction + crash recovery from the log), then benchmark write vs read amplification against a naive B-tree — Building the actual storage engine forces out every hand-wave about LSM/log-structured storage, durability, and the read/write/space tradeoff. This is the single highest-ROI implementation in Stage 2.
hard Design the sharding strategy for a multi-tenant DB where one tenant is 1000x larger than the rest — solve the hot-shard problem with online split/migration and zero downtime, and quantify the migration's impact on tail latency — The 'celebrity/whale' problem is where naive hash-sharding dies. Designing split/dynamic-sharding with online rebalancing is genuinely hard and is exactly what real systems (Vitess, Spanner, Cockroach) do.
brutal Given a leaderless quorum store with N=3, enumerate EVERY read/write anomaly possible under each (R,W) setting, identify which (R,W) give which guarantees, and show precisely which anomalies sloppy quorums + hinted handoff reintroduce even when R+W>N — Enumerating the actual anomaly set (stale reads, lost updates, concurrent-write conflicts, sloppy-quorum violations) is exactly the kind of rigorous case analysis that separates real understanding from buzzwords — and it's a classic distributed-systems exam killer.

Done when — Given any component choice (this cache vs that, SQL vs a wide-column store, sync vs async replication, hash vs range sharding) you can state the precise failure modes and consistency anomalies each introduces and pick correctly for a stated workload — and you've implemented consistent hashing AND a log-structured storage engine yourself with measured numbers.

Stage 3 — Distributed patterns: consensus, idempotency, rate limiting, CQRS, sagas, event sourcing · 6-8 weeks
Move from components to the coordination patterns that make distributed systems correct: agreement, exactly-once-effect, idempotency, transactional workflows across services — and understand consensus deeply enough to start implementing Raft.

Concepts, resources and problems

Concepts — Consensus: why it's needed, FLP impossibility, Paxos intuition, Raft (leader election, log replication, safety/Figure 8, membership changes), quorum intersection · Distributed transactions: 2PC and its blocking/coordinator-failure problem; why microservices avoid it; deterministic vs interactive transactions · Sagas (orchestration vs choreography) and compensating transactions for cross-service workflows · Idempotency & idempotency keys; exactly-once 'effect' vs exactly-once delivery (and why true exactly-once delivery is impossible — the Two Generals / FLP intuition) · Rate limiting algorithms: token bucket, leaky bucket, fixed/sliding window, sliding-window-log; distributed rate limiting with bounded staleness · CQRS and event sourcing: append-only log as source of truth, projections, replay, the read/write model split · Distributed time: clocks, NTP skew, logical clocks (Lamport), vector clocks, hybrid logical clocks (HLC); happens-before; why you can't trust wall clocks · Outbox pattern, change-data-capture (CDC), and reliably publishing events alongside DB writes (the dual-write problem)

Read — In Search of an Understandable Consensus Algorithm (Raft) — Ongaro & Ousterhout, plus the interactive viz · DDIA Part II Ch.8 (Trouble with Distributed Systems) & Ch.9 (Consistency and Consensus) · Distributed Systems for Fun and Profit (Takada) — esp. time/order and replication chapters · Marc Brooker — Timeouts, retries, backoff with jitter, and idempotency (brooker.co.za) + AWS Builders' Library

Watch — The Secret Lives of Data — Raft visualized (and the raft.github.io animation) · Martin Kleppmann — Distributed Systems lecture series (Cambridge, 8 lectures, free)

Problems

hard Design an idempotent payment-charge API: same idempotency key, concurrent retries, partial failures (charge succeeded but response lost), key expiry — guarantee at-most-once charge and prove correctness under every interleaving including the request-fingerprint mismatch case — Exactly-once-effect under concurrency + retries is the canonical hard pattern problem. The proof-of-correctness-under-interleavings is the Jane-Street energy.
hard Design a distributed sliding-window rate limiter accurate across N nodes with no single bottleneck; formally bound the worst-case over-admission vs an approximate gossip/redis approach, and state the consistency model your counter provides — Distributed counting with bounded staleness is a deceptively deep coordination problem; the exact-vs-approximate tradeoff and the formal error bound are the real lesson.
brutal Reconstruct Raft's Figure 8 safety scenario: build the exact interleaving where a naive 'commit any majority-replicated entry' rule would let a leader overwrite a committed entry, then show precisely which Raft restriction (commit-only-current-term + election restriction) prevents it — Reconstructing the exact safety violation Raft's restrictions prevent forces you to understand consensus at the level you'd need to implement or debug it — this is the question that exposes whether you really get Raft.
brutal Prove the dual-write problem: construct the crash interleaving where 'write DB then publish event' loses or duplicates an event, then design and argue the correctness of a transactional-outbox + CDC solution that survives a crash at every line — The dual-write problem is the silent data-loss bug in half of all microservice architectures; proving the outbox solution correct under arbitrary crashes is the integration of idempotency, logs, and CDC.

Done when — You can take a multi-service workflow (order -> payment -> inventory -> shipping) and design it correct under retries, partial failures, crashes, and out-of-order delivery using idempotency keys, sagas/outbox, CDC, and the right consistency model — and you can reconstruct Raft's Figure-8 safety argument well enough to start implementing it.

Stage 4 — Full system designs: from URL shortener to news feed, chat, search, and geo · 6-8 weeks
Synthesize everything into complete end-to-end designs under realistic constraints, fluently navigating the whole tradeoff space out loud the way a staff engineer would in a design review — then diff your design against how the real company actually did it.

Concepts, resources and problems

Concepts — The design framework: clarify requirements -> estimate -> define APIs -> data model -> high-level design -> deep-dive -> identify bottlenecks/SPOFs · URL shortener / pastebin: ID generation (counter vs hash vs Snowflake), KV store, cache, analytics · News feed: fan-out-on-write vs fan-out-on-read, the celebrity hybrid, feed ranking, timeline storage, pagination · Chat/messaging: WebSockets vs long-poll, presence, delivery & read receipts, message ordering, offline message storage, multi-device sync · Search/typeahead: inverted indexes, tries, ranking, sharded search, autocomplete latency budgets · Notification systems & fan-out at scale; push vs pull; device-token management; deduplication · Geo & proximity systems (Uber/Yelp): geohashing, quadtrees, S2 cells, matching, real-time location streams · Observability of your design: metrics/logs/traces, RED/USE methods, where the bottlenecks and SPOFs are, and how you'd load/chaos test it

Read — System Design Primer — worked solutions (Pastebin, Twitter timeline, web crawler, mint, query cache) · awesome-scalability — real-world architectures and engineering blog index (Netflix, Uber, Discord, etc.) · High Scalability — real architecture case studies (incl. 'How Discord Stores Trillions of Messages') · DDIA Ch.11 (Stream Processing) & Ch.12 (The Future of Data Systems)

Watch — ByteByteGo — full system design walkthroughs (Twitter, news feed, chat, notification, typeahead) · System Design Interview / Gaurav Sen — designing Uber, WhatsApp, Instagram, rate limiter

Problems

hard Design Twitter's timeline for 300M users with a celebrity having 100M followers — solve fan-out with the hybrid push/pull model, then defend the storage/latency math AND specify exactly the consistency model a user sees (can they see their own tweet immediately? a friend's? a celebrity's?) — The celebrity fan-out problem is the canonical 'naive design explodes' moment; nailing the hybrid AND the per-relationship consistency story is staff-level.
brutal Design a globally-distributed chat (WhatsApp/Slack-scale): per-conversation message ordering, exactly-once delivery to each of a user's devices, presence, and offline sync — across regions — then state which guarantees you sacrifice and why — Combines ordering, idempotency, fan-out, connection state, multi-device sync, and geo-distribution at once — the integration test of Stages 1-3.
brutal Design real-time ride-matching (Uber): S2/geohash indexing, supply/demand matching, surge pricing, and sub-second location updates for millions of concurrent drivers — quantify the write throughput of the location stream and how you shard it without hot cells — Geo-spatial indexing + real-time matching + a write-heavy location firehose is a genuinely hard, distinct problem class beyond CRUD-at-scale, and hot geo-cells are the celebrity problem in disguise.
brutal Design a Google-Docs-style collaborative editor: real-time concurrent edits with convergence and intention-preservation — choose between OT and CRDTs, and defend the consistency/availability tradeoff against the alternative — Collaborative editing is where eventual consistency gets genuinely subtle (concurrent inserts at the same position); choosing OT vs CRDT and arguing convergence is a frontier-flavored design problem most candidates can't touch.

Done when — You can stand at a whiteboard and design Twitter, WhatsApp, a typeahead service, Uber, or a collaborative editor end-to-end in 45 minutes — driving requirements, estimates, data model, high-level architecture, and at least two deep-dives — while proactively naming bottlenecks, SPOFs, the exact consistency model each user sees, and how you'd chaos-test it.

Stage 5 — Hard distributed systems: build the real thing and read the real papers · 10-16 weeks
Cross from 'can design on a whiteboard' to 'can build and reason about genuinely hard distributed systems'. Implement consensus, replication, and a sharded log; read the canonical papers critically; then stress your own system like Jepsen would.

Concepts, resources and problems

Concepts — Implementing Raft from scratch: leader election, log replication, persistence, snapshotting, log compaction, membership changes · Building a replicated, linearizable, sharded key-value store on top of consensus (the MIT 6.5840 labs arc: Raft -> KVRaft -> Sharded KV with live migration) · Linearizability testing and how to break your own system: partitions, clock skew, message reordering, GC pauses (Jepsen/Maelstrom/Knossos) · Exactly-once-effect pipelines: dedup, idempotent consumers, transactional outbox + CDC, and the precise limits of 'exactly once' · Globally-consistent state: TrueTime/Spanner, external consistency, the latency cost of strong consistency across regions · CRDTs and coordination-free replication: when you can avoid consensus entirely; CALM theorem (consistency as logical monotonicity) · Deterministic simulation testing (FoundationDB / TigerBeetle style) as the frontier of correctness, and TLA+ for model-checking your design · Reading distributed systems papers critically: failure model, assumptions, and — most important — what they explicitly DON'T guarantee

Read — MIT 6.5840 (formerly 6.824) — schedule, paper list, and the Raft / KVRaft / Sharded-KV labs · Spanner: Google's Globally-Distributed Database (OSDI 2012) · Jepsen analyses — how real databases violate their consistency claims · Papers We Love — distributed systems & datastores collection (Raft, Dynamo, Paxos Made Simple, Bigtable, Chubby, ZooKeeper/Zab)

Watch — MIT 6.824 lectures (Robert Morris) — full course on YouTube (Spring 2020) · Kyle Kingsbury (Aphyr) — Jepsen talks: 'breaking distributed systems for fun and profit'

Problems

brutal Implement Raft from scratch (leader election + log replication + persistence + snapshots/compaction) passing the 6.5840 Lab 3 test suite under injected network partitions, message loss, reordering, and server restarts — The definitive 'do you actually understand consensus' test. Passing the chaos-injecting suite is a rite of passage that takes weeks and teaches more than any reading — most people fail their first attempt.
brutal Build a fault-tolerant, sharded key-value store on top of your Raft with linearizable client semantics, a shard controller, and live shard migration with no lost or double-applied operations during reconfiguration (6.5840 Labs 4 + 5) — Composing consensus + sharding + reconfiguration into a correct linearizable store with live migration is real distributed-systems engineering — the hardest thing on this roadmap and a genuine portfolio piece.
brutal Solve all six Fly.io Gossip Glomers challenges, ending with #6 'Totally-Available Transactions' — your code is verified by a Jepsen-grade checker under network partitions — Adversarial, automatically-verified distributed challenges — the closest thing to 'competitive programming for distributed systems.' The broadcast-under-partition and totally-available-transaction problems are deviously hard and graded by a real consistency checker.
brutal Run your Raft/KV store under a Maelstrom or Jepsen-style harness with partitions, clock skew, and reordering; produce a writeup listing every invariant it holds and every one it violates — your own mini-Jepsen analysis — 'Exactly once' and 'linearizable' are famously easy to claim and hard to verify; building the test harness that catches your own violations is where theory meets a system you'd actually trust.

Done when — You have a working Raft implementation (ideally a sharded, live-migrating KV store) that passes the 6.5840 chaos suite, you've completed Gossip Glomers through totally-available transactions, you can read and critique the Dynamo/Raft/Spanner/Kafka papers including what they DON'T guarantee, and you can design a globally-consistent or exactly-once system and state precisely what it costs in latency and availability.

Projects

Distributed token-bucket rate limiter — Build a rate limiter as a standalone service: token bucket + sliding window, backed by Redis (or your own shared counter), enforcing per-user and per-IP limits across multiple app instances. Expose it as middleware and load-test it to find the cliff.
Log-structured storage engine (Bitcask/LSM) — Build an embeddable KV storage engine from scratch: an append-only write-ahead log, an in-memory index, crash recovery by replaying the log, and background compaction. Expose put/get/delete/scan and a benchmark harness.
Replicated key-value store — Implement a KV store with leader-follower replication: a leader accepts writes, replicates a WAL to followers, followers serve reads. Add tunable read consistency (read-from-leader vs read-from-any with bounded staleness) and failover.
Kafka-lite: a partitioned, replicated message log — Build a Kafka-lite: a partitioned, append-only log with producers, consumers, consumer groups, offset tracking, disk-persisted segments, and at-least-once delivery with replay from any offset.
Raft-backed, sharded, fault-tolerant KV store (FLAGSHIP) — Implement Raft from scratch (election, log replication, persistence, snapshots/compaction), build a linearizable KV store on top, then shard it across multiple Raft groups with a shard controller and live shard migration. Follow the MIT 6.5840 lab arc (Labs 3-5) and pass its chaos test suite.

Going harder

Hard problem arena — 6 brutal problems

brutal Gossip Glomers — Fly.io distributed systems challenges — Six escalating challenges run against a Jepsen-grade verifier: unique ID generation, broadcast/gossip with partitions, a grow-only counter, single- and multi-node Kafka-style logs, and totally-available transactions. Your code is adversarially partition-tested by a real consistency checker. This is the closest thing to 'Jane Street puzzles for distributed systems' that exists — the broadcast-under-partition (3d/3e) and totally-available-transactions (6) problems defeat most people.
legendary MIT 6.5840 Labs — Raft, fault-tolerant KV on Raft, sharded KV with live migration — The legendary lab arc. The Raft lab (Lab 3) alone defeats most people on the first attempt; the sharded-KV-with-live-migration lab (Lab 5) is one of the hardest self-study projects in all of systems. Test suites inject partitions, restarts, and reordering specifically to break naive implementations — and they run hundreds of times to catch rare races.
legendary Reproduce a Jepsen analysis on a real database — Pick a database, read Jepsen's published analysis in full, then reproduce a consistency violation yourself (or verify a claimed fix) using partitions and clock skew with Maelstrom or the Jepsen library. Reading these adversarially, then reproducing one, is graduate-level correctness work and an elite resume line.
legendary Model-check a real protocol in TLA+ (or P) — Specify a non-trivial protocol — your own commit protocol, a cache-coherence scheme, or a simplified Raft — in TLA+ and let the model checker (TLC) find the deadlock or invariant violation you didn't see. AWS famously found serious bugs in S3/DynamoDB this way. Formally proving (or disproving) your own design is the rare skill that separates principal engineers.
brutal Design a globally-consistent counter / exactly-once pipeline and confront the impossibility — The 'impossible-feeling' design problems: a counter that's both globally linearizable AND highly available (and confronting why CAP/PACELC say you can't have both during a partition), or an exactly-once delivery pipeline (and why true exactly-once delivery is impossible — Two Generals — but exactly-once EFFECT isn't). The reasoning, and being able to say precisely what's achievable and what it costs, is the prize.
brutal Paper reproductions: implement the load-bearing idea from a canonical paper — Don't just read the paper — build the load-bearing idea: a Bitcask-style log-structured KV store, an LSM-tree storage engine, vector clocks with conflict resolution, hybrid logical clocks, or a Bigtable-style SSTable layer. Implementing forces out every hand-wave the paper let you skip — this is where reading turns into understanding.

Keep curious

Blogs, people, communities, rabbit holes

Books beyond DDIA: 'Database Internals' (Alex Petrov) for storage engines & distributed DB internals; 'Understanding Distributed Systems' (Roberto Vitillo) as a modern approachable companion; 'Designing Distributed Systems' (Brendan Burns) for cloud-native patterns; and the new 2nd edition of DDIA (Kleppmann & Riccomini) as it ships.
Blogs to follow religiously: Marc Brooker (https://brooker.co.za/blog/) — AWS principal engineer, the clearest writer alive on retries/timeouts/consistency/load; Murat Demirbas 'Metadata' (https://muratbuffalo.blogspot.com/) — distributed-systems paper reviews; Aphyr/Kyle Kingsbury (https://aphyr.com/) — correctness and Jepsen teardowns; Brendan Gregg (https://www.brendangregg.com/) — performance & observability; Andy Pavlo / CMU DB group blog for DB internals.
Engineering blogs (read the source): AWS Builders' Library (https://aws.amazon.com/builders-library/ — short, deep, free essays by AWS principal engineers; start with the retries, leader-election, and caching ones), Netflix Tech Blog, Uber Engineering, Discord Engineering ('How Discord Stores Trillions of Messages'), Cloudflare Blog, Slack Engineering, and Stripe/Shopify engineering for correctness-at-scale.
Newsletters: ByteByteGo (https://blog.bytebytego.com/) for digestible system design; Gergely Orosz 'The Pragmatic Engineer' (https://newsletter.pragmaticengineer.com/) for real-world scale stories; and the archived 'Morning Paper' by Adrian Colyer (https://blog.acolyer.org/) for hundreds of paper summaries — still gold even though it's no longer updated.
Communities: r/ExperiencedDevs and r/distributedsystems; Papers We Love (https://paperswelove.org/ — local meetups + a huge talk archive on YouTube); the dbms.social / distributed-systems Discords; lobste.rs and Hacker News for architecture writeups and the comment debates that often beat the article.
Competitions & arenas to keep sharp: Fly.io Gossip Glomers (https://fly.io/dist-sys/), the MIT 6.5840 labs as a recurring re-do challenge, reproducing one Jepsen analysis per year as a personal tradition, and the TigerBeetle / FoundationDB simulation-testing rabbit hole.
Conferences & talks (watch the recordings): USENIX OSDI / NSDI / ATC (https://www.usenix.org/conferences — where the field's papers debut); CMU Database Group tech talks on YouTube (Andy Pavlo) for cutting-edge DB internals; the archived Strange Loop talks; QCon; and PWLConf (Papers We Love Conf).
People to follow: Martin Kleppmann (DDIA, CRDTs, local-first), Kyle Kingsbury / @aphyr (Jepsen), Marc Brooker (AWS), Andy Pavlo (CMU databases), Jay Kreps (Kafka/'The Log'), Murat Demirbas, Caitie McCaffrey (sagas / distributed transactions talks), Peter Bailis (consistency), and the TigerBeetle team (Joran Greef) for deterministic simulation.
Frontier rabbit holes — 'if this clicks, go deeper here': (1) CRDTs and local-first software (https://crdt.tech/, Kleppmann's local-first essay) — coordination-free replication; (2) Deterministic simulation testing (FoundationDB's approach + TigerBeetle's writeups and the 'VOPR' fuzzer) — the future of correctness testing; (3) Distributed SQL / NewSQL internals (Spanner, CockroachDB, FoundationDB, TigerBeetle architecture docs); (4) Formal methods — TLA+ (Lamport) and the P language (Microsoft) to model-check your own designs; (5) Streaming/dataflow — Kafka internals, Apache Flink, and the 'Turning the Database Inside Out' talk by Kleppmann.
After the roadmap: contribute to an open-source distributed system (etcd, TiKV, CockroachDB, NATS, Redis, Kafka, TigerBeetle), publish your own architecture deep-dives and your mini-Jepsen writeup, and read each year's OSDI/NSDI proceedings as they drop — a self-sustaining loop for the next decade.

How you'll know you've actually got it

Given any 'design X' prompt cold, you drive the whole framework in ~45 minutes — requirements, back-of-envelope estimates, APIs, data model, high-level design, two deep-dives — and you volunteer the bottlenecks, SPOFs, and consistency tradeoffs before anyone asks.
You can derive (not recall) quorum availability, fan-out storage costs, the tail-latency probability of a fan-out request, and the throughput cliff from Little's Law on paper, with the actual numbers.
You have a Raft implementation that passes a chaos-injecting test suite, and you can reconstruct a specific safety scenario it prevents (e.g. Figure 8) — not just 'it does leader election'.
You read a new architecture writeup or paper and instinctively ask 'what's the failure model, what does this NOT guarantee, and where would Jepsen break it?' — and you're usually right.
You can argue both sides of a real tradeoff (fan-out-on-write vs read, strong vs eventual consistency, SQL vs wide-column, OT vs CRDT) with concrete numbers and failure modes, and pick correctly for a given workload instead of by reflex.
You've built at least one of: a log-structured storage engine, a replicated/leaderless KV store, a distributed queue, or a sharded consensus-backed store — and you've broken it on purpose with partitions/clock skew and understood exactly why it failed.
You can take an 'impossible' requirement (globally-consistent + highly-available counter, exactly-once delivery) and explain precisely why it's impossible, what the achievable approximation is (exactly-once effect, commit-wait, etc.), and what it costs in latency and availability.
Peers bring you their designs for review, and your feedback consistently surfaces the failure mode they missed — you've become the person who finds the partition that breaks it.

← all roadmaps · back to hub