My Learning Hub

System Design

Go from building CRUD backends to designing Twitter/Uber/Dropbox-scale systems and defending every tradeoff under load, failure, and consistency constraints — and then actually building the hard distributed primitives yourself.

System design is the skill that separates a senior engineer who ships features from a staff/principal engineer who shapes architecture.

The roadmap

Stage 1 — Fundamentals: latency, throughput, consistency, and back-of-envelope · 3-4 weeks
Build the quantitative and vocabulary foundation: reason numerically about latency/throughput, understand the CAP/PACELC tradeoff space, and estimate capacity before drawing a single box.

Concepts, resources and problems

Concepts — Latency vs throughput; tail latency (p50/p95/p99) and why averages lie; the tail-at-scale effect (one slow component dominates fan-out requests) · Latency numbers every engineer should know (memory vs SSD vs network vs cross-datacenter) · Back-of-envelope estimation: QPS, storage, bandwidth, memory, server count · Little's Law and basic queueing intuition (why a system falls off a cliff near saturation; the M/M/1 latency-vs-utilization curve) · CAP theorem precisely (it's about partitions), and PACELC as the better framing · Consistency models: linearizability, sequential, causal, eventual, read-your-writes — and the cost asymmetry between them · Availability math: nines, MTBF/MTTR, why redundancy helps (independent replicas multiply their failure probabilities, so unavailability shrinks), and why serial dependencies hurt (their availabilities multiply, so downtime roughly adds up) · Vertical vs horizontal scaling; statelessness; the 8 fallacies of distributed computing
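
Several of the quantities in the concepts list above reduce to one-line formulas, and it can help to see them side by side. The following is a minimal sketch, not a reference implementation: the function names are made up for illustration, the fan-out and redundancy formulas assume independent failures, and the queueing formula assumes an idealized M/M/1 queue.

```python
def fanout_slow_probability(p_slow: float, fanout: int) -> float:
    """P(at least one of `fanout` independent backend calls is slow)."""
    return 1.0 - (1.0 - p_slow) ** fanout


def serial_availability(*parts: float) -> float:
    """A request needs every dependency up, so availabilities multiply."""
    total = 1.0
    for a in parts:
        total *= a
    return total


def redundant_availability(a: float, replicas: int) -> float:
    """Any one replica suffices, so failure probabilities multiply.

    Assumes replicas fail independently, which real systems often violate.
    """
    return 1.0 - (1.0 - a) ** replicas


def mm1_mean_latency(service_time_s: float, utilization: float) -> float:
    """Mean time in an M/M/1 system (queueing + service): W = S / (1 - rho)."""
    if not 0.0 <= utilization < 1.0:
        raise ValueError("utilization must be in [0, 1)")
    return service_time_s / (1.0 - utilization)


if __name__ == "__main__":
    # 1% of calls are slow; a request fans out to 100 backends.
    print(f"slow fan-out request: {fanout_slow_probability(0.01, 100):.3f}")
    # Three 99.9% dependencies in series vs. two 99% replicas in parallel.
    print(f"serial:    {serial_availability(0.999, 0.999, 0.999):.6f}")
    print(f"redundant: {redundant_availability(0.99, 2):.6f}")
    # The latency cliff: a 10 ms service time at rising utilization.
    for rho in (0.5, 0.9, 0.99):
        print(f"rho={rho}: {mm1_mean_latency(0.010, rho) * 1000:.0f} ms")
```

With a 1% chance of a slow call, a 100-way fan-out is slow roughly 63% of the time, which is the core observation of The Tail at Scale. The M/M/1 loop shows the cliff: going from 50% to 99% utilization raises mean latency fiftyfold under these assumptions.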

Read — The System Design Primer (read intro + Performance vs scalability, Latency vs throughput, CAP, Consistency/Availability patterns) · Designing Data-Intensive Applications, Ch.1 (Reliable, Scalable, Maintainable) · Latency Numbers Every Programmer Should Know (interactive, by year) · The Tail at Scale (Dean & Barroso, CACM 2013) · Fallacies of Distributed Computing — and Marc Brooker's blog (AWS Principal Engineer) on why they bite

Watch — Gaurav Sen — System Design playlist (start: scalability, latency, throughput basics) · ByteByteGo — back-of-the-envelope estimation & latency fundamentals

Done when — You can take any 'design X for N users' prompt and, in under 10 minutes on paper, produce defensible QPS / storage / bandwidth / server-count estimates, derive the tail-latency probability for a fan-out request, and state precisely which consistency and availability guarantees you're targeting and why.
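
A back-of-envelope estimate of the kind described above is mostly multiplication, so it is easy to sketch as a small calculator. Everything here is an assumption chosen for illustration: the function name, the parameter names, the peak factor of 2, the five-year retention, and the example workload (100M daily active users, 2 writes per user per day, 100 reads per write, 1 KB per item).

```python
SECONDS_PER_DAY = 86_400


def estimate_capacity(
    dau: int,
    writes_per_user_per_day: float,
    reads_per_write: float,
    bytes_per_item: int,
    peak_factor: float = 2.0,        # assumed peak-to-average ratio
    retention_days: int = 5 * 365,   # assumed retention window
) -> dict:
    """Rough QPS, storage, and bandwidth numbers from workload assumptions."""
    writes_per_day = dau * writes_per_user_per_day
    avg_write_qps = writes_per_day / SECONDS_PER_DAY
    avg_read_qps = avg_write_qps * reads_per_write
    storage_per_day_bytes = writes_per_day * bytes_per_item
    return {
        "avg_write_qps": avg_write_qps,
        "peak_write_qps": avg_write_qps * peak_factor,
        "avg_read_qps": avg_read_qps,
        "peak_read_qps": avg_read_qps * peak_factor,
        "storage_per_day_gb": storage_per_day_bytes / 1e9,
        "total_storage_tb": storage_per_day_bytes * retention_days / 1e12,
        "avg_read_bandwidth_mb_s": avg_read_qps * bytes_per_item / 1e6,
    }


if __name__ == "__main__":
    est = estimate_capacity(
        dau=100_000_000,
        writes_per_user_per_day=2,
        reads_per_write=100,
        bytes_per_item=1_000,
    )
    for name, value in est.items():
        print(f"{name:>26}: {value:,.1f}")
```

For the example workload this gives about 2,300 writes per second on average, about 231,000 reads per second, 200 GB of new data per day, and 365 TB over five years before replication. On paper you would round these aggressively; the point is the order of magnitude, not the digits.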

Stage 2 — Building blocks: load balancers, caching, queues, CDNs, databases, replication, sharding · 5-7 weeks
Master the components every large system is assembled from, and the tradeoffs of each: when to cache, how to shard, what replication buys and costs, SQL vs NoSQL for real — and implement consistent hashing and an LSM/B-tree yourself.

Concepts, resources and problems

Concepts — Load balancing: L4 vs L7, algorithms (round-robin, least-conn, consistent hashing, power-of-two-choices), health checks · Caching: cache-aside vs write-through vs write-back, TTL/eviction (LRU/LFU/W-TinyLFU), cache stampede, thundering herd, request coalescing · CDNs and edge caching; push vs pull; cache invalidation (the hard problem) · Message queues & log-based brokers: at-least-once vs at-most-once, backpressure, dead-letter queues, Kafka's partitioned log model, consumer-group rebalancing · SQL vs NoSQL: when relational/ACID is right, when document/wide-column/KV wins; the real cost of joins and distributed transactions at scale · Indexing & storage engines: B-trees vs LSM-trees, write/read/space amplification, the RUM conjecture (read/update/memory tradeoff triangle), compaction · Replication: leader-follower, multi-leader, leaderless (Dynamo-style quorums); sync vs async; replication lag and its anomalies (read-your-writes, monotonic reads) · Partitioning/sharding: by key range vs hash; consistent hashing with virtual nodes; rebalancing; hot shards and the celebrity/whale problem

Read — DDIA Ch.3 Storage & Retrieval (Part I) and Part II Ch.5 Replication, Ch.6 Partitioning, Ch.7 Transactions · Dynamo: Amazon's Highly Available Key-value Store (SOSP 2007) · Kafka: a Distributed Messaging System for Log Processing (LinkedIn, NetDB 2011) · Jay Kreps — The Log: What every software engineer should know about real-time data's unifying abstraction

Watch — ByteByteGo — Consistent Hashing, Caching strategies, SQL vs NoSQL, Kafka explained · Hussein Nasser — Backend Engineering (databases, indexes, connection pooling, protocols) · CMU Intro to Database Systems (Andy Pavlo, 15-445) — storage, indexes, B-trees, LSM

Done when — Given any component choice (this cache vs that, SQL vs a wide-column store, sync vs async replication, hash vs range sharding) you can state the precise failure modes and consistency anomalies each introduces and pick correctly for a stated workload — and you've implemented consistent hashing AND a log-structured storage engine yourself with measured numbers.
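
As a starting point for the consistent-hashing exercise named above, here is a minimal sketch of a hash ring with virtual nodes. The class name, the default of 100 virtual nodes per physical node, and the choice of SHA-256 truncated to 64 bits are all assumptions made for illustration; a real implementation would also need replication and weighted nodes.

```python
import bisect
import hashlib


class HashRing:
    """Consistent-hash ring where each node owns several virtual points."""

    def __init__(self, vnodes: int = 100):
        self.vnodes = vnodes
        self._points: list[int] = []      # sorted positions on the ring
        self._owner: dict[int, str] = {}  # ring position -> node name

    @staticmethod
    def _hash(key: str) -> int:
        # 64-bit position taken from SHA-256; collisions are ignored here.
        return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")

    def add_node(self, node: str) -> None:
        for i in range(self.vnodes):
            point = self._hash(f"{node}#{i}")
            self._owner[point] = node
            bisect.insort(self._points, point)

    def remove_node(self, node: str) -> None:
        self._points = [p for p in self._points if self._owner[p] != node]
        self._owner = {p: n for p, n in self._owner.items() if n != node}

    def get_node(self, key: str) -> str:
        if not self._points:
            raise LookupError("ring is empty")
        # First point clockwise from the key's position, wrapping around.
        idx = bisect.bisect_right(self._points, self._hash(key))
        return self._owner[self._points[idx % len(self._points)]]


if __name__ == "__main__":
    ring = HashRing()
    for name in ("cache-a", "cache-b", "cache-c"):
        ring.add_node(name)
    keys = [f"user:{i}" for i in range(10_000)]
    before = {k: ring.get_node(k) for k in keys}
    ring.add_node("cache-d")
    moved = sum(1 for k in keys if ring.get_node(k) != before[k])
    print(f"keys moved after adding a node: {moved} of {len(keys)}")
```

The property worth measuring is that adding a node only moves keys onto the new node, and roughly 1/N of them, instead of reshuffling almost everything the way `hash(key) % N` does.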

Stage 3 — Distributed patterns: consensus, idempotency, rate limiting, CQRS, sagas, event sourcing · 6-8 weeks
Move from components to the coordination patterns that make distributed systems correct: agreement, exactly-once-effect, idempotency, transactional workflows across services — and understand consensus deeply enough to start implementing Raft.

Concepts, resources and problems

Concepts — Consensus: why it's needed, FLP impossibility, Paxos intuition, Raft (leader election, log replication, safety/Figure 8, membership changes), quorum intersection · Distributed transactions: 2PC and its blocking/coordinator-failure problem; why microservices avoid it; deterministic vs interactive transactions · Sagas (orchestration vs choreography) and compensating transactions for cross-service workflows · Idempotency & idempotency keys; exactly-once 'effect' vs exactly-once delivery (and why true exactly-once delivery is impossible — the Two Generals / FLP intuition) · Rate limiting algorithms: token bucket, leaky bucket, fixed/sliding window, sliding-window-log; distributed rate limiting with bounded staleness · CQRS and event sourcing: append-only log as source of truth, projections, replay, the read/write model split · Distributed time: clocks, NTP skew, logical clocks (Lamport), vector clocks, hybrid logical clocks (HLC); happens-before; why you can't trust wall clocks · Outbox pattern, change-data-capture (CDC), and reliably publishing events alongside DB writes (the dual-write problem)

Read — In Search of an Understandable Consensus Algorithm (Raft) — Ongaro & Ousterhout, plus the interactive viz · DDIA Part II Ch.8 (Trouble with Distributed Systems) & Ch.9 (Consistency and Consensus) · Distributed Systems for Fun and Profit (Takada) — esp. time/order and replication chapters · Marc Brooker — Timeouts, retries, backoff with jitter, and idempotency (brooker.co.za) + AWS Builders' Library

Watch — The Secret Lives of Data — Raft visualized (and the raft.github.io animation) · Martin Kleppmann — Distributed Systems lecture series (Cambridge, 8 lectures, free)

Done when — You can take a multi-service workflow (order -> payment -> inventory -> shipping) and design it correct under retries, partial failures, crashes, and out-of-order delivery using idempotency keys, sagas/outbox, CDC, and the right consistency model — and you can reconstruct Raft's Figure-8 safety argument well enough to start implementing it.
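
The idempotency-key idea in the workflow above can be sketched in a few lines: the first request with a given key runs the operation and records the result, and every retry with the same key gets the recorded result back instead of re-running the side effect. This is a single-process illustration only. The class name `IdempotentExecutor` is hypothetical, the store is an in-memory dict, and a real service would persist the key in the same transaction as the side effect.

```python
import threading
from typing import Callable, Dict


class IdempotentExecutor:
    """Runs an operation at most once per idempotency key (in-memory sketch)."""

    def __init__(self) -> None:
        self._results: Dict[str, object] = {}
        self._lock = threading.Lock()

    def execute(self, key: str, operation: Callable[[], object]) -> object:
        # Holding one lock across the call serializes all operations.
        # That is simple and correct here, but far too coarse for production.
        with self._lock:
            if key in self._results:
                return self._results[key]  # retry: replay the stored result
            result = operation()           # if this raises, nothing is stored
            self._results[key] = result
            return result


if __name__ == "__main__":
    charges = []

    def charge_card() -> dict:
        charges.append(100)
        return {"charge_id": len(charges)}

    executor = IdempotentExecutor()
    first = executor.execute("order-42", charge_card)
    retry = executor.execute("order-42", charge_card)  # client retried
    print(first, retry, charges)
```

The retry returns the same response as the first attempt and the card is charged once, which is what "exactly-once effect" means in practice: delivery may repeat, but the effect does not.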

Stage 4 — Full system designs: from URL shortener to news feed, chat, search, and geo · 6-8 weeks
Synthesize everything into complete end-to-end designs under realistic constraints, fluently navigating the whole tradeoff space out loud the way a staff engineer would in a design review — then diff your design against how the real company actually did it.

Concepts, resources and problems

Concepts — The design framework: clarify requirements -> estimate -> define APIs -> data model -> high-level design -> deep-dive -> identify bottlenecks/SPOFs · URL shortener / pastebin: ID generation (counter vs hash vs Snowflake), KV store, cache, analytics · News feed: fan-out-on-write vs fan-out-on-read, the celebrity hybrid, feed ranking, timeline storage, pagination · Chat/messaging: WebSockets vs long-poll, presence, delivery & read receipts, message ordering, offline message storage, multi-device sync · Search/typeahead: inverted indexes, tries, ranking, sharded search, autocomplete latency budgets · Notification systems & fan-out at scale; push vs pull; device-token management; deduplication · Geo & proximity systems (Uber/Yelp): geohashing, quadtrees, S2 cells, matching, real-time location streams · Observability of your design: metrics/logs/traces, RED/USE methods, where the bottlenecks and SPOFs are, and how you'd load/chaos test it

Read — System Design Primer — worked solutions (Pastebin, Twitter timeline, web crawler, mint, query cache) · awesome-scalability — real-world architectures and engineering blog index (Netflix, Uber, Discord, etc.) · High Scalability — real architecture case studies (see also Discord Engineering's 'How Discord Stores Trillions of Messages') · DDIA Ch.11 (Stream Processing) & Ch.12 (The Future of Data Systems)

Watch — ByteByteGo — full system design walkthroughs (Twitter, news feed, chat, notification, typeahead) · System Design Interview / Gaurav Sen — designing Uber, WhatsApp, Instagram, rate limiter

Done when — You can stand at a whiteboard and design Twitter, WhatsApp, a typeahead service, Uber, or a collaborative editor end-to-end in 45 minutes — driving requirements, estimates, data model, high-level architecture, and at least two deep-dives — while proactively naming bottlenecks, SPOFs, the exact consistency model each user sees, and how you'd chaos-test it.
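
ID generation comes up in nearly every one of these designs, so a Snowflake-style generator is a useful deep-dive to have ready. The sketch below assumes the commonly described 41/10/12 bit split (milliseconds since an epoch, worker ID, per-millisecond sequence). The epoch constant and the class name are hypothetical, and the clock is injectable purely so the behavior can be tested deterministically.

```python
import threading
import time

EPOCH_MS = 1_700_000_000_000  # hypothetical custom epoch, in milliseconds
WORKER_BITS = 10
SEQ_BITS = 12
SEQ_MASK = (1 << SEQ_BITS) - 1


class SnowflakeGenerator:
    """Roughly time-ordered 64-bit IDs: timestamp | worker | sequence."""

    def __init__(self, worker_id: int, clock=lambda: int(time.time() * 1000)):
        if not 0 <= worker_id < (1 << WORKER_BITS):
            raise ValueError("worker_id out of range")
        self.worker_id = worker_id
        self._clock = clock
        self._last_ms = -1
        self._seq = 0
        self._lock = threading.Lock()

    def next_id(self) -> int:
        with self._lock:
            now = self._clock()
            if now < self._last_ms:
                # Wall clocks can step backwards; refuse rather than risk duplicates.
                raise RuntimeError("clock moved backwards")
            if now == self._last_ms:
                self._seq = (self._seq + 1) & SEQ_MASK
                if self._seq == 0:
                    # Sequence exhausted for this millisecond: wait for the next.
                    while now <= self._last_ms:
                        now = self._clock()
            else:
                self._seq = 0
            self._last_ms = now
            timestamp = now - EPOCH_MS
            return (
                (timestamp << (WORKER_BITS + SEQ_BITS))
                | (self.worker_id << SEQ_BITS)
                | self._seq
            )


if __name__ == "__main__":
    gen = SnowflakeGenerator(worker_id=7)
    print([gen.next_id() for _ in range(3)])
```

The tradeoff to name in a design review: IDs need no coordination at generation time and sort roughly by creation time, but correctness depends on unique worker IDs and on the clock not going backwards.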

Stage 5 — Hard distributed systems: build the real thing and read the real papers · 10-16 weeks
Cross from 'can design on a whiteboard' to 'can build and reason about genuinely hard distributed systems'. Implement consensus, replication, and a sharded log; read the canonical papers critically; then stress your own system like Jepsen would.

Concepts, resources and problems

Concepts — Implementing Raft from scratch: leader election, log replication, persistence, snapshotting, log compaction, membership changes · Building a replicated, linearizable, sharded key-value store on top of consensus (the MIT 6.5840 labs arc: Raft -> KVRaft -> Sharded KV with live migration) · Linearizability testing and how to break your own system: partitions, clock skew, message reordering, GC pauses (Jepsen/Maelstrom/Knossos) · Exactly-once-effect pipelines: dedup, idempotent consumers, transactional outbox + CDC, and the precise limits of 'exactly once' · Globally-consistent state: TrueTime/Spanner, external consistency, the latency cost of strong consistency across regions · CRDTs and coordination-free replication: when you can avoid consensus entirely; CALM theorem (consistency as logical monotonicity) · Deterministic simulation testing (FoundationDB / TigerBeetle style) as the frontier of correctness, and TLA+ for model-checking your design · Reading distributed systems papers critically: failure model, assumptions, and — most important — what they explicitly DON'T guarantee

Read — MIT 6.5840 (formerly 6.824) — schedule, paper list, and the Raft / KVRaft / Sharded-KV labs · Spanner: Google's Globally-Distributed Database (OSDI 2012) · Jepsen analyses — how real databases violate their consistency claims · Papers We Love — distributed systems & datastores collection (Raft, Dynamo, Paxos Made Simple, Bigtable, Chubby, ZooKeeper/Zab)

Watch — MIT 6.824 lectures (Robert Morris) — full course on YouTube (Spring 2020) · Kyle Kingsbury (Aphyr) — Jepsen talks: 'breaking distributed systems for fun and profit'

Done when — You have a working Raft implementation (ideally a sharded, live-migrating KV store) that passes the 6.5840 chaos suite, you've completed Gossip Glomers through totally-available transactions, you can read and critique the Dynamo/Raft/Spanner/Kafka papers including what they DON'T guarantee, and you can design a globally-consistent or exactly-once system and state precisely what it costs in latency and availability.
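
The claim that CRDTs let you avoid consensus is easiest to see with the simplest one, a grow-only counter. In this sketch each node increments only its own slot, and merging takes the per-node maximum, so merges are commutative, associative, and idempotent. The class name and API are illustrative, not taken from any particular library.

```python
class GCounter:
    """Grow-only counter CRDT: one monotonically increasing slot per node."""

    def __init__(self, node_id: str) -> None:
        self.node_id = node_id
        self.counts: dict[str, int] = {}

    def increment(self, amount: int = 1) -> None:
        if amount < 0:
            raise ValueError("a grow-only counter cannot decrease")
        self.counts[self.node_id] = self.counts.get(self.node_id, 0) + amount

    def value(self) -> int:
        return sum(self.counts.values())

    def merge(self, other: "GCounter") -> None:
        # Per-node max: safe to apply repeatedly and in any order.
        for node, count in other.counts.items():
            self.counts[node] = max(self.counts.get(node, 0), count)


if __name__ == "__main__":
    a, b = GCounter("a"), GCounter("b")
    a.increment(3)      # these two updates happen concurrently,
    b.increment(2)      # e.g. on opposite sides of a partition
    a.merge(b)
    b.merge(a)
    a.merge(b)          # duplicate delivery is harmless
    print(a.value(), b.value())
```

Both replicas converge to 5 regardless of merge order or duplication. What the counter does not give you is a linearizable read: a replica can return a stale total until it has merged everyone else's state.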

Projects

  • Distributed token-bucket rate limiter — Build a rate limiter as a standalone service: token bucket + sliding window, backed by Redis (or your own shared counter), enforcing per-user and per-IP limits across multiple app instances. Expose it as middleware and load-test it to find the cliff.
  • Log-structured storage engine (Bitcask/LSM) — Build an embeddable KV storage engine from scratch: an append-only write-ahead log, an in-memory index, crash recovery by replaying the log, and background compaction. Expose put/get/delete/scan and a benchmark harness.
  • Replicated key-value store — Implement a KV store with leader-follower replication: a leader accepts writes, replicates a WAL to followers, followers serve reads. Add tunable read consistency (read-from-leader vs read-from-any with bounded staleness) and failover.
  • Kafka-lite: a partitioned, replicated message log — Build a Kafka-lite: a partitioned, append-only log with producers, consumers, consumer groups, offset tracking, disk-persisted segments, and at-least-once delivery with replay from any offset.
  • Raft-backed, sharded, fault-tolerant KV store (FLAGSHIP) — Implement Raft from scratch (election, log replication, persistence, snapshots/compaction), build a linearizable KV store on top, then shard it across multiple Raft groups with a shard controller and live shard migration. Follow the MIT 6.5840 lab arc (Labs 3-5) and pass its chaos test suite.
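
For the rate-limiter project, the core token-bucket logic is small enough to write first and distribute second. The sketch below is a single-process version with lazy refill. The class name and parameters are assumptions for illustration, and the injectable clock exists only to make the behavior testable; moving the bucket state into Redis or another shared store is the actual project.

```python
import time


class TokenBucket:
    """Token bucket with lazy refill: tokens accrue at `rate_per_s` up to `capacity`."""

    def __init__(self, rate_per_s: float, capacity: float, clock=time.monotonic):
        self.rate_per_s = rate_per_s
        self.capacity = capacity
        self._clock = clock
        self._tokens = float(capacity)  # start full, allowing an initial burst
        self._last = clock()

    def allow(self, cost: float = 1.0) -> bool:
        now = self._clock()
        # Refill for the time elapsed since the last call, capped at capacity.
        elapsed = now - self._last
        self._tokens = min(self.capacity, self._tokens + elapsed * self.rate_per_s)
        self._last = now
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False


if __name__ == "__main__":
    bucket = TokenBucket(rate_per_s=5, capacity=10)
    allowed = sum(bucket.allow() for _ in range(20))
    print(f"allowed {allowed} of 20 back-to-back requests")
```

This version is not thread-safe and keeps state in one process. In the distributed variant the read-refill-decrement sequence has to be atomic on the shared store, which is where most of the interesting failure modes appear.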

Going harder

Hard problem arena — 6 brutal problems
  • [brutal] Gossip Glomers — Fly.io distributed systems challenges — Six escalating challenges run against a Jepsen-grade verifier: unique ID generation, broadcast/gossip with partitions, a grow-only counter, single- and multi-node Kafka-style logs, and totally-available transactions. Your code is adversarially partition-tested by a real consistency checker. This is the closest thing to 'Jane Street puzzles for distributed systems' that exists — the fault-tolerant and efficient broadcast (3c-3e) and totally-available-transactions (6) problems defeat most people.
  • [legendary] MIT 6.5840 Labs — Raft, fault-tolerant KV on Raft, sharded KV with live migration — The legendary lab arc. The Raft lab (Lab 3) alone defeats most people on the first attempt; the sharded-KV-with-live-migration lab (Lab 5) is one of the hardest self-study projects in all of systems. Test suites inject partitions, restarts, and reordering specifically to break naive implementations — and they run hundreds of times to catch rare races.
  • [legendary] Reproduce a Jepsen analysis on a real database — Pick a database, read Jepsen's published analysis in full, then reproduce a consistency violation yourself (or verify a claimed fix) using partitions and clock skew with Maelstrom or the Jepsen library. Reading these adversarially, then reproducing one, is graduate-level correctness work and an elite resume line.
  • [legendary] Model-check a real protocol in TLA+ (or P) — Specify a non-trivial protocol — your own commit protocol, a cache-coherence scheme, or a simplified Raft — in TLA+ and let the model checker (TLC) find the deadlock or invariant violation you didn't see. AWS famously found serious bugs in S3/DynamoDB this way. Formally verifying (or refuting) your own design is a rare skill that sets principal engineers apart.
  • [brutal] Design a globally-consistent counter / exactly-once pipeline and confront the impossibility — The 'impossible-feeling' design problems: a counter that's both globally linearizable AND highly available (and confronting why CAP/PACELC say you can't have both during a partition), or an exactly-once delivery pipeline (and why true exactly-once delivery is impossible — Two Generals — but exactly-once EFFECT isn't). The reasoning, and being able to say precisely what's achievable and what it costs, is the prize.
  • [brutal] Paper reproductions: implement the load-bearing idea from a canonical paper — Don't just read the paper — build the load-bearing idea: a Bitcask-style log-structured KV store, an LSM-tree storage engine, vector clocks with conflict resolution, hybrid logical clocks, or a Bigtable-style SSTable layer. Implementing forces out every hand-wave the paper let you skip — this is where reading turns into understanding.
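
Of the reproduction targets listed above, vector clocks are the quickest to get on the page. The following sketch represents a clock as a plain dict from node name to counter and shows the three operations that matter: local increment, merge on receive, and the comparison that detects concurrent updates. The function names are hypothetical, and conflict resolution itself (what to do once two versions are found to be concurrent) is left out.

```python
def vc_increment(clock: dict, node: str) -> dict:
    """Return a copy of `clock` with `node`'s entry advanced by one."""
    updated = dict(clock)
    updated[node] = updated.get(node, 0) + 1
    return updated


def vc_merge(a: dict, b: dict) -> dict:
    """Pointwise maximum: the smallest clock that dominates both inputs."""
    return {n: max(a.get(n, 0), b.get(n, 0)) for n in set(a) | set(b)}


def vc_compare(a: dict, b: dict) -> str:
    """Classify the happens-before relation between two clocks."""
    nodes = set(a) | set(b)
    a_le_b = all(a.get(n, 0) <= b.get(n, 0) for n in nodes)
    b_le_a = all(b.get(n, 0) <= a.get(n, 0) for n in nodes)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"      # a happened before b
    if b_le_a:
        return "after"       # b happened before a
    return "concurrent"      # neither dominates: a genuine conflict


if __name__ == "__main__":
    base = vc_increment({}, "n1")        # a write on n1
    left = vc_increment(base, "n2")      # n2 updates after seeing base
    right = vc_increment(base, "n1")     # n1 updates without seeing left
    print(vc_compare(base, left))        # base precedes left
    print(vc_compare(left, right))       # siblings: neither saw the other
    print(vc_merge(left, right))
```

The "concurrent" case is the whole reason to carry a vector instead of a single Lamport timestamp: it tells the store that two versions are siblings and that picking one silently would lose a write.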

Keep curious

Blogs, people, communities, rabbit holes
  • Books beyond DDIA: 'Database Internals' (Alex Petrov) for storage engines & distributed DB internals; 'Understanding Distributed Systems' (Roberto Vitillo) as a modern approachable companion; 'Designing Distributed Systems' (Brendan Burns) for cloud-native patterns; and the new 2nd edition of DDIA (Kleppmann & Riccomini) as it ships.
  • Blogs to follow religiously: Marc Brooker (https://brooker.co.za/blog/) — AWS principal engineer, the clearest writer alive on retries/timeouts/consistency/load; Murat Demirbas 'Metadata' (https://muratbuffalo.blogspot.com/) — distributed-systems paper reviews; Aphyr/Kyle Kingsbury (https://aphyr.com/) — correctness and Jepsen teardowns; Brendan Gregg (https://www.brendangregg.com/) — performance & observability; Andy Pavlo / CMU DB group blog for DB internals.
  • Engineering blogs (read the source): AWS Builders' Library (https://aws.amazon.com/builders-library/ — short, deep, free essays by AWS principal engineers; start with the retries, leader-election, and caching ones), Netflix Tech Blog, Uber Engineering, Discord Engineering ('How Discord Stores Trillions of Messages'), Cloudflare Blog, Slack Engineering, and Stripe/Shopify engineering for correctness-at-scale.
  • Newsletters: ByteByteGo (https://blog.bytebytego.com/) for digestible system design; Gergely Orosz 'The Pragmatic Engineer' (https://newsletter.pragmaticengineer.com/) for real-world scale stories; and the archived 'Morning Paper' by Adrian Colyer (https://blog.acolyer.org/) for hundreds of paper summaries — still gold even though it's no longer updated.
  • Communities: r/ExperiencedDevs and r/distributedsystems; Papers We Love (https://paperswelove.org/ — local meetups + a huge talk archive on YouTube); the dbms.social / distributed-systems Discords; lobste.rs and Hacker News for architecture writeups and the comment debates that often beat the article.
  • Competitions & arenas to keep sharp: Fly.io Gossip Glomers (https://fly.io/dist-sys/), the MIT 6.5840 labs as a recurring re-do challenge, reproducing one Jepsen analysis per year as a personal tradition, and the TigerBeetle / FoundationDB simulation-testing rabbit hole.
  • Conferences & talks (watch the recordings): USENIX OSDI / NSDI / ATC (https://www.usenix.org/conferences — where the field's papers debut); CMU Database Group tech talks on YouTube (Andy Pavlo) for cutting-edge DB internals; the archived Strange Loop talks; QCon; and PWLConf (Papers We Love Conf).
  • People to follow: Martin Kleppmann (DDIA, CRDTs, local-first), Kyle Kingsbury / @aphyr (Jepsen), Marc Brooker (AWS), Andy Pavlo (CMU databases), Jay Kreps (Kafka/'The Log'), Murat Demirbas, Caitie McCaffrey (sagas / distributed transactions talks), Peter Bailis (consistency), and the TigerBeetle team (Joran Greef) for deterministic simulation.
  • Frontier rabbit holes — 'if this clicks, go deeper here': (1) CRDTs and local-first software (https://crdt.tech/, Kleppmann's local-first essay) — coordination-free replication; (2) Deterministic simulation testing (FoundationDB's approach + TigerBeetle's writeups and the 'VOPR' fuzzer) — the future of correctness testing; (3) Distributed SQL / NewSQL internals (Spanner, CockroachDB, FoundationDB, TigerBeetle architecture docs); (4) Formal methods — TLA+ (Lamport) and the P language (Microsoft) to model-check your own designs; (5) Streaming/dataflow — Kafka internals, Apache Flink, and the 'Turning the Database Inside Out' talk by Kleppmann.
  • After the roadmap: contribute to an open-source distributed system (etcd, TiKV, CockroachDB, NATS, Redis, Kafka, TigerBeetle), publish your own architecture deep-dives and your mini-Jepsen writeup, and read each year's OSDI/NSDI proceedings as they drop — a self-sustaining loop for the next decade.

How you'll know you've actually got it
  • Given any 'design X' prompt cold, you drive the whole framework in ~45 minutes — requirements, back-of-envelope estimates, APIs, data model, high-level design, two deep-dives — and you volunteer the bottlenecks, SPOFs, and consistency tradeoffs before anyone asks.
  • You can derive (not recall) quorum availability, fan-out storage costs, the tail-latency probability of a fan-out request, and the throughput cliff from Little's Law on paper, with the actual numbers.
  • You have a Raft implementation that passes a chaos-injecting test suite, and you can reconstruct a specific safety scenario it prevents (e.g. Figure 8) — not just 'it does leader election'.
  • You read a new architecture writeup or paper and instinctively ask 'what's the failure model, what does this NOT guarantee, and where would Jepsen break it?' — and you're usually right.
  • You can argue both sides of a real tradeoff (fan-out-on-write vs fan-out-on-read, strong vs eventual consistency, SQL vs wide-column, OT vs CRDT) with concrete numbers and failure modes, and pick correctly for a given workload instead of by reflex.
  • You've built at least one of: a log-structured storage engine, a replicated/leaderless KV store, a distributed queue, or a sharded consensus-backed store — and you've broken it on purpose with partitions/clock skew and understood exactly why it failed.
  • You can take an 'impossible' requirement (globally-consistent + highly-available counter, exactly-once delivery) and explain precisely why it's impossible, what the achievable approximation is (exactly-once effect, commit-wait, etc.), and what it costs in latency and availability.
  • Peers bring you their designs for review, and your feedback consistently surfaces the failure mode they missed — you've become the person who finds the partition that breaks it.
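
As one example of "derive, not recall" from the checklist above, quorum availability is a binomial tail: a quorum operation succeeds when at least `quorum` of `n` replicas are up. The sketch below assumes replicas fail independently with the same probability, which is the usual on-paper simplification and rarely true in production. The function name is made up for illustration.

```python
from math import comb


def quorum_availability(n: int, quorum: int, p_up: float) -> float:
    """P(at least `quorum` of `n` independent replicas are up)."""
    if not 0 < quorum <= n:
        raise ValueError("quorum must be in 1..n")
    return sum(
        comb(n, k) * p_up**k * (1.0 - p_up) ** (n - k)
        for k in range(quorum, n + 1)
    )


if __name__ == "__main__":
    # Each replica is up 99% of the time (an assumed figure).
    for n in (1, 3, 5):
        majority = n // 2 + 1
        print(f"n={n}, majority={majority}: "
              f"{quorum_availability(n, majority, 0.99):.6f}")
```

For three replicas at 99% each, a majority quorum is available about 99.97% of the time: 3 × 0.99² × 0.01 + 0.99³. Requiring all three (quorum = n) drops that to about 97%, which is the arithmetic behind why write-all schemes trade availability for simpler reads.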
