Going Deep in One ML Field

Survey the five deep ML fields seriously enough to pick ONE, then go from competent practitioner to someone who reproduces seminal papers to reported numbers and lands merged PRs at the frontier of that field.

ML breadth is a commodity now; depth is not.

The roadmap

Stage 1 — Field A — LLMs & NLP (build a language model from scratch) · 10-14 weeks
Understand transformers and modern LLMs deeply enough to implement one end-to-end: tokenizer, architecture, training loop, attention internals, and inference. This is the field most adjacent to a backend engineer who likes systems-flavored math.

Concepts, resources and problems

Concepts — Tokenization (BPE, byte-level), embeddings, positional encodings (sinusoidal, learned, RoPE, ALiBi) · Self-attention from first principles; multi-head, causal masking, KV-cache, grouped-query attention · Transformer block: residual streams, LayerNorm/RMSNorm, MLP/SwiGLU, pre-norm vs post-norm · Training: cross-entropy/next-token loss, AdamW, warmup+cosine schedule, gradient clipping, mixed precision, gradient accumulation · Scaling laws (Chinchilla-optimal compute/data tradeoff) and why they govern every training decision · Mixture-of-Experts, long-context tricks (sliding window, YaRN), state-space alternatives (Mamba) as contrast · Post-training: SFT, RLHF/DPO, reward models, instruction tuning · Inference: sampling (temperature/top-k/top-p), speculative decoding, quantization for serving, continuous batching · Evaluation: perplexity, benchmarks, contamination, and why eval is genuinely hard and easy to fool yourself on

Read — Stanford CS336: Language Modeling from Scratch (Spring 2025) — site + assignments · The Annotated Transformer (Harvard NLP) · Attention Is All You Need (Vaswani et al., 2017) · Lilian Weng — The Transformer Family v2 · Sebastian Raschka — Build a Large Language Model (From Scratch) repo

Watch — Andrej Karpathy — Let's build GPT: from scratch, in code, spelled out · Stanford CS336 — Language Modeling from Scratch (full 2025 lecture playlist) · Karpathy — Let's reproduce GPT-2 (124M)

Problems

medium Implement multi-head causal self-attention from scratch (no nn.MultiheadAttention); match a reference forward AND backward pass to 1e-5 — Forces you to truly understand shapes, masking, and numerical stability — and gradient-checking the backward pass is where the fakery gets exposed.
hard Write your own byte-level BPE tokenizer, train it on a corpus, and match HF tokenizer behavior byte-for-byte on tricky unicode, emoji, and special tokens — Tokenization is where subtle bugs live; getting merges + special tokens + pre-tokenization regex exactly right is genuinely hard and silently breaks everything downstream.
brutal Reproduce GPT-2 (124M) training and hit the reference HellaSwag accuracy / validation loss on FineWeb-Edu — A full seminal-model reproduction: data pipeline, mixed precision, LR schedule, distributed training, eval. This is the stage milestone and the bar is a published number, not 'it trained.'
hard Implement KV-cache + speculative decoding from scratch and measure real tokens/sec speedup against a verified-identical-output baseline — Inference efficiency is where backend instincts pay off; correctness under caching and draft-model acceptance is subtle, and the speedup is meaningless if outputs drift.
brutal Implement a Triton flash-attention forward kernel and beat naive PyTorch attention at long sequence lengths inside your GPT-2 — The hard cross-over into MLsys: IO-aware tiling inside your own model. If this is the part you love, Field E is calling.
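The warmup+cosine schedule from the concepts list is small enough to write down before you start the GPT-2 reproduction. A minimal sketch, assuming linear warmup followed by cosine decay to a floor; the function name `lr_at` and every default value here are illustrative placeholders, not a reference recipe:

```python
import math

def lr_at(step, max_lr=6e-4, min_lr=6e-5, warmup_steps=100, total_steps=1000):
    """Linear warmup to max_lr, then cosine decay down to min_lr.

    All defaults are hypothetical values chosen for illustration.
    """
    if step < warmup_steps:
        # Ramp linearly so the first step is small but non-zero.
        return max_lr * (step + 1) / warmup_steps
    if step >= total_steps:
        return min_lr
    # progress goes from 0 to 1 over the decay phase.
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + cosine * (max_lr - min_lr)
```

Plotting `lr_at(s)` over all steps before launching a long run is a cheap way to catch off-by-one warmup bugs.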

Done when — You have a from-scratch GPT-2-class model: your own BPE tokenizer, transformer, training loop, and an inference path with KV-cache — and you reproduced a published eval number on FineWeb-Edu. You can read any LLM paper and implement its core idea in a day.
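For the inference path, temperature and top-p (nucleus) sampling reduce to a few lines once you have logits. A rough sketch in plain Python, assuming a single position's logits as a list of floats; `softmax` and `top_p_filter` are hypothetical helper names, and a real implementation would work on tensors:

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax, shifted by the max for numerical stability."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_filter(probs, p=0.9):
    """Keep the smallest set of tokens whose cumulative probability reaches p.

    Returns {token_index: renormalized_probability} for the kept tokens.
    """
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}
```

Sampling then means drawing one index from the returned distribution, for example with `random.choices(list(d), weights=list(d.values()))`.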

Stage 2 — Field B — Computer Vision (from CNNs to diffusion & ViTs) · 10-14 weeks
Master the visual stack: convnets, vision transformers, self-supervised learning, and generative models (GANs/diffusion). Strong math, strong systems, and the most visually rewarding feedback loop.

Concepts, resources and problems

Concepts — Convolutions, pooling, receptive fields, ResNets and skip connections · Batch/Layer/Group norm; data augmentation as regularization (RandAugment, MixUp, CutMix) · Vision Transformers (ViT), patch embeddings, attention over image patches, why ViTs need data or strong augmentation · Object detection (R-CNN family, YOLO, DETR) and segmentation (U-Net, Mask R-CNN, SAM) · Self-supervised learning: contrastive (SimCLR, MoCo), DINO/DINOv2, masked autoencoders · Generative models: VAEs, GANs, and diffusion (DDPM, score-based / SDE, latent diffusion) · CLIP and multimodal/vision-language alignment · Classifier-free guidance, sampling schedules (DDIM, ancestral), and why diffusion training is stable but inference is an art

Read — Stanford CS231n: Deep Learning for Computer Vision — notes · CS231n 2025 course site & assignments · Lilian Weng — What are Diffusion Models? · An Image is Worth 16x16 Words (ViT, Dosovitskiy et al.) · Denoising Diffusion Probabilistic Models (DDPM, Ho et al.) · Hugging Face Diffusion Models Course

Watch — Stanford CS231n — Deep Learning for Computer Vision 2025 (playlist) · Outlier — Diffusion Models / DDPM math walkthrough · Yannic Kilcher — Vision Transformer (ViT) paper explained

Problems

medium Implement a ResNet and train CIFAR-10 to >94% from scratch, then ablate residual connections, BN, and augmentation to quantify exactly what each is worth — The classic CV rite of passage; the ablation table forces causal understanding, not just copying a training script.
hard Implement DDPM from scratch, derive and numerically verify the simplified loss, and generate coherent CIFAR-10 samples with classifier-free guidance — Diffusion is math-heavy; a correct, stable, from-scratch implementation whose loss matches your hand-derivation is a genuine accomplishment.
brutal Reproduce ViT on ImageNet-1k (or a documented subset) and match reported top-1 accuracy within a point, including the augmentation/regularization recipe the paper under-reports — Full seminal reproduction with the data/augmentation/optimization tricks that papers gloss over; timm is the reference to diff against.
hard Build CLIP-style contrastive training on a small image-text dataset and demonstrate zero-shot classification beating a random/linear baseline by a clear margin — Multimodal alignment from scratch tests both the symmetric InfoNCE loss design and large-batch systems realities (the batch IS the negatives).
hard Reproduce a latent-diffusion fine-tune (e.g. DreamBooth/LoRA on Stable Diffusion) and quantify identity fidelity vs prompt adherence — Bridges from-scratch DDPM to the production generative stack everyone actually ships; the failure modes (overfitting, language drift) are real frontier problems.
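Before writing the DDPM training loop, it helps to pin down the forward (noising) process and the simplified loss in isolation. The sketch below assumes the linear beta schedule from the DDPM paper (1e-4 to 0.02 over 1000 steps) and uses plain Python lists in place of image tensors; all function names are hypothetical:

```python
import math
import random

def linear_beta_schedule(T=1000, beta_start=1e-4, beta_end=0.02):
    """Per-step noise variances beta_1..beta_T, linearly spaced."""
    return [beta_start + (beta_end - beta_start) * t / (T - 1) for t in range(T)]

def alpha_bars(betas):
    """Cumulative products abar_t = prod_{s<=t} (1 - beta_s)."""
    out, prod = [], 1.0
    for b in betas:
        prod *= 1.0 - b
        out.append(prod)
    return out

def q_sample(x0, t, abar, rng):
    """Draw x_t ~ N(sqrt(abar_t) * x0, (1 - abar_t) * I) in a single step."""
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    a, s = math.sqrt(abar[t]), math.sqrt(1.0 - abar[t])
    xt = [a * x + s * e for x, e in zip(x0, eps)]
    return xt, eps

def simple_loss(eps, eps_pred):
    """Simplified DDPM objective: mean squared error on the predicted noise."""
    return sum((e - p) ** 2 for e, p in zip(eps, eps_pred)) / len(eps)
```

A useful sanity check is that `alpha_bars(...)[-1]` is close to zero, so the final step is almost pure noise; if it is not, the sampler starts from the wrong distribution.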

Done when — You can implement and train a CNN, a ViT, and a diffusion model from scratch, and you've reproduced one seminal vision paper to its reported numbers. You understand the generative stack well enough to debug sampling artifacts from first principles.

Stage 3 — Field C — Reinforcement Learning (the hardest math, the deepest rabbit hole) · 12-16 weeks
Go from MDPs and dynamic programming to deep RL (DQN, policy gradients, PPO) and into the frontier (RLHF, offline RL, world models, agents). This is the most mathematically demanding field and the one most prone to brutal silent-failure debugging — peak Jane-Street energy.

Concepts, resources and problems

Concepts — Markov Decision Processes, Bellman equations, value/policy iteration, dynamic programming, contraction-mapping convergence proofs · Model-free prediction & control: Monte Carlo, TD(λ), Q-learning, SARSA · Function approximation and the deadly triad (bootstrapping + off-policy + approximation) and why it diverges · Deep value methods: DQN, double/dueling DQN, prioritized replay, target networks · Policy gradients: REINFORCE, actor-critic, A2C/A3C, GAE, the policy-gradient theorem derived · Trust regions: TRPO, PPO (the workhorse), and why clipping/KL control is everything · Exploration: epsilon-greedy, entropy bonus, intrinsic motivation (RND), UCB · Frontier: offline RL (CQL/IQL), model-based RL / world models (Dreamer), RLHF & DPO/GRPO for LLMs, multi-agent, LLM agents

Read — Sutton & Barto — Reinforcement Learning: An Introduction (free PDF, 2nd ed.) · OpenAI Spinning Up in Deep RL · Hugging Face Deep RL Course · CleanRL — single-file RL implementations · Proximal Policy Optimization (Schulman et al.) · Playing Atari with Deep RL (DQN, Mnih et al.)

Watch — DeepMind x UCL — Reinforcement Learning Lecture Series 2021 (Hado van Hasselt et al.) · Berkeley CS285 — Deep Reinforcement Learning (Sergey Levine, Fall 2023) · Costa Huang — The 37 Implementation Details of PPO

Problems

hard Solve Sutton & Barto Chapters 3-6 exercises by hand (Bellman optimality, value/policy iteration convergence, TD), and prove the policy-evaluation operator is a contraction — These are genuine proof-based math problems — the closest thing to Putnam energy inside ML proper.
brutal Implement DQN from scratch and reach a reported Atari Breakout score; debug the deadly triad, replay, target nets, frame stacking, and reward clipping until the curve matches CleanRL — DQN reproduction is infamous for silent failures; matching a logged benchmark curve, not 'it learns something,' is the bar.
brutal Implement PPO from scratch, get all 37 implementation details right, and match CleanRL reference returns on both a MuJoCo continuous task and an Atari task — The canonical 'why is RL so hard' challenge; a correct-looking PPO that silently underperforms is the rite of passage. Match the curve on two domains or you faked it.
hard Implement DPO (or GRPO) to fine-tune a small LLM from a preference dataset and show a measurable preference-win-rate improvement under a held-out judge — Connects RL to the LLM frontier; a strong cross-over project if you lean toward Field A, and the eval (not the training) is the hard part.
brutal Implement an offline-RL algorithm (CQL or IQL) on a D4RL dataset and beat behavior cloning on the reported normalized score — Offline RL is the current frontier and the distribution-shift failure modes are subtle; matching a normalized-score table is a real research-grade result.
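The contraction proof in the first problem can be checked numerically before you attempt it on paper. A small sketch, assuming a fixed policy on a made-up two-state MDP (the transition matrix `P`, rewards `r`, and `gamma` are invented for illustration): the policy-evaluation operator should shrink the sup-norm distance between any two value functions by at least a factor of `gamma`.

```python
def bellman_backup(V, P, r, gamma):
    """Policy-evaluation operator: (T V)(s) = r(s) + gamma * sum_s' P[s][s'] * V(s')."""
    n = len(V)
    return [r[s] + gamma * sum(P[s][s2] * V[s2] for s2 in range(n)) for s in range(n)]

def sup_norm_diff(U, V):
    """Sup-norm (max absolute difference) between two value functions."""
    return max(abs(u - v) for u, v in zip(U, V))

def evaluate_policy(P, r, gamma, tol=1e-10, max_iters=10_000):
    """Iterate the operator from zero until successive iterates differ by less than tol."""
    V = [0.0] * len(r)
    for _ in range(max_iters):
        V_new = bellman_backup(V, P, r, gamma)
        if sup_norm_diff(V, V_new) < tol:
            return V_new
        V = V_new
    return V

# Hypothetical two-state chain under a fixed policy.
P = [[0.9, 0.1], [0.2, 0.8]]
r = [1.0, 0.0]
gamma = 0.9

U, V = [0.0, 0.0], [10.0, -5.0]
before = sup_norm_diff(U, V)
after = sup_norm_diff(bellman_backup(U, P, r, gamma), bellman_backup(V, P, r, gamma))
print(after <= gamma * before)  # the contraction inequality for this pair
```

The numerical check is not a proof, but a counterexample here means a bug in your operator, which is worth knowing before you trust it inside DQN.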

Done when — You've solved the core Sutton & Barto exercises by hand and reproduced DQN and PPO to reported scores (matching benchmark curves, not vibes), having survived the deadly-triad debugging. You can read any deep-RL paper and spot the implementation details that will silently bite you.
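GAE is one of the places where a correct-looking PPO quietly goes wrong, usually at episode boundaries. A minimal sketch of the backward recursion, assuming `dones[t]` means the episode ended after step `t` and `last_value` is the critic's estimate for the state after the final step; the function name and argument layout are hypothetical, not CleanRL's:

```python
def compute_gae(rewards, values, last_value, dones, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one rollout.

    rewards, values, dones are equal-length lists; returns (advantages, returns).
    """
    n = len(rewards)
    advantages = [0.0] * n
    gae = 0.0
    for t in reversed(range(n)):
        next_value = last_value if t == n - 1 else values[t + 1]
        # Do not bootstrap or propagate advantage across an episode boundary.
        not_done = 0.0 if dones[t] else 1.0
        delta = rewards[t] + gamma * next_value * not_done - values[t]
        gae = delta + gamma * lam * not_done * gae
        advantages[t] = gae
    returns = [a + v for a, v in zip(advantages, values)]
    return advantages, returns
```

Two limiting cases make good unit tests: with `lam=0` the advantages are the one-step TD errors, and with `lam=1`, `gamma=1`, and a zero critic they are the reward-to-go sums.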

Stage 4 — Field D — Recommender Systems (where ML meets large-scale backend, your unfair advantage) · 8-12 weeks
Master industrial-scale recommendation: retrieval + ranking, embeddings, two-tower models, sequence models, and the systems that serve them at low latency. The most natural fit for a backend engineer — it IS a distributed systems problem with ML inside.

Concepts, resources and problems

Concepts — Problem framing: retrieval (candidate generation) vs ranking vs re-ranking funnel · Collaborative filtering, matrix factorization, implicit feedback (ALS, BPR) · Embeddings + approximate nearest neighbor search (FAISS, ScaNN, HNSW) and the recall/latency frontier · Two-tower / dual-encoder models for retrieval; negative sampling strategies (in-batch, hard negatives, mixed) · Feature engineering & wide-and-deep / DCN / DLRM for ranking; feature crosses · Sequence-based recommenders (GRU4Rec, SASRec, BERT4Rec) and the shift to generative/transformer recommenders · Calibration, position bias, and counterfactual/off-policy evaluation (IPS) · Serving: feature stores, real-time inference, latency budgets, A/B testing, sharded embedding tables — your backend wheelhouse

Read — Eugene Yan — Improving Recommendation Systems & Search in the Age of LLMs · Awesome Deep Learning Papers for Search/Recommendation/Advertising · Deep Neural Networks for YouTube Recommendations (Covington et al.) · Meta DLRM (Deep Learning Recommendation Model) paper · Google Recommendation Systems crash course (ML Education)

Watch — Stanford CS224W — Machine Learning with Graphs (Jure Leskovec) · ACM RecSys conference talks (official channel) · Two-Tower / dual-encoder retrieval explained

Problems

medium Implement matrix factorization (ALS + BPR) on MovieLens-25M and beat a popularity baseline on Recall@K and NDCG@K with a proper temporal train/test split — The foundational recsys problem done rigorously; the temporal split (not random) and proper ranking metrics are where most tutorials cheat.
hard Build a two-tower retrieval model + FAISS HNSW index serving sub-10ms ANN lookups under concurrent load; plot the recall@K vs latency Pareto frontier — The backend-engineer sweet spot: ML model + ANN index + latency budget under load. The Pareto curve, not a single number, is the deliverable.
brutal Reproduce DLRM on the Criteo 1TB click logs, match the reported AUC, and profile + optimize the embedding-table memory bottleneck — A real industrial-scale reproduction where the hard part is systems (terabyte sparse embedding tables, sharding) not just the model — pure unfair-advantage territory for you.
hard Implement SASRec (self-attentive sequential recommendation) and beat your MF baseline on the same temporal split and metrics — Bridges recsys to transformers; the sequence-modeling frontier of recommendation, and a fair head-to-head against your own baseline keeps you honest.
brutal Run a counterfactual / off-policy evaluation (IPS or doubly-robust) on logged data to estimate a new ranker's lift without an online A/B test — OPE is the genuinely hard, under-taught part of industrial recsys — getting an unbiased offline estimate of online lift is where the field's real difficulty lives.
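Several of the problems above are scored with Recall@K and NDCG@K, so it is worth fixing the definitions in code before comparing models. A sketch assuming binary relevance and a single user's ranked list; the function names are hypothetical, and libraries differ on edge cases such as users with no relevant items:

```python
import math

def recall_at_k(ranked, relevant, k):
    """Fraction of the relevant items that appear in the top k of the ranking."""
    if not relevant:
        return 0.0
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / len(relevant)

def ndcg_at_k(ranked, relevant, k):
    """Binary-relevance NDCG: DCG of this ranking divided by the DCG of an ideal one."""
    dcg = sum(
        1.0 / math.log2(rank + 2)  # rank is 0-based, so position 1 gets log2(2) = 1
        for rank, item in enumerate(ranked[:k])
        if item in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0
```

Average these per user over the held-out period of the temporal split; computing them over a random split would leak future interactions into training.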

Done when — You've built a full retrieval+ranking pipeline (two-tower retrieval -> ANN serving -> DLRM-style ranking) on a real dataset, reproduced an industrial paper to its reported metrics, and can reason quantitatively about the latency/recall/throughput tradeoffs end to end.
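The IPS estimator behind the off-policy evaluation problem is short; the difficulty is in the logged propensities and the variance. A minimal sketch, assuming each log record is a `(context, action, reward, logging_prob)` tuple and that `new_policy_prob` is a function you supply; both the record layout and the optional weight clipping are illustrative assumptions:

```python
def ips_estimate(logs, new_policy_prob, clip=None):
    """Inverse propensity scoring estimate of a new policy's average reward.

    Each logged reward is reweighted by pi_new(a | x) / pi_log(a | x).
    Clipping the weights lowers variance at the cost of bias.
    """
    total = 0.0
    for context, action, reward, logging_prob in logs:
        weight = new_policy_prob(context, action) / logging_prob
        if clip is not None:
            weight = min(weight, clip)
        total += weight * reward
    return total / len(logs)

# Hypothetical logs: a uniform logging policy over two actions, action 1 always pays 1.
logs = [
    ("u1", 0, 0.0, 0.5),
    ("u1", 1, 1.0, 0.5),
    ("u2", 0, 0.0, 0.5),
    ("u2", 1, 1.0, 0.5),
]

def always_action_one(context, action):
    """Deterministic candidate policy that always picks action 1."""
    return 1.0 if action == 1 else 0.0
```

In this toy log the unclipped estimate recovers the true value of the candidate policy; on real traffic, small logging probabilities produce huge weights, which is why doubly-robust estimators exist.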

Stage 5 — Field E — ML Systems / MLOps / Efficiency (the engineer's home field: make it fast) · 12-16 weeks
Own the layer that makes ML actually run: GPU kernels, distributed training, quantization, inference optimization, and the serving stack. This is where a backend engineer who loves systems has the most asymmetric advantage and the field is on fire in 2026.

Concepts, resources and problems

Concepts — GPU architecture: memory hierarchy (HBM/SRAM/registers), warps, occupancy, memory- vs compute-bound kernels · CUDA / Triton kernel programming; fusion, tiling, and the roofline model · FlashAttention and IO-aware algorithm design (why memory movement, not FLOPs, is the bottleneck) · Mixed precision (fp16/bf16/fp8), numerical stability, loss scaling · Quantization (GPTQ, AWQ, int8/int4, fp8) and pruning/distillation for inference · Distributed training: data/tensor/pipeline/sequence parallelism, ZeRO, FSDP · Inference systems: KV-cache management, paged attention (vLLM), continuous batching, speculative decoding · MLOps: experiment tracking, reproducibility, data/feature pipelines, model serving, observability

Read — CS336 systems lectures: GPUs, kernels & Triton, parallelism · FlashAttention paper (Dao et al.) · GPU MODE (formerly CUDA MODE) — lectures & community · Karpathy llm.c — LLM training in raw C/CUDA · Chip Huyen — Designing Machine Learning Systems / blog

Watch — GPU MODE lecture series (YouTube) · CUDA Programming Course — High-Performance Computing with GPUs (freeCodeCamp, full) · Stanford CS336 — kernels/Triton & parallelism lectures

Problems

hard Write a fused softmax and a tiled matmul in Triton, beat naive PyTorch on a real shape, and explain the achieved vs peak FLOPs with a roofline analysis — Your first real kernel where you must reason about memory movement and occupancy, not just correctness — and the roofline number keeps you honest.
brutal Reproduce a FlashAttention-style fused attention kernel in Triton and benchmark vs naive attention across sequence lengths, matching the IO-complexity story — The seminal MLsys reproduction; IO-aware tiling with online softmax is genuinely hard and the speedup curve must match the paper's IO argument.
brutal Solve KernelBench Level 1-2 problems: write GPU kernels that beat the PyTorch reference speed and pass correctness, and report fast_p — A 2025/2026 benchmark of exactly the hard kernel-writing tasks the frontier cares about right now — and there's a public leaderboard to measure yourself against.
hard Quantize an LLM to int4 (GPTQ/AWQ), measure quality (perplexity/benchmark) vs latency vs memory, then serve it with vLLM and benchmark throughput vs a naive HF serving baseline — End-to-end efficiency: this is the production inference problem the whole industry is racing on in 2026, and the three-way quality/latency/memory tradeoff is the real deliverable.
brutal Train a small model across 2+ GPUs with FSDP (or implement tensor parallelism for one layer by hand), verify the loss curve matches the single-GPU run within floating-point tolerance, and measure throughput scaling — Distributed training correctness (bit-for-bit-ish loss match) plus scaling efficiency is the hardest systems skill in the field and the gate to working on real training stacks.
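The roofline analysis asked for in the first kernel problem starts with arithmetic intensity. A back-of-envelope sketch, assuming each matrix crosses HBM exactly once (an idealization) and using invented hardware numbers; substitute your GPU's datasheet peak FLOP/s and bandwidth:

```python
def matmul_arithmetic_intensity(m, n, k, bytes_per_element=2):
    """FLOPs per byte for C[m, n] = A[m, k] @ B[k, n].

    Assumes A and B are read once and C is written once.
    """
    flops = 2 * m * n * k  # one multiply and one add per inner-product term
    bytes_moved = bytes_per_element * (m * k + k * n + m * n)
    return flops / bytes_moved

def roofline_attainable_flops(intensity, peak_flops, peak_bandwidth):
    """Attainable FLOP/s is capped by compute or by intensity times bandwidth."""
    return min(peak_flops, intensity * peak_bandwidth)

# Hypothetical accelerator: 100 TFLOP/s peak, 1 TB/s memory bandwidth.
PEAK_FLOPS = 100e12
PEAK_BANDWIDTH = 1e12

square = matmul_arithmetic_intensity(4096, 4096, 4096)  # large square matmul
matvec = matmul_arithmetic_intensity(4096, 1, 4096)     # matrix-vector product
print(roofline_attainable_flops(square, PEAK_FLOPS, PEAK_BANDWIDTH))
print(roofline_attainable_flops(matvec, PEAK_FLOPS, PEAK_BANDWIDTH))
```

Under these assumptions the square matmul sits on the compute roof while the matrix-vector product is bandwidth-bound, which is the same reason single-token decoding is dominated by memory traffic.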

Done when — You can write and profile custom Triton/CUDA kernels, you reproduced a FlashAttention-style kernel matching its IO story, scored on KernelBench, and you can take a model from fp32 to a quantized, vLLM-served endpoint with measured throughput gains. You can read the MLsys frontier and contribute kernels.
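The online-softmax trick at the core of FlashAttention-style kernels is easier to get right in plain Python before you port it to Triton. A sketch for one query with scalar values, processing scores in blocks and keeping only a running max, a running normalizer, and a running output; the function name and `block_size` default are hypothetical:

```python
import math

def online_softmax_weighted_sum(scores, values, block_size=2):
    """Compute sum_i softmax(scores)_i * values_i in one pass over blocks.

    The full score vector is never normalized at once; earlier partial
    results are rescaled whenever a larger maximum appears.
    """
    running_max = float("-inf")
    normalizer = 0.0
    output = 0.0
    for start in range(0, len(scores), block_size):
        s_block = scores[start:start + block_size]
        v_block = values[start:start + block_size]
        new_max = max(running_max, max(s_block))
        # Rescale everything accumulated under the old maximum.
        scale = math.exp(running_max - new_max)
        normalizer *= scale
        output *= scale
        for s, v in zip(s_block, v_block):
            w = math.exp(s - new_max)
            normalizer += w
            output += w * v
        running_max = new_max
    return output / normalizer
```

Comparing this against a naive two-pass softmax on random inputs, including very large scores, is the same correctness test you will want for the real kernel.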

Stage 6 — Commit & Summit — Pick ONE field and go to the frontier · 12-16 weeks (per chosen field, repeatable)
Stop surveying. Use the decision criteria (see "Keep curious" and "How you'll know you've actually got it" below) to choose ONE field, then spend a full season going from competent to contributor: reproduce a seminal paper end-to-end to reported numbers and land a real merged PR in an open-source ML project in that field.

Concepts, resources and problems

Concepts — Decision criteria: which field's HARD problems do you actually enjoy at 11pm on a Friday? (math-heavy=RL/diffusion; systems-heavy=MLsys/recsys; product+research=LLMs) · Depth over breadth: one field, one season (3-4 months), no field-switching · Reproduction as the unit of mastery: pick the field's seminal paper and reproduce it to reported numbers, documenting every gap · Open-source contribution as the proof of frontier-readiness · Reading the frontier: follow that field's top conference + 5 researchers + 1 newsletter religiously · Building in public: write up reproductions; teaching is the final compression of understanding

Read — Papers With Code — State-of-the-Art by task (pick your field's leaderboard) · ML Reproducibility Challenge (MLRC) · How to Read a Paper (Keshav, the three-pass method) · Your chosen field's flagship repo CONTRIBUTING.md (vLLM / diffusers / CleanRL / RecBole / nanoGPT)

Watch — Andrej Karpathy — Intro to Large Language Models (1hr talk) · NeurIPS / ICML / CVPR / RecSys / MLSys conference talk channel for your field

Problems

brutal Reproduce the seminal paper of your chosen field end-to-end and match reported numbers, with a public write-up documenting every gap between paper and reality — The single highest-signal mastery proof in this entire roadmap. The hardest parts are always the details the paper omits.
hard Land a merged, non-trivial PR in a top open-source ML project in your field (vLLM, diffusers, CleanRL, transformers, FAISS, RecBole) — Contributing to a tool thousands depend on, and defending it in review, is the line between 'studied the field' and 'works in the field.'
hard Write an original blog post that explains one frontier idea better than existing material, with runnable from-scratch code — Teaching at frontier-level is the final compression; if you can explain it clearly with code that runs, you own it.

Done when — You committed to ONE field for a full season, reproduced its seminal paper to reported numbers, and landed a merged PR in a real open-source project. You read the frontier weekly and have specific opinions. You are no longer a tourist — you are a practitioner in that field.

Projects

Tiny-but-real: train a working model in your top-2 candidate fields — Before committing, build one small end-to-end project in each of your two most appealing fields: e.g. a 10M-param GPT that generates coherent Shakespeare (LLMs) AND a from-scratch DDPM that generates recognizable MNIST digits or CIFAR-10 images (vision). Ship both, then write a paragraph on which one you actually enjoyed debugging at 11pm.
Reproduce a famous result to its reported number — In your leading candidate field, reproduce one canonical model to its published metric: GPT-2 (124M) HellaSwag/val loss, ViT ImageNet top-1, PPO MuJoCo returns matching CleanRL's curve, DLRM Criteo AUC, or a FlashAttention-style kernel speedup curve. Keep a running 'paper vs reality' gap log of every undocumented detail you had to discover.
Field-defining flagship build — The capstone for your chosen field: (LLMs) a from-scratch LM with custom BPE tokenizer, training, DPO post-training, and a quantized vLLM-served chat endpoint; (vision) a latent diffusion model you trained and can prompt, with classifier-free guidance; (RL) an agent that solves a hard environment with PPO plus your own measured improvement over the baseline; (recsys) a full retrieval->ANN->ranking pipeline serving sub-50ms under load with an offline OPE estimate of lift; (MLsys) a custom Triton kernel that beats PyTorch inside a real model and scores on KernelBench.
Contribute to the frontier — Land a merged, non-trivial PR in a flagship open-source ML project in your field (vLLM, transformers, diffusers, CleanRL, FAISS, RecBole). A real bug fix, a kernel, a missing feature, or a reproduction the maintainers fold in — not a typo fix.
Beat a public benchmark / win a leaderboard slot — Pick a live, hard leaderboard in your field and earn a respectable public ranking: KernelBench (MLsys), a Kaggle competition (recsys/vision), a D4RL/Atari benchmark (RL), or an Open LLM-style eval (LLMs). The constraint of a shared, adversarial scoreboard forces real rigor.

Going harder

Hard problem arena — 9 brutal problems

hard Jane Street Monthly Puzzles (full archive) — The reference for 'hard but solvable with cleverness + code.' Pure problem-solving that sharpens the exact muscle ML reproduction demands. Work the whole archive; many require writing a solver, not just thinking.
legendary Putnam Competition problems (math hardness calibration) — The gold standard of brutal undergraduate proof math. If you lean toward RL/diffusion (the math-heavy fields), Putnam-style problems are your true north for what 'hard' actually means.
hard Project Euler — high-numbered / sub-25%-solved problems — Math-meets-code at scale; the hard tier rewards algorithmic insight and efficient computation — exactly the recsys/MLsys mindset where the naive solution is too slow.
brutal Reproduce a seminal paper to its reported number (one per field) — The defining hard problem of THIS path. GPT-2, ViT/DDPM, DQN/PPO, DLRM, FlashAttention — pick one and MATCH the number. The hardest parts are always the details the papers omit; matching, not approximating, is the whole point.
brutal KernelBench — write GPU kernels that beat the reference and climb the leaderboard — A 2025/2026 benchmark of genuinely hard CUDA/Triton kernel-writing tasks the frontier cares about, with a public leaderboard. If MLsys is your field, this is your arena and your scoreboard.
brutal The PPO '37 implementation details' gauntlet (plus DQN's equivalent traps) — RL's legendary trap: a correct-looking implementation that silently fails or quietly underperforms. Getting every detail right until your curve matches CleanRL's is a rite of passage and a humility lesson.
hard Land a merged PR in a flagship ML library — The real-world hard problem: make a non-trivial change inside a system thousands depend on, defend it in review against maintainers, and get it merged. Where studying becomes contributing.
hard Sutton & Barto exercises (the full set, by hand) — The most Putnam-like math inside ML proper: derivations of Bellman optimality, contraction-mapping convergence proofs, TD analysis. Do them with pen and paper, no autograd to hide behind.
hard Advent of Code (hard days) + Codeforces Div 1 — For keeping the raw algorithmic + implementation-speed muscle sharp year-round. The late-December AoC days and Codeforces Div 1 problems are the competitive-programming complement to Jane Street's lateral puzzles.

Keep curious

Blogs, people, communities, rabbit holes

HOW TO CHOOSE YOUR FIELD (the core decision): Ask which field's HARD problems you'd happily debug at 11pm on a Friday. Math-heavy and you love proofs/derivations -> Reinforcement Learning or Diffusion (vision). Systems-heavy and you love latency/throughput/memory -> ML Systems/Efficiency or Recommenders. Product + research + 'I want to build the thing everyone talks about' -> LLMs/NLP. As a backend engineer who likes hard math AND systems, your two highest-fit picks are (a) ML Systems/Efficiency — kernels, quantization, distributed training, vLLM — where your backend skills transfer almost 1:1 and the field is white-hot in 2026; and (b) Recommender Systems — literally a distributed-systems problem with ML inside: sub-50ms serving, terabyte embedding tables, A/B tests, OPE. RL is the most intellectually brutal and best if you want pure hard-math suffering. Pick ONE and give it a full season.
COMMITMENT RULE: one field, one season (3-4 months minimum), no switching mid-season. Survey breadth in Stages 1-5, then go monogamous. Re-evaluate only at season boundaries. Depth compounds; field-hopping resets the clock to zero.
Blogs (read these religiously): Lilian Weng (https://lilianweng.github.io/) — the most rigorous free explainers; Sebastian Raschka 'Ahead of AI' (https://magazine.sebastianraschka.com/) — implementation-first; Eugene Yan (https://eugeneyan.com/) — applied ML/recsys at scale; Jay Alammar 'The Illustrated Transformer' (https://jalammar.github.io/) — the visual intuition; Chip Huyen (https://huyenchip.com/blog/) — ML systems design; Distill (https://distill.pub/) — the gold standard of explanation (archived but timeless); Aman Chadha's AI Journal (https://aman.ai/) — exhaustive distilled notes.
Newsletters: Sebastian Raschka's 'Ahead of AI'; The Batch by DeepLearning.AI (https://www.deeplearning.ai/the-batch/); Import AI by Jack Clark (https://importai.net/); for MLsys specifically, follow the GPU MODE Discord digest. Pick 2, not 10.
People to follow (by field): LLMs — Andrej Karpathy, Jeremy Howard, Tri Dao; Vision — Lucas Beyer (giffmana), Phil Wang (lucidrains, who reimplements every paper as clean code), Robin Rombach; RL — Sergey Levine, John Schulman, Costa Huang (CleanRL); Recsys — Eugene Yan, Jure Leskovec, Maxim Naumov (DLRM); MLsys — Tri Dao (FlashAttention), Horace He (PyTorch), the vLLM team. Follow lucidrains' GitHub (https://github.com/lucidrains) specifically — he reproduces seminal papers as clean code constantly, and reading his diffs is a free masterclass.
Communities/Discords/subreddits: GPU MODE Discord (the CUDA/MLsys reading group — https://github.com/gpu-mode); EleutherAI Discord (open LLM research, where a lot of real work happens in public); Hugging Face Discord & forums; r/MachineLearning (https://www.reddit.com/r/MachineLearning/) for paper discussion; r/LocalLLaMA for the inference/quantization frontier; Papers We Love (https://paperswelove.org/) for seminal-paper reading groups; alphaXiv (https://www.alphaxiv.org/) for paper-comment threads with the authors.
Competitions / forced reproduction under pressure: Kaggle (https://www.kaggle.com/) — competitions force a working reproduction on a deadline; AIcrowd (https://www.aicrowd.com/) — research-flavored challenges including RL; the ML Reproducibility Challenge (https://reproml.org/, now a NeurIPS track) — turn a reproduction into a citable artifact; KernelBench leaderboard (https://github.com/ScalingIntelligence/KernelBench) for MLsys; the NeurIPS competition track for frontier-aligned contests.
Conferences (watch the talks even if you don't attend): NeurIPS, ICML, ICLR (general); CVPR/ICCV (vision); ACL/EMNLP (NLP); ACM RecSys (recommenders, https://recsys.acm.org/); MLSys (systems, https://mlsys.org/). Proceedings + talk recordings are free and are where each field publicly sets its frontier annually — skim the accepted-paper list the week it drops.
Frontier rabbit holes (2026): test-time compute & reasoning models (o-series-style RL on chains of thought, GRPO); FP8/FP4 training and the race to the bottom on precision; MoE at extreme scale + single-kernel fused MoE; generative recommenders (transformer/LLM-based recsys replacing two-tower retrieval); world models and model-based RL (Dreamer-style); LLM agents and tool use; the LLM-writes-GPU-kernels loop (KernelBench, CUDA-writing agents); diffusion language models. Each is a multi-year obsession in itself.
If LLMs click -> go deeper: CS336 -> post-training (RLHF/DPO/GRPO) -> inference systems (vLLM internals, paged attention) -> kernels. If vision clicks -> CS231n -> diffusion math -> latent diffusion / video generation -> multimodal/VLMs. If RL clicks -> Sutton & Barto -> CS285 -> offline/model-based RL -> RLHF-as-RL. If recsys clicks -> two-tower -> sequence models -> generative recommenders -> the serving systems + OPE. If MLsys clicks -> Triton -> FlashAttention -> distributed training (FSDP/Megatron) -> custom inference engines. Each arrow is one season.
After this path: the natural sequels are (1) original research — find an unsolved problem in your field and publish (workshop paper -> conference); (2) build a product on your depth; (3) become the maintainer/teacher — your reproductions and write-ups become the resource others learn from (the lucidrains / Costa Huang path). Depth in one field is the launchpad, not the destination — and the reproduction + PR portfolio you built here is exactly what a frontier-lab or staff-ML-systems hiring loop wants to see.

How you'll know you've actually got it

You can take an arbitrary recent paper in your chosen field, read it once with the three-pass method, and implement its core contribution in a day or two — including the details the paper glossed over.
You reproduced at least one seminal paper to its reported number (matched, not approximated) and can explain precisely where your first three attempts failed and why.
You have a merged, non-trivial PR in a flagship open-source ML project in your field, and you understood the codebase well enough to defend the change against maintainer review.
When a model misbehaves, you debug from first principles (shapes, gradients, numerics, data, systems) instead of guessing hyperparameters — and you usually find it within a session.
You can derive the core math of your field on a whiteboard from memory: attention, the diffusion ELBO, the Bellman optimality equation, a two-tower InfoNCE loss, or a roofline analysis.
You have strong, specific opinions about your field's frontier and can name the 5 papers/people that matter most and why — and where the current approaches are weak.
You've taught it: a blog post, talk, or explainer that someone else actually used to learn the topic, with runnable code.
You can estimate compute, memory, and latency for a model on the back of an envelope before running anything, and you're usually within a factor of 2.
You read your field's flagship conference proceedings each year and skim arXiv weekly without it feeling like a chore — it feels like keeping up with friends.
You chose ONE field and stuck with it for a full season without field-hopping, and the depth visibly compounds: each new paper is easier than the last.
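The back-of-envelope estimates in the list above mostly come down to three formulas. A rough sketch, assuming a dense transformer, fp16/bf16 storage, and the common approximation of about 6 FLOPs per parameter per training token; the function names are hypothetical, and the example plugs in GPT-2 (124M) shapes (12 layers, 12 heads, head dimension 64):

```python
def param_memory_gb(n_params, bytes_per_param=2):
    """Memory for the weights alone (no gradients, optimizer state, or activations)."""
    return n_params * bytes_per_param / 1e9

def kv_cache_gb(n_layers, n_kv_heads, head_dim, seq_len, batch_size, bytes_per_element=2):
    """KV-cache size: K and V per layer, each [batch, kv_heads, seq_len, head_dim]."""
    elements = 2 * n_layers * n_kv_heads * head_dim * seq_len * batch_size
    return elements * bytes_per_element / 1e9

def training_flops(n_params, n_tokens):
    """Approximate training compute for a dense transformer: 6 * params * tokens."""
    return 6 * n_params * n_tokens

weights = param_memory_gb(124e6)
cache = kv_cache_gb(n_layers=12, n_kv_heads=12, head_dim=64, seq_len=1024, batch_size=1)
compute = training_flops(124e6, 10e9)  # 10B tokens is an arbitrary example budget
print(f"weights ~{weights:.3f} GB, KV-cache ~{cache:.3f} GB, training ~{compute:.2e} FLOPs")
```

Dividing the FLOP estimate by your hardware's realistic (not peak) throughput gives a wall-clock guess to compare against the actual run.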
