My Learning Hub

Going Deep in One ML Field

Survey the five deep ML fields seriously enough to pick ONE, then go from competent practitioner to someone who reproduces seminal papers to reported numbers and lands merged PRs at the frontier of that field.

ML breadth is a commodity now; depth is not.

The roadmap

Stage 1 — Field A — LLMs & NLP (build a language model from scratch) · 10-14 weeks
Understand transformers and modern LLMs deeply enough to implement one end-to-end: tokenizer, architecture, training loop, attention internals, and inference. This is the field most adjacent to a backend engineer who likes systems-flavored math.

Concepts, resources and problems

Concepts — Tokenization (BPE, byte-level), embeddings, positional encodings (sinusoidal, learned, RoPE, ALiBi) · Self-attention from first principles; multi-head, causal masking, KV-cache, grouped-query attention · Transformer block: residual streams, LayerNorm/RMSNorm, MLP/SwiGLU, pre-norm vs post-norm · Training: cross-entropy/next-token loss, AdamW, warmup+cosine schedule, gradient clipping, mixed precision, gradient accumulation · Scaling laws (Chinchilla-optimal compute/data tradeoff) and why they govern every training decision · Mixture-of-Experts, long-context tricks (sliding window, YaRN), state-space alternatives (Mamba) as contrast · Post-training: SFT, RLHF/DPO, reward models, instruction tuning · Inference: sampling (temperature/top-k/top-p), speculative decoding, quantization for serving, continuous batching · Evaluation: perplexity, benchmarks, contamination, and why eval is genuinely hard and easy to fool yourself on
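
Sketch: "self-attention from first principles" with causal masking can be written in a few lines of plain Python. This is a minimal single-head version for intuition, not a performant implementation; the function names and the list-of-lists layout are assumptions made for this example, and learned projections, multiple heads, and batching are left out.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def causal_attention(q, k, v):
    """Single-head causal attention.

    q, k, v: lists of T vectors (each a list of floats).
    Position t attends only to positions 0..t.
    Returns (outputs, weights), where weights is a T x T matrix.
    """
    d = len(q[0])
    scale = 1.0 / math.sqrt(d)
    T = len(q)
    outputs, weights = [], []
    for t in range(T):
        # Scoring against keys 0..t only is the causal mask.
        scores = [scale * sum(a * b for a, b in zip(q[t], k[s]))
                  for s in range(t + 1)]
        w = softmax(scores)
        out = [sum(w[s] * v[s][j] for s in range(t + 1))
               for j in range(len(v[0]))]
        # Future positions get zero weight so the matrix is square.
        weights.append(w + [0.0] * (T - t - 1))
        outputs.append(out)
    return outputs, weights

# Toy inputs (hypothetical values): 3 positions, dimension 2.
q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
outs, attn = causal_attention(q, k, v)
```

The first position can only see itself, so its output is exactly its own value vector. The keys and values for positions 0..t that this loop re-reads at every step are what a KV-cache stores during incremental decoding.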

Read — Stanford CS336: Language Modeling from Scratch (Spring 2025) — site + assignments · The Annotated Transformer (Harvard NLP) · Attention Is All You Need (Vaswani et al., 2017) · Lilian Weng — The Transformer Family v2 · Sebastian Raschka — Build a Large Language Model (From Scratch) repo

Watch — Andrej Karpathy — Let's build GPT: from scratch, in code, spelled out · Stanford CS336 — Language Modeling from Scratch (full 2025 lecture playlist) · Karpathy — Let's reproduce GPT-2 (124M)

Problems

Done when — You have a from-scratch GPT-2-class model: your own BPE tokenizer, transformer, training loop, and an inference path with KV-cache — and you reproduced a published eval number on FineWeb-Edu. You can read any LLM paper and implement its core idea in a day.

Stage 2 — Field B — Computer Vision (from CNNs to diffusion & ViTs) · 10-14 weeks
Master the visual stack: convnets, vision transformers, self-supervised learning, and generative models (GANs/diffusion). Strong math, strong systems, and the most visually rewarding feedback loop.

Concepts, resources and problems

Concepts — Convolutions, pooling, receptive fields, ResNets and skip connections · Batch/Layer/Group norm; data augmentation as regularization (RandAugment, MixUp, CutMix) · Vision Transformers (ViT), patch embeddings, attention over image patches, why ViTs need data or strong augmentation · Object detection (R-CNN family, YOLO, DETR) and segmentation (U-Net, Mask R-CNN, SAM) · Self-supervised learning: contrastive (SimCLR, MoCo), DINO/DINOv2, masked autoencoders · Generative models: VAEs, GANs, and diffusion (DDPM, score-based / SDE, latent diffusion) · CLIP and multimodal/vision-language alignment · Classifier-free guidance, sampling schedules (DDIM, ancestral), and why diffusion training is stable but inference is an art
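
Sketch: the DDPM forward (noising) process has a closed form, so a training example at any timestep can be drawn in one step. The block below assumes the linear beta schedule from the DDPM paper (1e-4 to 0.02) and treats an image as a flat list of floats; the function names are made up for this example.

```python
import math
import random

def linear_beta_schedule(T, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise variances, one per timestep."""
    return [beta_start + (beta_end - beta_start) * t / (T - 1)
            for t in range(T)]

def alpha_bars(betas):
    """Cumulative products abar_t = prod over s <= t of (1 - beta_s)."""
    out, prod = [], 1.0
    for b in betas:
        prod *= 1.0 - b
        out.append(prod)
    return out

def q_sample(x0, t, abar, rng):
    """Draw x_t ~ N(sqrt(abar_t) * x0, (1 - abar_t) * I) directly.

    t is a zero-based index into abar. Returns (x_t, eps), where eps is
    the noise a denoising network would be trained to predict.
    """
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    a = math.sqrt(abar[t])
    s = math.sqrt(1.0 - abar[t])
    return [a * x + s * e for x, e in zip(x0, eps)], eps

betas = linear_beta_schedule(1000)
abar = alpha_bars(betas)
rng = random.Random(0)           # fixed seed so the run is repeatable
x0 = [0.5, -0.25, 1.0, 0.0]      # stand-in for normalized pixel values
x_mid, eps_mid = q_sample(x0, 500, abar, rng)
```

With this schedule abar shrinks from just under 1 at the first step to nearly 0 at the last, which is why the final x_t is close to pure Gaussian noise and sampling can start from noise.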

Read — Stanford CS231n: Deep Learning for Computer Vision — notes · CS231n 2025 course site & assignments · Lilian Weng — What are Diffusion Models? · An Image is Worth 16x16 Words (ViT, Dosovitskiy et al.) · Denoising Diffusion Probabilistic Models (DDPM, Ho et al.) · Hugging Face Diffusion Models Course

Watch — Stanford CS231n — Deep Learning for Computer Vision 2025 (playlist) · Outlier — Diffusion Models / DDPM math walkthrough · Yannic Kilcher — Vision Transformer (ViT) paper explained

Problems

Done when — You can implement and train a CNN, a ViT, and a diffusion model from scratch, and you've reproduced one seminal vision paper to its reported numbers. You understand the generative stack well enough to debug sampling artifacts from first principles.

Stage 3 — Field C — Reinforcement Learning (the hardest math, the deepest rabbit hole) · 12-16 weeks
Go from MDPs and dynamic programming to deep RL (DQN, policy gradients, PPO) and into the frontier (RLHF, offline RL, world models, agents). This is the most mathematically demanding field and the one most prone to brutal silent-failure debugging — peak Jane-Street energy.

Concepts, resources and problems

Concepts — Markov Decision Processes, Bellman equations, value/policy iteration, dynamic programming, contraction-mapping convergence proofs · Model-free prediction & control: Monte Carlo, TD(λ), Q-learning, SARSA · Function approximation and the deadly triad (bootstrapping + off-policy + approximation) and why it diverges · Deep value methods: DQN, double/dueling DQN, prioritized replay, target networks · Policy gradients: REINFORCE, actor-critic, A2C/A3C, GAE, the policy-gradient theorem derived · Trust regions: TRPO, PPO (the workhorse), and why clipping/KL control is everything · Exploration: epsilon-greedy, entropy bonus, intrinsic motivation (RND), UCB · Frontier: offline RL (CQL/IQL), model-based RL / world models (Dreamer), RLHF & DPO/GRPO for LLMs, multi-agent, LLM agents
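
Sketch: value iteration is the Bellman optimality backup applied until the values stop changing. The 4-state chain below is a hypothetical toy MDP invented for this example (deterministic moves, reward 1 for reaching the goal), small enough to check by hand.

```python
def value_iteration(n_states, actions, step, gamma=0.9, tol=1e-10):
    """Repeat V(s) <- max_a [r + gamma * V(s')] until the largest change < tol.

    step(s, a) -> (next_state, reward, done) for a deterministic MDP.
    Values are updated in place within each sweep.
    """
    V = [0.0] * n_states
    while True:
        delta = 0.0
        for s in range(n_states):
            best = max(
                r + (0.0 if done else gamma * V[s2])
                for s2, r, done in (step(s, a) for a in actions)
            )
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            return V

def chain_step(s, a):
    """Toy chain with states 0..3, where state 3 is the goal.

    Action +1 moves right, -1 moves left (clamped at 0).
    Reaching the goal pays reward 1 and ends the episode.
    """
    if s == 3:
        return 3, 0.0, True  # absorbing terminal state
    s2 = max(0, s + a)
    return s2, (1.0 if s2 == 3 else 0.0), s2 == 3

V = value_iteration(4, actions=(-1, 1), step=chain_step, gamma=0.9)
```

Each step away from the goal discounts the value by one more factor of gamma, so the states one, two, and three moves from the goal converge to 1, gamma, and gamma squared. Because gamma is below 1 the backup is a contraction, which is what guarantees the loop terminates.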

Read — Sutton & Barto — Reinforcement Learning: An Introduction (free PDF, 2nd ed.) · OpenAI Spinning Up in Deep RL · Hugging Face Deep RL Course · CleanRL — single-file RL implementations · Proximal Policy Optimization (Schulman et al.) · Playing Atari with Deep RL (DQN, Mnih et al.)

Watch — DeepMind x UCL — Reinforcement Learning Lecture Series 2021 (Hado van Hasselt et al.) · Berkeley CS285 — Deep Reinforcement Learning (Sergey Levine, Fall 2023) · Costa Huang — The 37 Implementation Details of PPO

Problems

Done when — You've solved the core Sutton & Barto exercises by hand and reproduced DQN and PPO to reported scores (matching benchmark curves, not vibes), having survived the deadly-triad debugging. You can read any deep-RL paper and spot the implementation details that will silently bite you.

Stage 4 — Field D — Recommender Systems (where ML meets large-scale backend, your unfair advantage) · 8-12 weeks
Master industrial-scale recommendation: retrieval + ranking, embeddings, two-tower models, sequence models, and the systems that serve them at low latency. The most natural fit for a backend engineer — it IS a distributed systems problem with ML inside.

Concepts, resources and problems

Concepts — Problem framing: retrieval (candidate generation) vs ranking vs re-ranking funnel · Collaborative filtering, matrix factorization, implicit feedback (ALS, BPR) · Embeddings + approximate nearest neighbor search (FAISS, ScaNN, HNSW) and the recall/latency frontier · Two-tower / dual-encoder models for retrieval; negative sampling strategies (in-batch, hard negatives, mixed) · Feature engineering & wide-and-deep / DCN / DLRM for ranking; feature crosses · Sequence-based recommenders (GRU4Rec, SASRec, BERT4Rec) and the shift to generative/transformer recommenders · Calibration, position bias, and counterfactual/off-policy evaluation (IPS) · Serving: feature stores, real-time inference, latency budgets, A/B testing, sharded embedding tables — your backend wheelhouse
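
Sketch: in-batch negative sampling for a two-tower model reduces to a softmax cross-entropy over the user-by-item similarity matrix, where row i's positive is item i and every other item in the batch acts as a negative. The version below is a plain-Python illustration under that assumption; the function names and the temperature value are choices made for this example, and real systems would compute this on tensors.

```python
import math

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def in_batch_softmax_loss(user_vecs, item_vecs, temperature=0.1):
    """Mean InfoNCE-style loss with in-batch negatives.

    user_vecs[i] and item_vecs[i] form the positive pair for row i.
    """
    n = len(user_vecs)
    total = 0.0
    for i in range(n):
        logits = [dot(user_vecs[i], item_vecs[j]) / temperature
                  for j in range(n)]
        m = max(logits)  # subtract the max for a stable log-sum-exp
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_z - logits[i]  # equals -log softmax(logits)[i]
    return total / n

users = [[1.0, 0.0], [0.0, 1.0]]
items_aligned = [[1.0, 0.0], [0.0, 1.0]]   # each user matches its own item
items_swapped = [[0.0, 1.0], [1.0, 0.0]]   # each user matches the wrong item
loss_good = in_batch_softmax_loss(users, items_aligned)
loss_bad = in_batch_softmax_loss(users, items_swapped)
```

The aligned batch gives a loss near zero and the swapped batch a large one, which is the behavior the retrieval tower is trained toward. With uninformative (all-zero) embeddings the loss is log of the batch size, a useful sanity check at initialization.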

Read — Eugene Yan — Improving Recommendation Systems & Search in the Age of LLMs · Awesome Deep Learning Papers for Search/Recommendation/Advertising · Deep Neural Networks for YouTube Recommendations (Covington et al.) · Meta DLRM (Deep Learning Recommendation Model) paper · Google Recommendation Systems crash course (ML Education)

Watch — Stanford CS224W — Machine Learning with Graphs (Jure Leskovec) · ACM RecSys conference talks (official channel) · Two-Tower / dual-encoder retrieval explained

Problems

Done when — You've built a full retrieval+ranking pipeline (two-tower retrieval -> ANN serving -> DLRM-style ranking) on a real dataset, reproduced an industrial paper to its reported metrics, and can reason quantitatively about the latency/recall/throughput tradeoffs end to end.

Stage 5 — Field E — ML Systems / MLOps / Efficiency (the engineer's home field: make it fast) · 12-16 weeks
Own the layer that makes ML actually run: GPU kernels, distributed training, quantization, inference optimization, and the serving stack. This is where a backend engineer who loves systems has the most asymmetric advantage, and the field is on fire in 2026.

Concepts, resources and problems

Concepts — GPU architecture: memory hierarchy (HBM/SRAM/registers), warps, occupancy, memory- vs compute-bound kernels · CUDA / Triton kernel programming; fusion, tiling, and the roofline model · FlashAttention and IO-aware algorithm design (why memory movement, not FLOPs, is the bottleneck) · Mixed precision (fp16/bf16/fp8), numerical stability, loss scaling · Quantization (GPTQ, AWQ, int8/int4, fp8) and pruning/distillation for inference · Distributed training: data/tensor/pipeline/sequence parallelism, ZeRO, FSDP · Inference systems: KV-cache management, paged attention (vLLM), continuous batching, speculative decoding · MLOps: experiment tracking, reproducibility, data/feature pipelines, model serving, observability
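
Sketch: the roofline model says attainable throughput is the smaller of the compute roof and memory bandwidth times arithmetic intensity (FLOPs per byte moved). The hardware numbers below are round, hypothetical values chosen for illustration, not the specs of any real GPU, and the intensity formula assumes each matrix crosses the memory bus exactly once.

```python
def matmul_intensity(m, n, k, bytes_per_elem=2):
    """FLOPs per byte for C[m,n] = A[m,k] @ B[k,n].

    Idealized: A and B are each read once and C is written once.
    """
    flops = 2 * m * n * k
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)
    return flops / bytes_moved

def roofline(intensity, peak_flops, mem_bandwidth):
    """Attainable FLOP/s = min(compute roof, bandwidth * intensity)."""
    return min(peak_flops, mem_bandwidth * intensity)

# Hypothetical accelerator (illustrative round numbers only).
PEAK = 300e12        # FLOP/s
BANDWIDTH = 2e12     # bytes/s
RIDGE = PEAK / BANDWIDTH  # intensity where the two roofs meet

big_gemm = matmul_intensity(4096, 4096, 4096)   # square matrix multiply
gemv = matmul_intensity(1, 4096, 4096)          # one row times a matrix

gemm_rate = roofline(big_gemm, PEAK, BANDWIDTH)
gemv_rate = roofline(gemv, PEAK, BANDWIDTH)
```

Under these assumptions the large square matmul sits far to the right of the ridge point and is compute-bound, while the matrix-vector product has an intensity of about 1 FLOP per byte and is limited by bandwidth. Single-token decoding looks like the second case, which is one way to see why memory movement rather than FLOPs is the bottleneck there.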

Read — CS336 systems lectures: GPUs, kernels & Triton, parallelism · FlashAttention paper (Dao et al.) · GPU MODE (formerly CUDA MODE) — lectures & community · Karpathy llm.c — LLM training in raw C/CUDA · Chip Huyen — Designing Machine Learning Systems / blog

Watch — GPU MODE lecture series (YouTube) · CUDA Programming Course — High-Performance Computing with GPUs (freeCodeCamp, full) · Stanford CS336 — kernels/Triton & parallelism lectures

Problems

Done when — You can write and profile custom Triton/CUDA kernels, you have reproduced a FlashAttention-style kernel whose measured memory traffic and speedup are consistent with the paper's IO analysis, you have a scored KernelBench submission, and you can take a model from fp32 to a quantized, vLLM-served endpoint with measured throughput gains. You can read the MLsys frontier and contribute kernels.
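
Sketch: the simplest step on the fp32-to-quantized path is symmetric per-tensor int8 quantization with round-to-nearest. This is a baseline for intuition only; it is not GPTQ or AWQ, which choose the quantized values more carefully, and the function names are made up for this example.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using one shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate floats from the integer codes."""
    return [qi * scale for qi in q]

weights = [0.5, -1.27, 0.013, 0.0]   # toy values, not from a real model
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_err = max(abs(w - r) for w, r in zip(weights, restored))
```

The largest-magnitude weight maps to plus or minus 127, and the round-trip error of every element is bounded by half the scale. A single outlier inflates the scale for the whole tensor, which is one reason finer-grained (per-channel or per-group) scales are common in practice.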

Stage 6 — Commit & Summit — Pick ONE field and go to the frontier · 12-16 weeks (per chosen field, repeatable)
Stop surveying. Use the decision criteria (see "How to choose your field" under Keep curious, and the mastery signals under "How you'll know you've actually got it") to choose ONE field, then spend a full season going from competent to contributor: reproduce a seminal paper end-to-end to reported numbers and land a real merged PR in an open-source ML project in that field.

Concepts, resources and problems

Concepts — Decision criteria: which field's HARD problems do you actually enjoy at 11pm on a Friday? (math-heavy=RL/diffusion; systems-heavy=MLsys/recsys; product+research=LLMs) · Depth over breadth: one field, one season (3-4 months), no field-switching · Reproduction as the unit of mastery: pick the field's seminal paper and reproduce it to reported numbers, documenting every gap · Open-source contribution as the proof of frontier-readiness · Reading the frontier: follow that field's top conference + 5 researchers + 1 newsletter religiously · Building in public: write up reproductions; teaching is the final compression of understanding

Read — Papers With Code — State-of-the-Art by task (pick your field's leaderboard) · ML Reproducibility Challenge (MLRC) · How to Read a Paper (Keshav, the three-pass method) · Your chosen field's flagship repo CONTRIBUTING.md (vLLM / diffusers / CleanRL / RecBole / nanoGPT)

Watch — Andrej Karpathy — Intro to Large Language Models (1hr talk) · NeurIPS / ICML / CVPR / RecSys / MLSys conference talk channel for your field

Problems

Done when — You committed to ONE field for a full season, reproduced its seminal paper to reported numbers, and landed a merged PR in a real open-source project. You read the frontier weekly and have specific opinions. You are no longer a tourist — you are a practitioner in that field.

Projects

  • Tiny-but-real: train a working model in your top-2 candidate fields — Before committing, build one small end-to-end project in each of your two most appealing fields: e.g. a 10M-param GPT that generates coherent Shakespeare (LLMs) AND a from-scratch DDPM that generates recognizable MNIST digits or CIFAR-10 images (vision). Ship both, then write a paragraph on which one you actually enjoyed debugging at 11pm.
  • Reproduce a famous result to its reported number — In your leading candidate field, reproduce one canonical model to its published metric: GPT-2 (124M) HellaSwag/val loss, ViT ImageNet top-1, PPO MuJoCo returns matching CleanRL's curve, DLRM Criteo AUC, or a FlashAttention-style kernel speedup curve. Keep a running 'paper vs reality' gap log of every undocumented detail you had to discover.
  • Field-defining flagship build — The capstone for your chosen field: (LLMs) a from-scratch LM with custom BPE tokenizer, training, DPO post-training, and a quantized vLLM-served chat endpoint; (vision) a latent diffusion model you trained and can prompt, with classifier-free guidance; (RL) an agent that solves a hard environment with PPO plus your own measured improvement over the baseline; (recsys) a full retrieval->ANN->ranking pipeline serving sub-50ms under load with an offline OPE estimate of lift; (MLsys) a custom Triton kernel that beats PyTorch inside a real model and scores on KernelBench.
  • Contribute to the frontier — Land a merged, non-trivial PR in a flagship open-source ML project in your field (vLLM, transformers, diffusers, CleanRL, FAISS, RecBole). A real bug fix, a kernel, a missing feature, or a reproduction the maintainers fold in — not a typo fix.
  • Beat a public benchmark / win a leaderboard slot — Pick a live, hard leaderboard in your field and earn a respectable public ranking: KernelBench (MLsys), a Kaggle competition (recsys/vision), a D4RL/Atari benchmark (RL), or an Open LLM-style eval (LLMs). The constraint of a shared, adversarial scoreboard forces real rigor.

Going harder

Hard problem arena — 9 brutal problems

Keep curious

Blogs, people, communities, rabbit holes
  • HOW TO CHOOSE YOUR FIELD (the core decision): Ask which field's HARD problems you'd happily debug at 11pm on a Friday. Math-heavy and you love proofs/derivations -> Reinforcement Learning or Diffusion (vision). Systems-heavy and you love latency/throughput/memory -> ML Systems/Efficiency or Recommenders. Product + research + 'I want to build the thing everyone talks about' -> LLMs/NLP. As a backend engineer who likes hard math AND systems, your two highest-fit picks are (a) ML Systems/Efficiency — kernels, quantization, distributed training, vLLM — where your backend skills transfer almost 1:1 and the field is white-hot in 2026; and (b) Recommender Systems — literally a distributed-systems problem with ML inside: sub-50ms serving, terabyte embedding tables, A/B tests, OPE. RL is the most intellectually brutal and best if you want pure hard-math suffering. Pick ONE and give it a full season.
  • COMMITMENT RULE: one field, one season (3-4 months minimum), no switching mid-season. Survey breadth in Stages 1-5, then go monogamous. Re-evaluate only at season boundaries. Depth compounds; field-hopping resets the clock to zero.
  • Blogs (read these religiously): Lilian Weng (https://lilianweng.github.io/) — the most rigorous free explainers; Sebastian Raschka 'Ahead of AI' (https://magazine.sebastianraschka.com/) — implementation-first; Eugene Yan (https://eugeneyan.com/) — applied ML/recsys at scale; Jay Alammar 'The Illustrated Transformer' (https://jalammar.github.io/) — the visual intuition; Chip Huyen (https://huyenchip.com/blog/) — ML systems design; Distill (https://distill.pub/) — the gold standard of explanation (archived but timeless); Aman Chadha's AI Journal (https://aman.ai/) — exhaustive distilled notes.
  • Newsletters: Sebastian Raschka's 'Ahead of AI'; The Batch by DeepLearning.AI (https://www.deeplearning.ai/the-batch/); Import AI by Jack Clark (https://importai.net/); for MLsys specifically, follow the GPU MODE Discord digest. Pick 2, not 10.
  • People to follow (by field): LLMs — Andrej Karpathy, Jeremy Howard, Tri Dao; Vision — Lucas Beyer (giffmana), Phil Wang (lucidrains, who reimplements every paper as clean code), Robin Rombach; RL — Sergey Levine, John Schulman, Costa Huang (CleanRL); Recsys — Eugene Yan, Jure Leskovec, Maxim Naumov (DLRM); MLsys — Tri Dao (FlashAttention), Horace He (PyTorch), the vLLM team. Follow lucidrains' GitHub (https://github.com/lucidrains) specifically — he reproduces seminal papers as clean code constantly, and reading his diffs is a free masterclass.
  • Communities/Discords/subreddits: GPU MODE Discord (the CUDA/MLsys reading group — https://github.com/gpu-mode); EleutherAI Discord (open LLM research, where a lot of real work happens in public); Hugging Face Discord & forums; r/MachineLearning (https://www.reddit.com/r/MachineLearning/) for paper discussion; r/LocalLLaMA for the inference/quantization frontier; Papers We Love (https://paperswelove.org/) for seminal-paper reading groups; alphaXiv (https://www.alphaxiv.org/) for paper-comment threads with the authors.
  • Competitions (forced reproduction under pressure): Kaggle (https://www.kaggle.com/) — competitions force you to reproduce and then beat strong baselines against a deadline; AIcrowd (https://www.aicrowd.com/) — research-flavored challenges including RL; the ML Reproducibility Challenge (https://reproml.org/, now a NeurIPS track) — turn a reproduction into a citable artifact; KernelBench leaderboard (https://github.com/ScalingIntelligence/KernelBench) for MLsys; the NeurIPS competition track for frontier-aligned contests.
  • Conferences (watch the talks even if you don't attend): NeurIPS, ICML, ICLR (general); CVPR/ICCV (vision); ACL/EMNLP (NLP); ACM RecSys (recommenders, https://recsys.acm.org/); MLSys (systems, https://mlsys.org/). Proceedings + talk recordings are free and are where each field publicly sets its frontier annually — skim the accepted-paper list the week it drops.
  • Frontier rabbit holes (2026): test-time compute & reasoning models (o-series-style RL on chains of thought, GRPO); FP8/FP4 training and the race to the bottom on precision; MoE at extreme scale + single-kernel fused MoE; generative recommenders (transformer/LLM-based recsys replacing two-tower retrieval); world models and model-based RL (Dreamer-style); LLM agents and tool use; the LLM-writes-GPU-kernels loop (KernelBench, CUDA-writing agents); diffusion language models. Each is a multi-year obsession in itself.
  • If LLMs click -> go deeper: CS336 -> post-training (RLHF/DPO/GRPO) -> inference systems (vLLM internals, paged attention) -> kernels. If vision clicks -> CS231n -> diffusion math -> latent diffusion / video generation -> multimodal/VLMs. If RL clicks -> Sutton&Barto -> CS285 -> offline/model-based RL -> RLHF-as-RL. If recsys clicks -> two-tower -> sequence models -> generative recommenders -> the serving systems + OPE. If MLsys clicks -> Triton -> FlashAttention -> distributed training (FSDP/Megatron) -> custom inference engines. Each arrow is one season.
  • After this path: the natural sequels are (1) original research — find an unsolved problem in your field and publish (workshop paper -> conference); (2) build a product on your depth; (3) become the maintainer/teacher — your reproductions and write-ups become the resource others learn from (the lucidrains / Costa Huang path). Depth in one field is the launchpad, not the destination — and the reproduction + PR portfolio you built here is exactly what a frontier-lab or staff-ML-systems hiring loop wants to see.

How you'll know you've actually got it
  • You can take an arbitrary recent paper in your chosen field, read it with the three-pass method, and implement its core contribution in a day or two — including the details the paper glossed over.
  • You reproduced at least one seminal paper to its reported number (matched, not approximated) and can explain precisely where your first three attempts failed and why.
  • You have a merged, non-trivial PR in a flagship open-source ML project in your field, and you understood the codebase well enough to defend the change against maintainer review.
  • When a model misbehaves, you debug from first principles (shapes, gradients, numerics, data, systems) instead of guessing hyperparameters — and you usually find it within a session.
  • You can derive the core math of your field on a whiteboard from memory: attention, the diffusion ELBO, the Bellman optimality equation, a two-tower InfoNCE loss, or a roofline analysis.
  • You have strong, specific opinions about your field's frontier and can name the 5 papers/people that matter most and why — and where the current approaches are weak.
  • You've taught it: a blog post, talk, or explainer that someone else actually used to learn the topic, with runnable code.
  • You can estimate compute, memory, and latency for a model on the back of an envelope before running anything, and you're usually within a factor of 2.
  • You read your field's flagship conference proceedings each year and skim arXiv weekly without it feeling like a chore — it feels like keeping up with friends.
  • You chose ONE field and stuck with it for a full season without field-hopping, and the depth visibly compounds: each new paper is easier than the last.
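
Sketch: the back-of-the-envelope estimates mentioned above mostly come down to three multiplications. The block uses the common approximation of about 6 FLOPs per parameter per training token for dense transformers and plugs in the GPT-2 (124M) shape (12 layers, 12 heads, head dimension 64, context 1024); the 10B-token budget and the function names are assumptions made for this example.

```python
def train_flops(n_params, n_tokens):
    """Rule of thumb for dense transformers: ~6 FLOPs per parameter per token."""
    return 6 * n_params * n_tokens

def weight_bytes(n_params, bytes_per_param=2):
    """Memory for the weights alone (2 bytes per parameter for fp16/bf16)."""
    return n_params * bytes_per_param

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len,
                   batch=1, bytes_per_elem=2):
    """Keys plus values (the factor 2) for every layer, head, and position."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_elem

# GPT-2 (124M) shape; the token budget is a hypothetical choice.
flops = train_flops(124_000_000, 10_000_000_000)
weights = weight_bytes(124_000_000)
kv = kv_cache_bytes(n_layers=12, n_kv_heads=12, head_dim=64, seq_len=1024)
kv_mib = kv / 2**20
```

For these inputs the estimate is roughly 7.4e18 training FLOPs, about 248 MB of fp16 weights, and 36 MiB of KV-cache per full-context sequence. Dividing the FLOPs by a realistic sustained throughput for your hardware gives a training-time guess before anything is launched.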
