Machine Learning (core)
Become a from-scratch-fluent ML practitioner: implement the classical algorithms, backprop, and a transformer by hand, compete on real data on Kaggle, and reproduce a published paper from its equations, understanding every line rather than just calling .fit().
Most "ML engineers" are API operators who panic the moment a model misbehaves, because they never built the machinery.
The roadmap
Stage 1 — Classical ML from scratch, then with scikit-learn · 8-12 weeks
Implement the core supervised and unsupervised algorithms in pure NumPy until you can derive their loss functions and update rules from memory, then re-do them in scikit-learn and learn to trust the bias-variance / regularization story in your bones.
Concepts, resources and problems
Concepts — Linear regression: the normal equation, gradient descent, MSE loss derived from maximum likelihood under Gaussian noise · Logistic regression: sigmoid, cross-entropy loss, the log-odds interpretation, why MSE is a poor loss here (non-convex once composed with the sigmoid) · Regularization: L1 (Lasso) vs L2 (Ridge), what they geometrically do to weights, elastic net · The bias-variance tradeoff made concrete: learning curves, underfitting vs overfitting, why more data sometimes doesn't help · k-Nearest Neighbors and the curse of dimensionality · Decision trees: information gain / Gini impurity, how a split is chosen, why trees overfit · Ensembles: bagging, random forests, gradient boosting (the intuition behind XGBoost, and why boosting fits residuals) · Support Vector Machines: the margin, the kernel trick, soft margins, the dual formulation · Unsupervised: k-Means (and why it is a hard-assignment special case of EM), PCA from the covariance matrix / SVD · Cross-validation, train/val/test discipline, data leakage, and proper evaluation metrics (precision/recall/F1, ROC-AUC, confusion matrices)
Read — An Introduction to Statistical Learning (ISLP, Python edition) — free PDF · The Hundred-Page Machine Learning Book · The Elements of Statistical Learning (ESL) — free PDF from the authors · scikit-learn User Guide
Watch — 3Blue1Brown — Essence of Linear Algebra · StatQuest with Josh Starmer — Machine Learning playlist · Andrew Ng — Supervised Machine Learning: Regression and Classification (Course 1, ML Specialization)
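The first concept above pairs the normal equation with gradient descent. A minimal sketch of that pairing, assuming synthetic data and hypothetical names (`w_closed`, `w_gd`, `max_gap`) chosen only for illustration, shows the two routes landing on the same weights:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 3
# Design matrix with a leading column of ones for the intercept.
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, d))])
true_w = np.array([0.5, 2.0, -1.0, 3.0])   # made-up ground truth
y = X @ true_w + 0.1 * rng.normal(size=n)

# Closed form: solve (X^T X) w = X^T y instead of forming an inverse.
w_closed = np.linalg.solve(X.T @ X, X.T @ y)

# Batch gradient descent on MSE = mean((Xw - y)^2).
# Gradient: (2 / n) * X^T (Xw - y).
w_gd = np.zeros(d + 1)
lr = 0.1
for _ in range(5000):
    grad = (2.0 / n) * X.T @ (X @ w_gd - y)
    w_gd -= lr * grad

max_gap = float(np.max(np.abs(w_closed - w_gd)))
print("largest coefficient gap:", max_gap)
```

The step size and iteration count are assumptions tuned for this well-conditioned toy problem; ill-conditioned data would need feature scaling or a smaller step.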
Problems
- [medium] Implement linear + logistic regression from scratch in NumPy (closed-form for linear regression, gradient descent for both), then match sklearn to 6 decimal places, including the regularization path for Ridge/Lasso — The foundational from-scratch build. The hard part isn't the forward pass — it's deriving the gradients by hand and getting the regularization term exactly right (including coordinate descent for Lasso's L1 kink) so your numbers match sklearn.
- [hard] Build a decision tree from scratch (recursive splitting on information gain), then a random forest by bagging your own trees, then a gradient-boosted tree ensemble (your own mini-XGBoost that fits residuals) — Recursive tree induction with correct stopping criteria and impurity math is a genuine engineering challenge. Implementing boosting on top — fitting each tree to the gradient of the loss — is where you truly understand why XGBoost dominates tabular data.
- [hard] Implement k-Means and PCA from scratch (PCA via BOTH the covariance eigendecomposition AND the SVD, proving they agree), then reproduce the eigenfaces demo on a face dataset — Connecting linear algebra to ML two independent ways forces real understanding. Eigenfaces is the 'whoa' moment where the math produces something visual and real.
- [brutal] Implement a soft-margin SVM from scratch — solve the dual with a simple SMO/coordinate-ascent loop and add an RBF kernel — then match sklearn's SVC decision boundary — The Jane-Street-energy classical problem: deriving and coding the dual optimization with the kernel trick and KKT conditions is the hardest from-scratch build in classical ML. Most people never do it — that's exactly why you should.
- [medium] Kaggle: Titanic — get a clean, leak-free cross-validated submission, then climb the leaderboard with feature engineering — Your first real-data fight. The hard problem isn't the model — it's not fooling yourself with leakage, and squeezing signal out of features by hand.
Where to find more: Kaggle's 'Getting Started' competitions tab, then the monthly 'Playground Series'.
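For the PCA problem above, which asks for both the covariance eigendecomposition and the SVD, the agreement check might look like the sketch below. The random data and variable names are assumptions for illustration; the principal axes can only be expected to match up to sign.

```python
import numpy as np

rng = np.random.default_rng(0)
# Correlated synthetic features (illustrative data only).
X = rng.normal(size=(100, 5)) @ rng.normal(size=(5, 5))
Xc = X - X.mean(axis=0)          # PCA needs centered data
n = Xc.shape[0]

# Route 1: eigendecomposition of the sample covariance matrix.
cov = Xc.T @ Xc / (n - 1)
eigvals, eigvecs = np.linalg.eigh(cov)        # eigh returns ascending order
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Route 2: SVD of the centered data; rows of Vt are the principal axes.
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
svd_vars = S**2 / (n - 1)                     # singular values -> variances

# Each pair of axes should be parallel: |cosine| close to 1.
cosines = np.abs(np.sum(eigvecs * Vt.T, axis=0))
print(np.allclose(eigvals, svd_vars), cosines)
```

The link between the two routes is that the squared singular values of the centered data, divided by n - 1, are the eigenvalues of the covariance matrix.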
Done when — You have a personal GitHub repo of from-scratch implementations (linreg, logreg, tree, random forest, gradient boosting, k-means, PCA, and a kernel SVM) whose outputs match scikit-learn to several decimals, plus a cross-validated Titanic submission on the public leaderboard. You can explain, on a whiteboard, why L2 shrinks weights and L1 zeroes them, and why boosting fits residuals.
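The whiteboard claim that L2 shrinks weights while L1 zeroes them can be checked in one dimension. The sketch below assumes the simplified objectives named in the docstrings (a single weight with unregularized optimum `w_ols`); the function names are hypothetical.

```python
def ridge_1d(w_ols, lam):
    """Minimizer of 0.5*(w - w_ols)**2 + 0.5*lam*w**2: uniform shrinkage."""
    return w_ols / (1.0 + lam)


def lasso_1d(w_ols, lam):
    """Minimizer of 0.5*(w - w_ols)**2 + lam*abs(w): soft-thresholding."""
    if w_ols > lam:
        return w_ols - lam
    if w_ols < -lam:
        return w_ols + lam
    return 0.0   # anything within [-lam, lam] is set exactly to zero


weights = [3.0, 0.4, -0.2, -1.5]   # illustrative unregularized weights
lam = 0.5
ridge = [ridge_1d(w, lam) for w in weights]
lasso = [lasso_1d(w, lam) for w in weights]
print("ridge:", ridge)
print("lasso:", lasso)
```

Ridge scales every weight by the same factor and never reaches zero, while lasso subtracts a fixed amount and clips small weights to exactly zero, which is the sparsity effect.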
Stage 2 — Neural network fundamentals — backprop from scratch · 6-10 weeks
Build automatic differentiation and a multi-layer perceptron from absolute zero (a micrograd-style engine), then understand training dynamics deeply enough to debug a training run that's going wrong.
Concepts, resources and problems
Concepts — The neuron as a composition of differentiable operations; the computational graph · Backpropagation = the chain rule applied recursively over the graph; reverse-mode autodiff · Building an autograd engine: forward pass stores a graph, backward pass accumulates gradients (and why you must zero them) · MLPs: layers, activation functions (ReLU, tanh, sigmoid, GELU), the universal approximation intuition · Loss functions: MSE, cross-entropy, and the softmax that feeds it (and the numerically-stable log-sum-exp trick) · Optimization: SGD, mini-batches, momentum, Adam, learning rate, and why the learning rate is the most important hyperparameter · Training dynamics: vanishing/exploding gradients, weight initialization (Xavier/He), the importance of normalizing inputs · Regularization for nets: dropout, weight decay, early stopping, data augmentation · Batch normalization: what it does to the loss landscape and why it helps (and what it does to gradient statistics) · Diagnosing training: reading loss curves, the train/val gap, overfitting vs an actual bug
Read — Michael Nielsen — Neural Networks and Deep Learning (free online book) · Karpathy — 'A recipe for training neural networks' blog post · CS231n course notes — Optimization, Backprop, Neural Nets modules
Watch — Andrej Karpathy — The spelled-out intro to neural networks and backpropagation: building micrograd · Andrej Karpathy — Neural Networks: Zero to Hero (full playlist) · 3Blue1Brown — Neural Networks series (esp. 'Backpropagation calculus')
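The concepts above mention the numerically stable log-sum-exp trick behind softmax + cross-entropy. A minimal standard-library sketch, with hypothetical function names, subtracts the maximum logit before exponentiating so that large logits do not overflow:

```python
import math


def log_softmax(logits):
    # log-sum-exp trick: shift by the max so every exp() argument is <= 0.
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - lse for z in logits]


def cross_entropy(logits, target):
    # Negative log-probability of the correct class.
    return -log_softmax(logits)[target]


loss_small = cross_entropy([2.0, 1.0, 0.1], 0)
# Same logits shifted by about 998: a naive exp(1000.0) would overflow,
# but the shifted version gives (almost exactly) the same loss.
loss_huge = cross_entropy([1000.0, 999.0, 998.1], 0)
print(loss_small, loss_huge)
```

Softmax is invariant to adding a constant to every logit, which is why the shift changes nothing mathematically and only protects the floating-point arithmetic.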
Problems
- [hard] Build micrograd yourself, then EXTEND it: add tanh/ReLU/exp/pow as differentiable ops, then implement a 2-layer MLP and train it to classify a non-linear toy dataset (sklearn's make_moons) — Writing reverse-mode autodiff that correctly accumulates gradients across a shared graph is genuinely hard and deeply satisfying. Don't just copy Karpathy — close the video and rebuild it from the Value class up.
- [hard] Do the micrograd lecture exercises: derive backprop by hand for a small expression, then verify your autograd engine against PyTorch's gradients to machine precision — The manual-gradient derivation is the Jane-Street-energy part: no framework, just you and the chain rule. Matching PyTorch to machine precision is the proof you got it right.
- [brutal] Implement an MLP from scratch in pure NumPy (no autograd) for MNIST — forward and backward passes written by hand, with softmax + cross-entropy and Adam — and reach >98% test accuracy — Hand-coding the full vectorized backward pass of a multi-layer net with softmax + cross-entropy and an Adam optimizer, with correct gradient shapes across batches, is a rite of passage. When 98% prints, you've earned it.
- [brutal] Do the CS231n assignment 1+2 layers COLD: implement affine/ReLU/softmax/batchnorm/dropout forward+backward in NumPy, gradient-check each, then build a fully-connected net — The canonical academic version of this stage. Numerical gradient-checking every layer you write is the discipline that makes your hand-coded backprop trustworthy — and the assignments are famously rigorous.
- [hard] Break your own net on purpose: induce vanishing gradients, dead ReLUs, and divergence, then fix each — keep a debugging log — Most people only ever see nets that work. Deliberately breaking and fixing training dynamics builds the diagnostic intuition that makes you dangerous.
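As a starting point for the autograd problems above, here is a deliberately tiny sketch in the spirit of a micrograd-style `Value` class. It is an illustration under simplifying assumptions (only add, mul and tanh, no scalar coercion), not Karpathy's implementation. Note the `+=` in every backward rule: that is what makes gradients accumulate correctly when a node is used more than once.

```python
import math


class Value:
    """A scalar that remembers how it was computed (illustrative sketch)."""

    def __init__(self, data, parents=()):
        self.data = data
        self.grad = 0.0
        self._parents = parents
        self._backward = lambda: None

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other))

        def _backward():
            self.grad += out.grad
            other.grad += out.grad

        out._backward = _backward
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))

        def _backward():
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad

        out._backward = _backward
        return out

    def tanh(self):
        t = math.tanh(self.data)
        out = Value(t, (self,))

        def _backward():
            self.grad += (1.0 - t * t) * out.grad

        out._backward = _backward
        return out

    def backward(self):
        # Topological order guarantees a node's grad is complete
        # before it is pushed on to its parents.
        topo, seen = [], set()

        def build(v):
            if v not in seen:
                seen.add(v)
                for p in v._parents:
                    build(p)
                topo.append(v)

        build(self)
        self.grad = 1.0
        for v in reversed(topo):
            v._backward()


x, w = Value(0.5), Value(-1.5)
y = (x * w + x).tanh()      # x appears twice, so its gradient must accumulate
y.backward()
print(x.grad, w.grad)
```

By hand, with t = tanh(x*w + x), the chain rule gives dy/dx = (1 - t^2)(w + 1) and dy/dw = (1 - t^2) x, which is what the engine should reproduce.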
Done when — You've built a working autograd engine from scratch, trained a hand-coded NumPy MLP to >98% on MNIST without any framework, and gradient-checked every layer of a CS231n-style net. You can sketch the backward pass of a 2-layer net on a whiteboard and explain why a learning rate that's 10x too high diverges.
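The gradient-checking discipline mentioned above can be sketched for a 2-layer net as follows. This is an assumption-laden toy: it uses tanh rather than ReLU so that central differences are not disturbed by the ReLU kink, random data, and hypothetical helper names.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 4))                 # toy batch: 8 samples, 4 features
y = rng.integers(0, 3, size=8)              # 3 classes
W1, b1 = 0.5 * rng.normal(size=(4, 5)), np.zeros(5)
W2, b2 = 0.5 * rng.normal(size=(5, 3)), np.zeros(3)


def loss_and_grads(W1, b1, W2, b2):
    # Forward: affine -> tanh -> affine -> softmax cross-entropy.
    h = np.tanh(X @ W1 + b1)
    logits = h @ W2 + b2
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    loss = -log_probs[np.arange(len(y)), y].mean()
    # Backward: d(loss)/d(logits) = (softmax - one_hot) / batch_size.
    dlogits = np.exp(log_probs)
    dlogits[np.arange(len(y)), y] -= 1.0
    dlogits /= len(y)
    dW2, db2 = h.T @ dlogits, dlogits.sum(axis=0)
    dh = dlogits @ W2.T
    dz = dh * (1.0 - h**2)                  # tanh'(z) = 1 - tanh(z)^2
    dW1, db1 = X.T @ dz, dz.sum(axis=0)
    return loss, (dW1, db1, dW2, db2)


def numerical_grad(param, eps=1e-5):
    # Central differences, perturbing one entry at a time in place.
    grad = np.zeros_like(param)
    for idx in np.ndindex(param.shape):
        old = param[idx]
        param[idx] = old + eps
        plus = loss_and_grads(W1, b1, W2, b2)[0]
        param[idx] = old - eps
        minus = loss_and_grads(W1, b1, W2, b2)[0]
        param[idx] = old
        grad[idx] = (plus - minus) / (2 * eps)
    return grad


_, analytic = loss_and_grads(W1, b1, W2, b2)
max_errors = [float(np.max(np.abs(a - numerical_grad(p))))
              for a, p in zip(analytic, (W1, b1, W2, b2))]
print(max_errors)
```

If any entry of `max_errors` is large, the corresponding backward rule is the first place to look.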
Stage 3 — The modern toolkit — PyTorch, CNNs, attention, embeddings · 10-14 weeks
Become fluent in PyTorch (you've earned it by suffering through NumPy), then build the modern architecture families — CNNs for vision, embeddings + attention for sequences — culminating in a from-scratch transformer.
Concepts, resources and problems
Concepts — PyTorch mechanics: tensors, autograd, nn.Module, optimizers, the training loop, moving to GPU, and how .backward() is just your micrograd at scale · Convolutions: filters, stride, padding, receptive fields, feature maps, why weight-sharing + locality beats dense layers on images · Classic CNN architectures: LeNet -> AlexNet -> ResNet, and what residual connections solve for gradient flow · Embeddings: dense vector representations of discrete tokens; word2vec intuition; why embeddings are learned lookup tables · Sequence models: RNNs, the vanishing-gradient problem in time, LSTMs/GRUs as a partial fix · Self-attention: queries, keys, values; softmax(QKᵀ/√d)V; why attention replaced recurrence; positional encodings · The transformer block: multi-head attention, causal masking, layer norm, residual connections, the feed-forward sublayer · Transfer learning and fine-tuning: standing on pretrained shoulders · Tokenization (BPE) and the encoder/decoder vs decoder-only distinction
Read — The Annotated Transformer (Harvard NLP) · Jay Alammar — The Illustrated Transformer · Dive into Deep Learning (d2l.ai) — free interactive book · Attention Is All You Need (the original paper)
Watch — Andrej Karpathy — Let's build GPT: from scratch, in code, spelled out · fast.ai — Practical Deep Learning for Coders (full course) · Karpathy — makemore series (building language models, WaveNet, BatchNorm internals)
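The self-attention formula in the concepts above, softmax(QKᵀ/√d)V with causal masking, can be sketched for a single head in NumPy. The projection matrices here are random and the function name is hypothetical; a real transformer block adds multiple heads, an output projection, residuals and layer norm.

```python
import numpy as np


def causal_self_attention(X, Wq, Wk, Wv):
    """Single-head causal attention over a sequence X of shape (T, d_model)."""
    T, d_k = X.shape[0], Wq.shape[1]
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(d_k)
    # Causal mask: position t may not attend to positions after t.
    mask = np.triu(np.ones((T, T), dtype=bool), k=1)
    scores = np.where(mask, -np.inf, scores)
    # Row-wise softmax, shifted by the row max for numerical stability.
    scores = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ V, weights


rng = np.random.default_rng(0)
T, d_model, d_k = 5, 8, 4
X = rng.normal(size=(T, d_model))
Wq, Wk, Wv = (rng.normal(size=(d_model, d_k)) for _ in range(3))
out, weights = causal_self_attention(X, Wq, Wk, Wv)
print(out.shape)
print(np.round(weights, 2))   # lower-triangular attention pattern
```

A useful sanity check is the first position: it can only attend to itself, so its output must equal its own value vector.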
Problems
- [hard] Port your Stage-2 MNIST MLP to PyTorch, then build a CNN that beats it — implement a mini-ResNet with residual blocks from scratch (nn.Module, no torchvision models) and train CIFAR-10 to >90% — Re-implementing residual blocks yourself reveals exactly what the skip connection does to gradient flow. CIFAR-10 to 90% from your own ResNet (not a pretrained one) is a real bar, not a toy.
- [brutal] Build a transformer from scratch in PyTorch (multi-head causal self-attention, positional encoding, the full block) and train a character-level language model on a text corpus — The capstone from-scratch build of the path. Implementing causal masked multi-head attention with correct shapes, then watching it generate Shakespeare-ish text, is the single most rewarding milestone in modern ML. Then study nanoGPT to see the production version.
- [hard] Implement self-attention from the equations alone (no tutorial open) and unit-test that your output matches torch.nn.MultiheadAttention to within floating-point tolerance (same weights loaded into both, compared with torch.allclose) — Closing every tutorial and reconstructing attention from the paper's equations — softmax(QKᵀ/√d)V, heads, masking — is the proof you actually understand it rather than having memorized Karpathy's keystrokes. Exact bit-for-bit equality is not a realistic target, because a different order of floating-point operations changes the last few bits.
- [brutal] Do CS231n's RNN/Transformer assignment cold: implement vanilla RNN, LSTM, and self-attention layers (forward+backward where applicable) for image captioning — The rigorous academic counterpart to Karpathy's build. Implementing the recurrence and attention math under a test harness that grades your gradients is where shaky understanding gets exposed.
- [medium] Kaggle: Digit Recognizer — get a CNN onto the leaderboard, then ablate (no augmentation vs augmentation, with/without batchnorm, with/without residual connections) and quantify each change — Connects your from-scratch CNN to real competition mechanics and forces disciplined ablation — measuring what actually moves the needle instead of cargo-culting tricks.
Done when — You've trained a from-scratch transformer that generates coherent text, a from-scratch ResNet-style CNN to >90% on CIFAR-10, and passed CS231n's hardest assignments cold. You're fluent enough in PyTorch to implement any architecture from its paper, and can explain self-attention's Q/K/V math without notes.
Stage 4 — The practitioner layer — data, evaluation, and winning at Kaggle · 8-12 weeks
Cross the gap between 'works on MNIST' and 'works on messy real data nobody cleaned for you' — data pipelines, rigorous evaluation, fighting overfitting in the wild, and competing seriously on Kaggle.
Concepts, resources and problems
Concepts — Real data pipelines: loading, cleaning, handling missing values, categorical encoding, feature scaling — and doing it without leakage · Feature engineering as the highest-leverage skill: domain-driven features, target/mean encoding done safely (out-of-fold), interaction terms · Robust evaluation: stratified k-fold, group/time-series cross-validation, why your CV must mirror the test split · The validation set is sacred: how to set up a CV scheme you can trust so the leaderboard doesn't surprise you · Overfitting in the wild: leaderboard overfitting, the public/private split, why your local CV is more trustworthy than the public LB · Class imbalance, proper metrics for it, threshold tuning, probability calibration · Gradient boosting in practice: XGBoost / LightGBM / CatBoost, the tabular workhorses, and hyperparameter tuning (Optuna) · Ensembling and stacking: blending diverse models to win the last few decimal places · Experiment discipline: tracking runs (Weights & Biases / MLflow), reproducibility, seeds, and not fooling yourself · Error analysis: looking at what your model gets wrong and why, instead of staring at a single aggregate number
Read — Andrew Ng — Machine Learning Yearning (free book) · Kaggle Learn micro-courses + winning-solution write-ups · Approaching (Almost) Any Machine Learning Problem — Abhishek Thakur (free PDF on GitHub) · Google — Rules of Machine Learning (Martin Zinkevich)
Watch — fast.ai — Practical Deep Learning for Coders (the data + Kaggle-focused lessons) · Abhishek Thakur — Kaggle competition walkthroughs (YouTube channel) · StatQuest — XGBoost, ROC/AUC, and Cross-Validation explained
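The concepts above call for target/mean encoding done safely, out-of-fold. A minimal standard-library sketch follows, assuming a deterministic fold assignment (row i goes to fold i % n_folds) and a hypothetical function name; real pipelines usually shuffle or stratify the folds and add smoothing.

```python
from collections import defaultdict


def oof_target_encode(categories, targets, n_folds=3):
    """Encode each row with the target mean of its category, computed
    only from rows in the OTHER folds, so a row never sees its own label."""
    n = len(categories)
    global_mean = sum(targets) / n
    encoded = [0.0] * n
    for fold in range(n_folds):
        sums, counts = defaultdict(float), defaultdict(int)
        # Statistics from every row outside the current fold.
        for i in range(n):
            if i % n_folds != fold:
                sums[categories[i]] += targets[i]
                counts[categories[i]] += 1
        # Apply them to rows inside the current fold.
        for i in range(n):
            if i % n_folds == fold:
                c = categories[i]
                # Unseen category in the training folds: fall back to the global mean.
                encoded[i] = sums[c] / counts[c] if counts[c] else global_mean
    return encoded


cats = ["a", "a", "a", "b", "b", "c"]
ys = [1, 0, 1, 0, 0, 1]
enc = oof_target_encode(cats, ys)
print(enc)   # [0.5, 1.0, 0.5, 0.0, 0.0, 0.5]
```

The single "c" row is the telling case: a naive encoding would give it 1.0, its own label, which is exactly the leakage the out-of-fold scheme prevents.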
Problems
- [hard] Kaggle: House Prices — Advanced Regression Techniques. Build a leak-free CV pipeline, engineer features from all 79 columns, and reach top ~10% with a tuned gradient-boosting + linear-model ensemble — The classic tabular grind. The hard part is disciplined feature engineering and a CV scheme whose score actually predicts the leaderboard — the core practitioner skill.
- [brutal] Enter a LIVE Kaggle competition (a Playground Series monthly or any active tabular comp) and finish in the top 25% of real, motivated competitors — A finished competition has all answers in the discussion forum; a live one does not. Competing blind against thousands of people, with a private leaderboard you can't see, is the realest test of whether you can actually do ML.
- [brutal] Go for a medal: in a LIVE medal-awarding competition (typically Featured), finish top 10% (bronze-medal territory in large competitions) with a trustworthy CV and a stacked ensemble — then post your write-up — This is where the real difficulty lives: real money, Grandmasters, a hidden private leaderboard, and no forum answers. A medal is an externally-verified mark of skill that top-25% is not.
- [hard] Deliberately overfit the public leaderboard, then watch your private score collapse — then fix your CV so it never happens again. Write up what you learned. — Feeling leaderboard overfitting in your own gut — the public-vs-private gap — is the lesson that makes you trustworthy. You can't learn it from a blog; you have to get burned once.
- [hard] Reproduce a classic tabular result with an XGBoost/LightGBM/CatBoost bake-off plus a stacked ensemble, tuned with Optuna, and quantify exactly how much each layer of complexity buys you — Forces you to confront the diminishing returns of complexity — often a well-tuned single model beats a baroque stack, and knowing when to stop is mastery.
Done when — You've earned a medal (or at minimum placed top 25%) in a LIVE Kaggle competition against real competitors, with a cross-validation scheme you trust more than the public leaderboard — and you were right. You can take a raw, messy CSV and produce a leak-free, well-evaluated model with a write-up explaining every decision.
Stage 5 — Paper reproduction & consolidation — earn the right to specialize · 8-14 weeks
Prove you can stand on the frontier's doorstep: read a real ML paper and reproduce its core result from scratch, then consolidate everything into a coherent mental model so any specialization (LLMs, RL, vision, systems) is a short hop.
Concepts, resources and problems
Concepts — How to read an ML paper: the three-pass method (skim, read, reproduce), finding the one key idea · Reproducibility as a discipline: matching reported numbers, controlling seeds, the gap between paper and reality · Reading code alongside papers: nanoGPT, minGPT, and reference implementations as a learning tool · Connecting the dots: how regularization, optimization, and architecture choices recur across every model you've built · Compute literacy: training on GPUs, mixed precision (bf16/fp16), gradient accumulation, when you're memory-bound vs compute-bound · Knowing the map of specializations (LLMs/NLP, computer vision, RL, generative/diffusion, ML systems/MLOps, interpretability) so you can choose your next obsession deliberately · Communicating ML: writing up a result clearly enough that someone else could reproduce it (the Distill standard) · Building intuition for what's hard vs easy in current research, and where the open problems are
Read — Distill.pub — the archive of clearly-explained ML research · Lilian Weng's blog (Lil'Log) · ML Reproducibility Challenge (official site) · arXiv cs.LG / cs.CL — and the 'foundational papers' you've earned the background to read
Watch — Yannic Kilcher — paper explanation videos · Karpathy — Let's reproduce GPT-2 (124M) · fast.ai — Part 2: From Deep Learning Foundations to Stable Diffusion
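Gradient accumulation, listed under compute literacy above, rests on a simple identity: averaging the gradients of equal-sized micro-batches gives the full-batch gradient of a mean loss. The sketch below checks that on a toy linear model; the data, model and names are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(32, 4))     # one "large" batch of 32 samples
y = rng.normal(size=32)
w = rng.normal(size=4)


def mse_grad(Xb, yb, w):
    # Gradient of mean((Xb @ w - yb)^2) with respect to w.
    return 2.0 * Xb.T @ (Xb @ w - yb) / len(yb)


full_grad = mse_grad(X, y, w)

# Accumulate over 4 equal micro-batches, scaling each by 1 / num_micro.
num_micro = 4
accum = np.zeros_like(w)
for Xb, yb in zip(np.split(X, num_micro), np.split(y, num_micro)):
    accum += mse_grad(Xb, yb, w) / num_micro

print(np.max(np.abs(full_grad - accum)))
```

The 1 / num_micro scaling is the step that is easy to forget; without it the accumulated gradient is num_micro times too large. The identity also breaks for layers whose output depends on batch statistics, such as batch normalization.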
Problems
- [brutal] Pick ONE classic paper (ResNet, Dropout, BatchNorm, word2vec, or the original GAN) and reproduce its headline result from scratch, matching the reported numbers within a reasonable margin — The single most valuable thing you can do at this level. Reading equations and turning them into working, number-matching code is exactly what separates practitioners from researchers. Where to find more: the MLRC accepted-reports list and arXiv.
- [brutal] Reproduce GPT-2 (124M) following Karpathy, then make ONE original modification and measure its effect (a different positional encoding — e.g. RoPE, a different activation, or a different optimizer) — Reproduction proves you can rebuild the state of the art; an original ablation proves you can ASK a question and answer it empirically — the first taste of actual research.
- [brutal] Implement and reproduce a deep RL algorithm (DQN or PPO) from the paper using OpenAI Spinning Up as your reference, and get it to actually learn a control task — RL is notoriously the hardest thing to reproduce — tiny bugs silently break training and there's no loss curve telling you you're wrong. Getting PPO to actually learn is a legendary rite of passage and the ultimate 'debug from first principles' test.
- [hard] Write a Distill-quality blog post explaining one concept you struggled with (backprop, attention, or batchnorm) so clearly that a past-you would have understood it on first read — Teaching is the final compression test of understanding. If you can't explain it cleanly with your own diagrams, you don't fully own it yet. This is also how you build a public reputation in the field.
Done when — You've reproduced a published ML result from the paper alone (numbers matching), made and measured one original modification, and written a public explainer good enough that strangers thank you for it. You now know the field's map well enough to choose your next specialization on purpose, not by accident.
Projects
- scikit-from-scratch: a classical ML library in pure NumPy — A clean, tested Python package implementing linear & logistic regression (closed-form + gradient descent), a decision tree, a random forest, gradient boosting, k-NN, k-means, and PCA — each with a fit/predict API mirroring sklearn, and a test suite that asserts your outputs match sklearn within tolerance.
- micrograd++: an autograd engine and neural-net library you built — A from-scratch reverse-mode automatic differentiation engine (à la Karpathy's micrograd) extended into a usable mini-framework: Tensor/Value class with a full set of differentiable ops, an nn module with Linear/MLP/activation layers, optimizers (SGD + momentum + Adam), and loss functions — all trainable on MNIST to >98%.
- A from-scratch Transformer + character-level language model — Implement a transformer from the equations up in PyTorch — multi-head causal self-attention, positional encodings, the full block with layer norm and residuals — and train it into a working character-level language model that generates plausible text from a corpus of your choice (Shakespeare, code, your own writing).
- A serious live Kaggle campaign for a medal — Pick an active Kaggle competition and run a full, disciplined campaign: thorough EDA, a leak-free cross-validation scheme you trust, iterative feature engineering, a tuned model or stacked ensemble, experiment tracking, and a public write-up of your approach and what you learned. Goal: a medal (top 10%), or at minimum top 25%, against real competitors.
- Paper reproduction: rebuild a published result from scratch (FLAGSHIP) — Choose one landmark paper (ResNet, BatchNorm, Dropout, word2vec, the original GAN, or reproduce GPT-2 124M with Karpathy) and reproduce its headline result from the paper alone — matching the reported numbers — then make and measure one original modification of your own. The portfolio centerpiece of this entire path.
Going harder
Hard problem arena — 9 brutal problems
- [hard] Karpathy's micrograd / makemore / 'build GPT from scratch' challenges — The defining from-scratch gauntlet for this path. Close the video, rebuild the autograd engine, the MLP, and the transformer from a blank file, then match PyTorch's gradients to machine precision. The exercises at the end of each lecture are where the real difficulty hides.
- [hard] Stanford CS231n assignments, done cold — Implement affine/ReLU/softmax/batchnorm/dropout layers, full backprop, a CNN, and RNN/LSTM/attention from scratch in NumPy under a strict gradient-checking harness, then in a framework. The canonical academic version of this roadmap's from-scratch builds — famously rigorous, and unforgiving of hand-waving.
- [brutal] Reproduce GPT-2 (124M) end-to-end — Rebuild a real published model from scratch, confronting the compute, data, and engineering reality professionals face (mixed precision, data loading, ~$10 of GPU). Matching the loss curve is brutal; making one original, measured modification afterward is where you cross from student to researcher.
- [legendary] ML Reproducibility Challenge — reproduce a top-conference paper — A structured, community-run attempt to reproduce papers accepted at NeurIPS/ICML/ICLR, with a real venue to submit your report to. Pick one, rebuild its core claim from the equations, and report whether it holds. As close to frontier research as a self-learner can get.
- [legendary] Live 'Featured' Kaggle competitions (prize-money tier) — The hardest competitive ML on Earth lives here: real money, Grandmasters, a hidden private leaderboard, and no answers in the forum because nobody knows them yet. A medal (especially gold/solo-gold) is a genuine, externally-verified mark of skill.
- [legendary] Reproduce a deep RL algorithm (PPO/DQN) until it actually learns — RL is notoriously the hardest thing to reproduce — tiny bugs silently break training with no error and no loss signal to guide you. Implementing PPO or DQN from the paper and getting it to actually solve a control task is a legendary rite of passage in debugging from first principles.
- [brutal] Jane Street Monthly Puzzles — The original 'go hard' energy: math, logic, and probability puzzles released roughly monthly with a deep archive. Not ML-specific, but exactly the lateral problem-solving and probabilistic reasoning that sharpens the mind ML rewards. Race the archive when you want a pure brain-bender.
- [brutal] Project Euler (hard tier, 300+) — Computational math problems where brute force is hopeless and you need an insight. The higher-numbered problems demand genuine mathematical cleverness plus efficient code — the same muscle that makes you good at the numerics under ML.
- [hard] Advent of Code (later days) — An annual December puzzle series whose final-week problems combine algorithmic depth with brutal edge-case handling under time pressure. The whole archive is open year-round — a fast, addictive way to keep the implement-it-correctly-and-fast muscle warm between ML builds.
Keep curious
Blogs, people, communities, rabbit holes
- BLOGS — Lilian Weng's Lil'Log (lilianweng.github.io) for deep survey posts; Chris Olah & the Distill.pub archive (distill.pub) for the gold standard of ML explanation; Sebastian Raschka's 'Ahead of AI' (magazine.sebastianraschka.com) for rigorous, practitioner-grade deep dives; Jay Alammar (jalammar.github.io) for the best visual explainers; Karpathy's blog (karpathy.github.io) — read every post, twice.
- NEWSLETTERS — The Batch by Andrew Ng / DeepLearning.AI (deeplearning.ai/the-batch, the canonical weekly field digest); Sebastian Raschka's Ahead of AI; Import AI by Jack Clark (jack-clark.net, policy + frontier signal); The Gradient (thegradient.pub) for thoughtful long-form.
- PEOPLE TO FOLLOW — Andrej Karpathy (@karpathy, the patron saint of learning ML from scratch), Jeremy Howard (@jeremyphoward, fast.ai), Sebastian Raschka (@rasbt), Andrew Ng (@AndrewYNg), Yann LeCun (@ylecun), Chris Olah (@ch402), Lilian Weng (@lilianweng), François Chollet (@fchollet), Jay Alammar (@jayalammar). Follow them on X and their blogs, and read what they read.
- COMMUNITIES — r/MachineLearning (papers & discussion) and r/learnmachinelearning (the learning journey); the fast.ai forums (forums.fast.ai) — one of the kindest, highest-signal learning communities online; Kaggle's discussion forums (the single best place to learn how strong practitioners actually think); the EleutherAI and Hugging Face Discords for open-source frontier work; the nanoGPT GitHub discussions.
- COMPETITIONS — Kaggle is home base (start with Getting Started, graduate to the monthly Playground Series, then Featured competitions for medals). Also: the ML Reproducibility Challenge (reproml.org, now a NeurIPS track), DrivenData (drivendata.org) for social-impact competitions, and AIcrowd (aicrowd.com) for research-flavored challenges.
- CONFERENCES (watch the talks free online) — NeurIPS, ICML, ICLR are the big three for ML research; CVPR/ICCV for vision, ACL/EMNLP for NLP. You don't need to attend — recorded talks, tutorials, and the papers are all online, and skimming each year's 'best paper' awards is a great frontier pulse-check.
- PRIMARY SOURCES — arXiv (arxiv.org, cs.LG / cs.CL / cs.CV) for where the field publishes, and Hugging Face Papers (huggingface.co/papers) to find papers WITH runnable code and community discussion (it absorbed the now-defunct Papers with Code). Build the habit of reading one paper a week with the three-pass method; reproduce one a quarter.
- TOOLS TO GO DEEP ON — PyTorch internals and torch.compile; Hugging Face (transformers, datasets, the ecosystem, and the free huggingface.co/learn courses); Weights & Biases or MLflow for experiment tracking; nanoGPT and minGPT as reference implementations to study line by line; Optuna for hyperparameter search.
- FRONTIER RABBIT HOLES — mechanistic interpretability (Anthropic's transformer-circuits.pub, and Neel Nanda's 'how to get started' guide on the Alignment Forum + his TransformerLens library — 'what is the model actually computing?'); the scaling-laws literature (Chinchilla, the original Kaplan paper); diffusion & generative modeling (fast.ai Part 2 reimplements Stable Diffusion from scratch); efficient training / systems-for-ML, a natural fit for your backend brain — FlashAttention, quantization (GPTQ/AWQ), and distributed training (FSDP, DeepSpeed).
- IF THIS CLICKS, GO DEEPER HERE — LLMs/NLP: Karpathy's later videos + the Hugging Face course (huggingface.co/learn); Deep RL: OpenAI Spinning Up (spinningup.openai.com) + David Silver's RL course on YouTube; Computer Vision: finish CS231n; ML Systems/MLOps: 'Designing Machine Learning Systems' by Chip Huyen + Made With ML (madewithml.com); Theory: ESL cover-to-cover, then Kevin Murphy's 'Probabilistic Machine Learning' (probml.github.io, free).
- MATH BACKBONE TO KEEP SHARPENING — 3Blue1Brown for geometric intuition; Mathematics for Machine Learning (mml-book.github.io, free) when you want the linear algebra / calculus / probability foundations made rigorous; Boyd & Vandenberghe's Convex Optimization (stanford.edu/~boyd/cvxbook, free) once optimization fascinates you.
How you'll know you've actually got it
- You can derive backpropagation for an arbitrary feed-forward network on a whiteboard, from the chain rule, with no notes — and explain why a learning rate 10x too large diverges.
- Given a raw, messy CSV nobody cleaned, you can produce a leak-free, properly cross-validated model with a write-up justifying every preprocessing and evaluation choice.
- You've implemented linear/logistic regression, a decision tree, a kernel SVM, a transformer, and an autograd engine from scratch, and your gradients match PyTorch to machine precision.
- You can read a new ML paper with the three-pass method and reproduce its core result from the equations alone — matching the reported numbers within a reasonable margin.
- When a training run misbehaves, you debug it methodically from first principles (data? init? LR? a real bug?) instead of randomly tweaking hyperparameters and hoping.
- You've earned a medal (or at least placed top 25%) in a LIVE Kaggle competition against motivated competitors — with a local CV score you trusted more than the public leaderboard, and were right.
- You can explain the bias-variance tradeoff, regularization, and self-attention clearly enough that a confused beginner suddenly gets it — teaching is effortless because the understanding is real.
- You instinctively reach for the simplest model that could work and can articulate exactly when added complexity (a deeper net, a bigger ensemble) is and isn't worth it.
- You've written at least one public explainer or reproduction that strangers found genuinely useful — your understanding now produces artifacts others learn from.
- You no longer feel the field is a pile of disconnected tricks: regularization, optimization, and architecture feel like one coherent story, and you can place any new technique on that map.
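The first checkpoint in the list above asks why a learning rate 10x too large diverges. On a one-dimensional quadratic the answer is a single multiplication per step, which this sketch makes concrete (the function name and the specific rates are illustrative assumptions):

```python
def run_gd(lr, curvature=1.0, w0=1.0, steps=50):
    """Gradient descent on f(w) = 0.5 * curvature * w**2.

    Each step multiplies w by (1 - lr * curvature), so the iterates
    shrink only when |1 - lr * curvature| < 1, i.e. lr < 2 / curvature.
    """
    w = w0
    for _ in range(steps):
        w -= lr * curvature * w      # f'(w) = curvature * w
    return w


good = run_gd(lr=0.3)    # factor 0.7 per step: converges toward 0
bad = run_gd(lr=3.0)     # factor -2 per step: |w| doubles and the sign flips
print(good, bad)
```

In a real network the same logic applies along the direction of highest curvature: once the step size exceeds roughly 2 divided by that curvature, every update overshoots by more than it corrects.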