My Learning Hub

Mathematics for ML and Quant

Build the rigorous mathematical bedrock — linear algebra, optimization, probability, and stochastic processes — shared by machine learning and quantitative finance, with proof-level fluency and the ability to crack Jane Street / Putnam-grade problems.

Every model you'll ever build in ML or quant is math wearing an API.

The roadmap

Stage 1 — Linear Algebra: The Geometry of Data · 8–10 weeks
Stop seeing matrices as number grids and start seeing them as linear transformations. Develop visual and algebraic fluency with vector spaces, bases, eigendecomposition, and SVD — the single most-used math in ML.

Concepts, resources and problems

Concepts — Vectors, vector spaces, span, linear independence, basis, dimension · Linear transformations as the meaning of a matrix; the matrix as basis-dependent coordinates · Matrix multiplication as composition; the four fundamental subspaces (column, row, null, left-null) · Systems of equations, rank, Gaussian elimination, LU decomposition · Determinant as signed volume scaling; geometric intuition over cofactor formulas · Dot products, projections, orthogonality, Gram-Schmidt, QR decomposition · Eigenvalues and eigenvectors: invariant directions; diagonalization; what it means geometrically · Symmetric matrices and the spectral theorem; positive (semi)definiteness · Singular Value Decomposition: the crown jewel — every matrix as rotate-stretch-rotate; low-rank approximation and the Eckart-Young theorem · Change of basis; why PCA is just eigendecomposition of a covariance matrix

Read — Mathematics for Machine Learning — Ch. 2 (Linear Algebra), Ch. 3 (Analytic Geometry), Ch. 4 (Matrix Decompositions) · Introduction to Linear Algebra (6th ed.) — Gilbert Strang (companion to MIT 18.06) · MIT 18.06SC Linear Algebra — full course materials, psets, exams, solutions · immersivemath — Immersive Linear Algebra (interactive figures) · Linear Algebra Done Right (4th ed., free) — Sheldon Axler

Watch — 3Blue1Brown — Essence of Linear Algebra (full playlist) · MIT 18.06 — Gilbert Strang lecture series · Steve Brunton — Singular Value Decomposition series

Problems

Done when — Implement PCA and SVD from scratch in NumPy (no np.linalg.svd / no sklearn — use power iteration or the QR algorithm), use them to compress a real image and reduce a real dataset, AND derive on paper why the top-k singular vectors give the optimal low-rank approximation. You can explain, at a whiteboard, what each of the four fundamental subspaces is for any given matrix.
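
As one possible starting point for the power-iteration route, here is a minimal sketch. The function names (`top_k_svd`, `pca_project`), the fixed iteration count, and the deflation strategy are illustrative assumptions, not a prescribed design; it also assumes the leading singular values are well separated and that k does not exceed the rank.

```python
import numpy as np

def top_k_svd(A, k, n_iter=500, seed=0):
    """Approximate the top-k singular triplets of A (hypothetical helper).

    Uses power iteration on R^T R, then deflates R by the triplet just found.
    Assumes k <= rank(A) and well-separated singular values.
    """
    rng = np.random.default_rng(seed)
    residual = np.asarray(A, dtype=float).copy()
    us, sigmas, vs = [], [], []
    for _ in range(k):
        v = rng.standard_normal(residual.shape[1])
        v /= np.linalg.norm(v)
        for _ in range(n_iter):
            # One power-iteration step on residual^T residual.
            w = residual.T @ (residual @ v)
            norm_w = np.linalg.norm(w)
            if norm_w == 0.0:
                break
            v = w / norm_w
        sigma = np.linalg.norm(residual @ v)
        if sigma < 1e-12:
            break  # residual is numerically zero: rank(A) < k
        u = residual @ v / sigma
        us.append(u)
        sigmas.append(sigma)
        vs.append(v)
        # Deflate: remove the rank-one component just found.
        residual = residual - sigma * np.outer(u, v)
    return np.array(us).T, np.array(sigmas), np.array(vs)

def pca_project(X, k):
    """Project the rows of X onto the top-k principal directions."""
    Xc = X - X.mean(axis=0)          # center each feature
    _, _, Vt = top_k_svd(Xc, k)      # rows of Vt are principal directions
    return Xc @ Vt.T

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    A = rng.standard_normal((8, 6))
    U, S, Vt = top_k_svd(A, 2)
    A2 = U @ np.diag(S) @ Vt
    print("rank-2 Frobenius error:", np.linalg.norm(A - A2))
```

By the Eckart-Young theorem, the Frobenius error of the rank-k truncation should equal the square root of the sum of the squared discarded singular values, which is a useful check once a full decomposition is available.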

Stage 2 — Calculus, Multivariable & Optimization: How Models Learn · 8–10 weeks
Master gradients, the chain rule in many dimensions, and the optimization machinery (convexity, Lagrange multipliers, gradient descent) that is the learning step in every ML model.

Concepts, resources and problems

Concepts — Limits, derivatives, the derivative as best linear approximation; Taylor series · Multivariable calculus: partial derivatives, the gradient, directional derivatives · The Jacobian and Hessian; what they tell you about a surface · The multivariable chain rule — the mathematical heart of backpropagation · Gradient descent, learning rates, momentum; why we follow the negative gradient · Convex sets and convex functions; why convexity guarantees that every local minimum is a global minimum · Constrained optimization: Lagrange multipliers and the geometry of tangent constraints · KKT conditions and Lagrangian duality (a first taste) · Vector and matrix calculus: differentiating scalar-by-vector and vector-by-vector (the 'matrix cookbook' skills) · Automatic differentiation: forward vs reverse mode, and why reverse mode = backprop

Read — Mathematics for Machine Learning — Ch. 5 (Vector Calculus), Ch. 7 (Continuous Optimization) · Convex Optimization — Boyd & Vandenberghe (free PDF, Stanford) · The Matrix Cookbook — Petersen & Pedersen (official DTU PDF) · Khan Academy — Multivariable Calculus (free)

Watch — 3Blue1Brown — Essence of Calculus (full playlist) · 3Blue1Brown — Neural Networks (backpropagation episodes) · Stanford EE364A — Convex Optimization I (Stephen Boyd, full lectures)

Problems

Done when — Build a tiny autograd engine from scratch (à la micrograd): a scalar Value class that records operations and backpropagates gradients through a small neural net you then train on real data. Separately, implement gradient descent AND Newton's method on a convex function and visualize their convergence. You can prove a given function is convex and set up a Lagrangian for a constrained problem without reference.
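
To make the "scalar Value class" concrete, the following is a minimal sketch in the spirit of micrograd, not a copy of it. The attribute names (`_parents`, `_backward`) and the choice of supported operations (addition, multiplication, tanh) are assumptions made for illustration; a usable engine would need more operations.

```python
import math

class Value:
    """A scalar that records how it was computed so gradients can flow back."""

    def __init__(self, data, parents=()):
        self.data = float(data)
        self.grad = 0.0
        self._parents = parents
        self._backward = lambda: None  # leaf nodes have nothing to propagate

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data + other.data, (self, other))
        def _backward():
            # d(a + b)/da = 1 and d(a + b)/db = 1
            self.grad += out.grad
            other.grad += out.grad
        out._backward = _backward
        return out

    __radd__ = __add__

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data * other.data, (self, other))
        def _backward():
            # d(a * b)/da = b and d(a * b)/db = a
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward
        return out

    __rmul__ = __mul__

    def tanh(self):
        t = math.tanh(self.data)
        out = Value(t, (self,))
        def _backward():
            # d tanh(x)/dx = 1 - tanh(x)^2
            self.grad += (1.0 - t * t) * out.grad
        out._backward = _backward
        return out

    def backward(self):
        """Reverse-mode sweep: visit nodes in reverse topological order."""
        order, seen = [], set()
        def visit(node):
            if node not in seen:
                seen.add(node)
                for parent in node._parents:
                    visit(parent)
                order.append(node)
        visit(self)
        self.grad = 1.0
        for node in reversed(order):
            node._backward()

if __name__ == "__main__":
    x, w, b = Value(0.5), Value(-1.5), Value(0.25)
    y = (w * x + b).tanh()   # a single tanh neuron
    y.backward()
    print("dy/dw =", w.grad, "dy/dx =", x.grad)
```

Gradients accumulate with `+=` so that a value used more than once (for example `x * x`) receives contributions from every use, which is the multivariable chain rule in action. In this sketch, calling `backward()` twice without resetting `grad` would double-count.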

Stage 3 — Probability: Reasoning Under Uncertainty · 10–12 weeks
Develop deep probabilistic intuition — the kind that cracks brain teasers instantly. Master combinatorics, random variables, expectation, the key distributions, conditioning, Bayes, and concentration inequalities.

Concepts, resources and problems

Concepts — Counting and combinatorics: the multiplication rule, permutations, combinations, stars-and-bars, inclusion-exclusion · Probability axioms, sample spaces, events; the naive vs general definition · Conditional probability, independence, the law of total probability, Bayes' rule (and Bayesian thinking) · Random variables: PMFs, PDFs, CDFs; expectation and its linearity; variance and its properties · Key distributions and when each arises: Bernoulli, Binomial, Geometric, Poisson, Uniform, Exponential, Normal, Beta, Gamma · Joint, marginal, conditional distributions; covariance and correlation · Expectation via indicator variables and the 'story' proofs (the Stat 110 superpower) · Conditional expectation E[X|Y] as a random variable; the tower property / law of iterated expectation · Moment generating functions; sums of random variables; convolution · Concentration & limit theorems: Markov, Chebyshev, the Law of Large Numbers, the Central Limit Theorem, a first look at Chernoff/Hoeffding bounds · Gambler's ruin, the matching/birthday/Monty Hall classics — and why they trip up intuition

Read — Introduction to Probability — Blitzstein & Hwang (free full PDF) · Stat 110 — Strategic Practice & Homework Problems (with solutions) · Mathematics for Machine Learning — Ch. 6 (Probability & Distributions) · Fifty Challenging Problems in Probability — Mosteller (Dover)

Watch — Harvard Stat 110 — Joe Blitzstein full lecture series · 3Blue1Brown — Probability series (Bayes, binomial, Central Limit Theorem) · MIT 6.041SC — Probabilistic Systems Analysis (Tsitsiklis)

Problems

Done when — Build a small Bayesian inference engine from scratch: implement conjugate-prior updating (Beta-Binomial and Normal-Normal) and a basic Metropolis-Hastings sampler, then use them to recover the parameters of a distribution you generated. Separately: solve at least 40 of Mosteller's 50 unaided, and write up three 'story proofs' (e.g. derive E[X] for the geometric and the expected number of fixed points of a random permutation) in your own words.
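
A rough sketch of the two pieces named above, for the Beta-Binomial case only: the closed-form conjugate update and a random-walk Metropolis-Hastings sampler aimed at the same posterior, so each can check the other. The function names, the Gaussian proposal, and the step size are illustrative assumptions; real use would call for tuning and convergence diagnostics.

```python
import numpy as np

def beta_binomial_update(alpha, beta, successes, failures):
    """Conjugate update: Beta(alpha, beta) prior -> Beta posterior parameters."""
    return alpha + successes, beta + failures

def log_posterior(p, alpha, beta, successes, failures):
    """Unnormalized log posterior of p under a Beta prior and Binomial data."""
    if not 0.0 < p < 1.0:
        return -np.inf
    return ((alpha + successes - 1) * np.log(p)
            + (beta + failures - 1) * np.log(1.0 - p))

def metropolis_hastings(log_target, n_samples, start=0.5, step=0.1, seed=0):
    """Random-walk Metropolis-Hastings with a symmetric Gaussian proposal."""
    rng = np.random.default_rng(seed)
    x, logp = start, log_target(start)
    samples = np.empty(n_samples)
    for i in range(n_samples):
        proposal = x + step * rng.standard_normal()
        logp_new = log_target(proposal)
        # Symmetric proposal, so the acceptance ratio is target(new)/target(old).
        if np.log(rng.random()) < logp_new - logp:
            x, logp = proposal, logp_new
        samples[i] = x
    return samples

if __name__ == "__main__":
    rng = np.random.default_rng(42)
    flips = rng.random(200) < 0.3            # simulated data, true p = 0.3
    s, f = int(flips.sum()), int((~flips).sum())
    a_post, b_post = beta_binomial_update(1.0, 1.0, s, f)
    draws = metropolis_hastings(
        lambda p: log_posterior(p, 1.0, 1.0, s, f), n_samples=20_000)
    print("closed-form posterior mean:", a_post / (a_post + b_post))
    print("MH posterior mean (after burn-in):", draws[2_000:].mean())
```

The two printed means should be close; how close depends on the chain length and step size, so treat any tolerance as something to justify rather than assume.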

Stage 4 — Statistics & Stochastic Processes: From Data to Dynamics · 12–14 weeks
Move from static probability to inference and processes that evolve in time: estimation, hypothesis testing, Markov chains, martingales, and a first taste of measure-theoretic rigor — the exact toolkit of quant modeling and statistical ML.

Concepts, resources and problems

Concepts — Estimation: method of moments, maximum likelihood (MLE), bias/variance, consistency, efficiency · The Cramér–Rao bound and Fisher information; what 'best estimator' means · Confidence intervals and the bootstrap; the sampling distribution · Hypothesis testing: null/alternative, p-values, type I/II error, power, the Neyman–Pearson lemma · Bayesian vs frequentist inference; priors, posteriors, credible vs confidence intervals · Markov chains: transition matrices, stationary distributions, ergodicity, mixing; PageRank as an eigenvector · Poisson processes and continuous-time Markov chains · Martingales: the defining property, the optional stopping theorem, and why they model fair games / no-arbitrage · Random walks, gambler's ruin revisited, hitting times · Brownian motion as the limit of random walks; a first look at Itô calculus and the role of stochastic calculus in pricing · A taste of measure-theoretic probability: why we need σ-algebras, the difference between 'almost surely' and 'surely', and what a measurable function is

Read — Mathematics for Machine Learning — MLE/MAP & statistics sections (Ch. 6, 8–9) · MIT 6.262 — Discrete Stochastic Processes (Gallager, full materials) · All of Statistics — Larry Wasserman (free CMU PDF) · Probability: Theory and Examples — Rick Durrett (free PDF, Duke) · Stochastic Calculus for Finance II: Continuous-Time Models — Steven Shreve

Watch — MIT 6.262 — Discrete Stochastic Processes lectures (Robert Gallager) · Steve Brunton — Eigenvalues/eigenvectors & PageRank (Markov-chain connection) · Ben Lambert — Maximum Likelihood & hypothesis-testing econometrics playlist

Problems

Done when — Build a Markov-chain Monte Carlo simulator AND a Markov-chain text generator from scratch: train transition probabilities on a real corpus to generate text, and use MCMC (Metropolis-Hastings) to estimate a quantity (e.g. π, or a Bayesian posterior) — then, for a finite chain, verify your simulated stationary distribution matches the analytic one: the left eigenvector of the transition matrix for eigenvalue 1, normalized to sum to 1. Separately: implement MLE for a model from scratch and confirm it against the analytic estimator, and solve at least one full Jane Street puzzle end-to-end.
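
The simulation-versus-eigenvector check can be sketched as follows. The 3-state transition matrix is made up for illustration, and the helper names are assumptions. Note that the stationary distribution is a left eigenvector of P, so the code takes eigenvectors of the transpose.

```python
import numpy as np

def stationary_by_eigenvector(P):
    """Stationary distribution as the eigenvalue-1 left eigenvector of P."""
    eigvals, eigvecs = np.linalg.eig(P.T)      # left eigenvectors of P
    idx = np.argmin(np.abs(eigvals - 1.0))     # pick the eigenvalue closest to 1
    pi = np.real(eigvecs[:, idx])
    return pi / pi.sum()                       # normalize (also fixes the sign)

def stationary_by_simulation(P, n_steps=100_000, start=0, seed=0):
    """Estimate the stationary distribution from long-run visit frequencies."""
    rng = np.random.default_rng(seed)
    n = P.shape[0]
    cdf = np.cumsum(P, axis=1)                 # row-wise cumulative probabilities
    uniforms = rng.random(n_steps)
    counts = np.zeros(n)
    state = start
    for t in range(n_steps):
        # Inverse-CDF sampling of the next state; min() guards against rounding.
        nxt = int(np.searchsorted(cdf[state], uniforms[t], side="right"))
        state = min(nxt, n - 1)
        counts[state] += 1
    return counts / n_steps

if __name__ == "__main__":
    # Hypothetical 3-state chain; each row sums to 1.
    P = np.array([[0.5, 0.3, 0.2],
                  [0.2, 0.6, 0.2],
                  [0.1, 0.3, 0.6]])
    print("eigenvector :", stationary_by_eigenvector(P))
    print("simulation  :", stationary_by_simulation(P))
```

Agreement here is statistical, not exact: the simulated frequencies fluctuate around the eigenvector at a rate that depends on the chain's mixing time and the number of steps.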

Projects

  • PCA / SVD from scratch (the 'see the math' project) — Implement PCA and the SVD from first principles in NumPy — no np.linalg.svd, no sklearn. Compute the eigendecomposition of the covariance matrix yourself (power iteration or the QR algorithm), then use it to (a) compress an image to k singular values and plot reconstruction error vs k, and (b) reduce a real dataset (e.g. MNIST or Iris) to 2D and visualize it. Validate your output against the library functions to within numerical tolerance.
  • micrograd-style autograd engine + tiny neural net — Build a scalar-valued automatic differentiation engine: a Value class that builds a computation graph and backpropagates gradients via the multivariable chain rule. Then build a small MLP on top of it, train it on a real classification task, and verify every gradient against finite differences. Use Karpathy's micrograd as a reference only after you've attempted it yourself.
  • Bayesian inference engine — Build a small library for Bayesian inference from scratch: conjugate-prior updates (Beta-Binomial, Normal-Normal, Gamma-Poisson) with closed-form posteriors, PLUS a general-purpose Metropolis-Hastings sampler for when there's no conjugacy. Demonstrate it by recovering parameters of distributions you generated, and by running a Bayesian A/B test on simulated data. Visualize how the posterior tightens as data arrives.
  • Markov-chain & Monte-Carlo simulator suite — A combined project tying Stages 3+4 together: (1) a Markov-chain text generator trained on a real corpus; (2) a Monte-Carlo engine that estimates hard-to-compute quantities (π, the value of a probability brain teaser, a Project Euler #84-style game); (3) a Markov-chain analyzer that computes stationary distributions both by simulation and by finding the left eigenvector of the transition matrix for eigenvalue 1, and checks numerically that they agree. Reproduce the Monopoly-odds and gambler's-ruin classics numerically.
  • FLAGSHIP: Reproduce a foundational paper's math end-to-end — Pick one mathematically rich, reproducible paper and rebuild it from the equations up, using only the tools you've built and the math you've internalized. Strong candidates: the original PageRank paper (linear algebra + Markov chains), the Eckart–Young / latent-semantic-analysis low-rank theory, a Gaussian-process regression from scratch (linear algebra + probability + optimization), or a Black-Scholes derivation with Monte-Carlo validation. Write a clean technical report deriving every equation and showing your implementation matches the paper's results.
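
For the Black-Scholes candidate, "Monte-Carlo validation" could look roughly like the sketch below: price a European call with the closed-form formula, then simulate terminal prices under risk-neutral geometric Brownian motion and compare. The parameter values and function names are illustrative assumptions, and the derivation itself is left to the report.

```python
import math
import numpy as np

def norm_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def black_scholes_call(s0, strike, rate, sigma, maturity):
    """Closed-form Black-Scholes price of a European call (no dividends)."""
    vol_sqrt_t = sigma * math.sqrt(maturity)
    d1 = (math.log(s0 / strike) + (rate + 0.5 * sigma**2) * maturity) / vol_sqrt_t
    d2 = d1 - vol_sqrt_t
    return s0 * norm_cdf(d1) - strike * math.exp(-rate * maturity) * norm_cdf(d2)

def monte_carlo_call(s0, strike, rate, sigma, maturity, n_paths=400_000, seed=0):
    """Discounted average payoff under risk-neutral GBM, with its standard error."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal(n_paths)
    # Exact terminal distribution of GBM: no time-stepping needed for this payoff.
    s_t = s0 * np.exp((rate - 0.5 * sigma**2) * maturity
                      + sigma * math.sqrt(maturity) * z)
    payoff = np.maximum(s_t - strike, 0.0)
    discount = math.exp(-rate * maturity)
    price = discount * payoff.mean()
    stderr = discount * payoff.std(ddof=1) / math.sqrt(n_paths)
    return price, stderr

if __name__ == "__main__":
    args = dict(s0=100.0, strike=100.0, rate=0.05, sigma=0.2, maturity=1.0)
    exact = black_scholes_call(**args)
    estimate, stderr = monte_carlo_call(**args)
    print(f"closed form: {exact:.4f}")
    print(f"Monte Carlo: {estimate:.4f} +/- {stderr:.4f}")
```

Reporting the standard error alongside the estimate is what turns "the numbers look close" into a checkable statement: the closed-form price should sit within a few standard errors of the simulated one.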

Going harder

Hard problem arena — 7 brutal problems
  • brutal Jane Street Monthly Puzzles — The defining 'puzzle energy' you asked for. Monthly problems from a top trading firm that almost always reduce to a clever probability, expected-value, optimization, or Markov-chain argument with a computational twist. The archive holds years of them with solutions. Start with older months (solutions available), then race the current month live before the deadline.
  • legendary The Putnam Competition Archive (Kedlaya) — The hardest undergraduate math competition in North America. Every year has 12 problems; the median score is famously near zero (often 0–1 out of 120). The probability, combinatorics, linear-algebra, and inequality/optimization problems are directly relevant — and solving even a few unaided is a real badge. Full solutions and commentary by Bhargava, Kedlaya & Ng included.
  • brutal Project Euler — hard tier (difficulty 70%+) — 900+ problems where the hardest demand combining several advanced ideas AND figuring out what NOT to compute. The 70%+ difficulty problems often need number theory, generating functions, clever DP, or Markov-chain setups, all under a ~1-minute CPU budget. The 'find the pattern, prove it, then compute it efficiently' loop is pure ML/quant training.
  • hard Fifty Challenging Problems in Probability (Mosteller) — Fifty compact, brutal probability brain teasers that recur in quant interviews to this day. No computer needed — pure insight. Solving all fifty unaided is a classic informal qualification for aspiring quants.
  • hard Quant interview problem books — the 'Green Book' (Zhou) and 'Heard on the Street' (Crack) — The two canonical quant-interview brain-teaser collections. The Green Book's brain-teaser, probability, and stochastic-process sections (200+ real problems) are the gold standard for Jane Street / Citadel / Optiver / SIG / HRT-style questions. If the goal is to be tested like a quant, these are the proving grounds.
  • hard Brilliant / 'Quant Guide' interview banks + xkcd-style estimathons — Curated, tagged, difficulty-rated banks of the exact probability / brain-teaser / market-making questions trading firms ask, with a leaderboard. Brainstellar (brainstellar.com) is a free, beloved companion. Grind the 'hard' and 'very hard' tiers under a timer to simulate live interview pressure.
  • legendary Paper reproductions (the research-grade arena) — The ultimate hard problem isn't a puzzle with a known answer — it's taking a dense paper (a Gaussian-process regression, a diffusion model's math, an optimal-transport result) and reproducing every equation and result from scratch. This is where 'I learned the math' becomes 'I can do the work that produces the math.' The hardest and most rewarding tier.

Keep curious

Blogs, people, communities, rabbit holes
  • BLOGS & WRITERS — Terence Tao's blog (terrytao.wordpress.com): the world's most famous mathematician thinking out loud, including his 'career advice' and problem-solving posts. Gregory Gundersen (gregorygundersen.com/blog): exquisitely clear from-scratch derivations of ML math (PCA, Gaussian processes, the reparameterization trick, OLS, Black-Scholes). Cosma Shalizi's notebooks (bactra.org/notebooks): an opinionated, deep map of statistics and stochastic processes. Count Bayesie (countbayesie.com) for probability intuition. Christopher Olah (colah.github.io) for the geometry of deep learning.
  • NEWSLETTERS & FEEDS — For quant: Wilmott (wilmott.com) for practitioner-level discussion, and the QuantStart blog (quantstart.com) for hands-on derivations. For ML math: the Distill.pub archive (distill.pub) and The Gradient (thegradient.pub). Track new work via arxiv-sanity-lite (arxiv-sanity-lite.com) filtered to stat.ML and q-fin, and Import AI / The Batch (deeplearning.ai) for weekly orientation.
  • YOUTUBE TO LIVE IN — 3Blue1Brown (intuition), Steve Brunton / 'Eigensteve' (data-driven math), MIT OCW (Strang, Tsitsiklis, Gallager), Mathologer (deep dives), and Numberphile (rabbit-hole fuel). For the proof-writing and Putnam mindset, Michael Penn grinds olympiad/Putnam problems almost daily, and blackpenredpen is great for hard integrals/calculus.
  • COMMUNITIES — r/math and r/learnmath for general math; r/quant and r/quantfinance for the trading side; QuantNet (quantnet.com) is THE forum for aspiring quants with program reviews and interview threads. The Art of Problem Solving community (artofproblemsolving.com/community) is where competition-math culture lives — invaluable for Putnam prep. Math Stack Exchange (math.stackexchange.com) and Cross Validated (stats.stackexchange.com) for getting unstuck with rigor.
  • COMPETITIONS & RECURRING CHALLENGES — Jane Street puzzles (monthly) and their Estimathon / electronic-trading games; the Putnam (first Saturday of December, register through your nearest university); Project Euler (self-paced, lifelong); QuantGuide.io and Brainstellar for tagged interview problems; Kaggle for applied ML. Many firms (Jane Street, Optiver, IMC, SIG) run public trading-game and puzzle events — watch their sites.
  • CONFERENCES & TALKS — NeurIPS, ICML, and ICLR (watch the free recorded talks and tutorials even without attending) for ML; the Joint Mathematics Meetings for pure-math culture. For quant: QuantMinds International and the Quant Conference, plus practitioner talks. Most post recordings online — the tutorial sessions are gold for seeing math applied at the frontier.
  • FRONTIER RABBIT HOLES (if this clicks, go deeper here) — Optimal transport and Wasserstein distances (the math behind Wasserstein GANs and flow-based generative models, with close ties to diffusion models); information geometry (treating distributions as a curved manifold, the natural gradient); the neural tangent kernel and why overparameterized nets generalize; rough-path theory and stochastic calculus for high-frequency finance; random matrix theory (the spectral statistics that govern both deep nets and portfolio covariance — Marchenko–Pastur, Tracy–Widom). Each is a multi-month obsession that pays off in both ML and quant.
  • BOOKS FOR THE LONG GAME — after this path: 'The Elements of Statistical Learning' (Hastie/Tibshirani/Friedman, free PDF at hastie.su.domains/ElemStatLearn) for statistical ML; 'Pattern Recognition and Machine Learning' (Bishop, free PDF from Microsoft Research) for the Bayesian view; 'Stochastic Calculus for Finance I & II' (Shreve) for the quant track; 'Probability: Theory and Examples' (Durrett, free) for full measure-theoretic rigor; 'Numerical Linear Algebra' (Trefethen & Bau) and 'Convex Optimization' (Boyd & Vandenberghe, free) to become genuinely dangerous with matrices and optimizers.
  • WHAT TO DO AFTER — fork the path: for ML, go into deep-learning theory, Gaussian processes, and optimization; for quant, into stochastic calculus, derivatives pricing, and time-series. Either way, the move that compounds is teaching: write up your derivations as blog posts. Explaining the SVD, the optional stopping theorem, or the Black-Scholes derivation to strangers is the fastest way to discover the gaps in your own understanding — and to build a public track record that opens doors.

How you'll know you've actually got it
  • You can derive backpropagation for an arbitrary feedforward network on a whiteboard, from the multivariable chain rule, with no reference — and explain why reverse-mode autodiff is the efficient way to do it.
  • Given any matrix, you can immediately describe its four fundamental subspaces, sketch what its SVD does geometrically, and explain why the truncated SVD is the optimal low-rank approximation (the Eckart–Young theorem) without looking it up.
  • You read a paper's equations — a posterior update, a loss derivation, a pricing formula — and can re-derive and implement them yourself, then validate your code against the paper's numbers.
  • You solve probability brain teasers (Mosteller-tier, Jane-Street-tier) by recognizing the right framing — indicator variables, conditioning, a martingale, a symmetry argument — within minutes rather than grinding casework.
  • You can set up and solve a constrained optimization with Lagrange multipliers / KKT conditions, and prove whether a given function is convex, as a reflex.
  • You instinctively reach for the right distribution given a 'story' (this is Poisson because…, this is a Beta because it's the conjugate prior for…) and can prove its expectation and variance from scratch.
  • You've solved at least one full Jane Street puzzle end-to-end and a handful of Putnam problems unaided — and you can write up your solution as a clean, rigorous proof, not just an answer.
  • You can model a real process as a Markov chain or martingale, compute its stationary distribution or expected hitting time, and apply the optional stopping theorem correctly.
  • Your from-scratch implementations (PCA, autograd, Bayesian sampler, MCMC, Markov-chain simulator) match the canonical libraries to numerical tolerance — proving you understand the math beneath the API, not just the API.
  • You can teach any of these topics to another engineer clearly, and you catch your own errors because you understand why each step holds, not just that it produces the expected number.
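
One way to test the Markov-chain and optional-stopping items above against yourself is gambler's ruin. For a fair game the optional stopping theorem gives a win probability of i/N from starting capital i, and an expected duration of i(N - i); the sketch below recovers both by solving the first-step (hitting) equations as a linear system. The function name and the setup are illustrative assumptions.

```python
import numpy as np

def gamblers_ruin(n_target, p=0.5):
    """Win probabilities and expected durations for capital 0..n_target.

    A gambler wins 1 with probability p and loses 1 otherwise, stopping at
    0 (ruin) or n_target (win). Both quantities satisfy first-step equations:
        h(i) = p h(i+1) + (1-p) h(i-1),      h(0) = 0, h(N) = 1
        t(i) = 1 + p t(i+1) + (1-p) t(i-1),  t(0) = t(N) = 0
    """
    n = n_target
    A = np.zeros((n + 1, n + 1))
    rhs_win = np.zeros(n + 1)
    rhs_time = np.zeros(n + 1)
    # Boundary rows pin down the values at the absorbing states.
    A[0, 0] = 1.0
    A[n, n] = 1.0
    rhs_win[n] = 1.0
    for i in range(1, n):
        A[i, i] = 1.0
        A[i, i + 1] = -p
        A[i, i - 1] = -(1.0 - p)
        rhs_time[i] = 1.0  # each interior step costs one unit of time
    win = np.linalg.solve(A, rhs_win)
    duration = np.linalg.solve(A, rhs_time)
    return win, duration

if __name__ == "__main__":
    win, duration = gamblers_ruin(10, p=0.5)
    print("P(win | start at 3) =", win[3])
    print("E[duration | start at 3] =", duration[3])
```

Being able to predict these two numbers on paper before running the code, and to say which martingale produces each, is a reasonable self-check for the hitting-time item.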
