Mathematics for ML and Quant

Build the rigorous mathematical bedrock — linear algebra, optimization, probability, and stochastic processes — shared by machine learning and quantitative finance, with proof-level fluency and the ability to crack Jane Street / Putnam-grade problems.

Every model you'll ever build in ML or quant is math wearing an API.

The roadmap

Stage 1 — Linear Algebra: The Geometry of Data · 8–10 weeks
Stop seeing matrices as number grids and start seeing them as linear transformations. Develop visual and algebraic fluency with vector spaces, bases, eigendecomposition, and SVD — the single most-used math in ML.

Concepts, resources and problems

Concepts — Vectors, vector spaces, span, linear independence, basis, dimension · Linear transformations as the meaning of a matrix; the matrix as basis-dependent coordinates · Matrix multiplication as composition; the four fundamental subspaces (column, row, null, left-null) · Systems of equations, rank, Gaussian elimination, LU decomposition · Determinant as signed volume scaling; geometric intuition over cofactor formulas · Dot products, projections, orthogonality, Gram-Schmidt, QR decomposition · Eigenvalues and eigenvectors: invariant directions; diagonalization; what it means geometrically · Symmetric matrices and the spectral theorem; positive (semi)definiteness · Singular Value Decomposition: the crown jewel — every matrix as rotate-stretch-rotate; low-rank approximation and the Eckart-Young theorem · Change of basis; why PCA is just eigendecomposition of a covariance matrix
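
For reference, the SVD and Eckart-Young items above can be stated compactly. This is the standard textbook statement, written with notation chosen here ($A$, $\sigma_i$, $u_i$, $v_i$, $A_k$), not a substitute for the proof the stage asks you to produce.

```latex
Let $A \in \mathbb{R}^{m \times n}$ have rank $r$ and singular value decomposition
\[
A = U \Sigma V^{\top} = \sum_{i=1}^{r} \sigma_i \, u_i v_i^{\top},
\qquad \sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_r > 0 .
\]
For $k < r$ define the truncated SVD $A_k = \sum_{i=1}^{k} \sigma_i \, u_i v_i^{\top}$. Then
\[
\min_{\operatorname{rank}(B) \le k} \lVert A - B \rVert_2 = \lVert A - A_k \rVert_2 = \sigma_{k+1},
\qquad
\min_{\operatorname{rank}(B) \le k} \lVert A - B \rVert_F = \lVert A - A_k \rVert_F
  = \Bigl( \sum_{i=k+1}^{r} \sigma_i^2 \Bigr)^{1/2} .
\]
```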

Read — Mathematics for Machine Learning — Ch. 2 (Linear Algebra), Ch. 3 (Analytic Geometry), Ch. 4 (Matrix Decompositions) · Introduction to Linear Algebra (6th ed.) — Gilbert Strang (companion to MIT 18.06) · MIT 18.06SC Linear Algebra — full course materials, psets, exams, solutions · immersivemath — Immersive Linear Algebra (interactive figures) · Linear Algebra Done Right (4th ed., free) — Sheldon Axler

Watch — 3Blue1Brown — Essence of Linear Algebra (full playlist) · MIT 18.06 — Gilbert Strang lecture series · Steve Brunton — Singular Value Decomposition series

Problems

medium MIT 18.06SC problem sets + exams (with full solutions) — Do every pset. They force you to compute by hand until the operations become muscle memory, then ask 'why' in the proof problems. The exams are a real check on whether you've internalized the four subspaces.
medium MML Book Ch. 2–4 exercises — More proof-flavored than Strang. Exercises on rank, null spaces, and the SVD construction make you prove the structural facts, not just compute.
hard Prove the Eckart–Young theorem AND the spectral theorem from scratch (unaided) — Genuinely hard: prove that the truncated SVD is the best rank-k approximation in both Frobenius and spectral norm, and that real symmetric matrices have orthonormal eigenbases. If you can do these with no reference, you own this stage.
brutal Putnam linear-algebra problems (rank, trace, nilpotents, characteristic polynomials) — Putnam linear-algebra problems (e.g. clever uses of rank inequalities, trace/eigenvalue identities, and the Cayley–Hamilton theorem) are where competition-grade matrix reasoning lives. Filter the archive for the algebra-flavored entries; solutions are included.
hard Project Euler #101 (Optimum Polynomial), #155 (Capacitor circuits), and exact-arithmetic linear-algebra entries — Forces you to make linear algebra computational and exact (rational arithmetic, no floating-point slop). The 'find the pattern, then prove it' loop is exactly the skill ML/quant work demands.
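
One way to make linear algebra exact, as the last entry above asks, is to run elimination over Python's `fractions.Fraction`. The sketch below is only an illustration: `solve_exact` is a hypothetical helper name, and it assumes a square, nonsingular system. The small demo fits a quadratic through three points of the cubic n^3, in the spirit of the Optimum Polynomial problem.

```python
from fractions import Fraction

def solve_exact(A, b):
    """Solve A x = b exactly over the rationals by Gauss-Jordan elimination.

    Assumes A is square and nonsingular; raises ValueError otherwise.
    """
    n = len(A)
    # Augmented matrix [A | b] with exact rational entries.
    M = [[Fraction(v) for v in row] + [Fraction(rhs)] for row, rhs in zip(A, b)]
    for col in range(n):
        # Find a row at or below the diagonal with a nonzero pivot.
        pivot = next((r for r in range(col, n) if M[r][col] != 0), None)
        if pivot is None:
            raise ValueError("matrix is singular")
        M[col], M[pivot] = M[pivot], M[col]
        # Scale the pivot row so the pivot becomes 1.
        inv = 1 / M[col][col]
        M[col] = [v * inv for v in M[col]]
        # Eliminate the pivot column from every other row.
        for r in range(n):
            if r != col and M[r][col] != 0:
                factor = M[r][col]
                M[r] = [vr - factor * vc for vr, vc in zip(M[r], M[col])]
    return [row[-1] for row in M]

# Fit c0 + c1*x + c2*x^2 through (1, 1), (2, 8), (3, 27) via a Vandermonde system.
xs, ys = [1, 2, 3], [1, 8, 27]
V = [[x**k for k in range(3)] for x in xs]
coeffs = solve_exact(V, ys)   # exact rationals, no rounding anywhere
```

Because every entry is a `Fraction`, the result can be compared with `==` rather than a tolerance, which is the point of the exercise.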

Done when — Implement PCA and SVD from scratch in NumPy (no np.linalg.svd / no sklearn — use power iteration or the QR algorithm), use them to compress a real image and reduce a real dataset, AND derive on paper why the top-k singular vectors give the optimal low-rank approximation. You can explain — to a whiteboard — what each of the four fundamental subspaces is for any given matrix.
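
A minimal starting point for the PCA half of this milestone might look like the following, assuming power iteration with deflation on the covariance matrix. The function names (`top_eigenpairs`, `pca`) and the iteration count are choices made for this sketch; it does not cover the QR algorithm or the full SVD, and it relies on the leading eigenvalues being well separated.

```python
import numpy as np

def top_eigenpairs(S, k, iters=500, seed=0):
    """Top-k eigenpairs of a symmetric PSD matrix via power iteration + deflation."""
    rng = np.random.default_rng(seed)
    S = np.array(S, dtype=float)          # work on a copy
    values, vectors = [], []
    for _ in range(k):
        v = rng.standard_normal(S.shape[0])
        v /= np.linalg.norm(v)
        for _ in range(iters):
            w = S @ v
            norm = np.linalg.norm(w)
            if norm == 0.0:               # nothing left in the spectrum
                break
            v = w / norm
        lam = v @ S @ v                   # Rayleigh quotient estimate of the eigenvalue
        values.append(lam)
        vectors.append(v)
        S -= lam * np.outer(v, v)         # deflate: remove the component just found
    return np.array(values), np.column_stack(vectors)

def pca(X, k):
    """Project rows of X onto the top-k principal components."""
    Xc = X - X.mean(axis=0)               # center each feature
    cov = Xc.T @ Xc / (len(X) - 1)        # sample covariance matrix
    values, components = top_eigenpairs(cov, k)
    return Xc @ components, components, values
```

Checking the returned eigenvalues against `np.linalg.eigvalsh` on the same covariance matrix is a reasonable way to do the "validate against the library" step without using the library inside the implementation.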

Stage 2 — Calculus, Multivariable & Optimization: How Models Learn · 8–10 weeks
Master gradients, the chain rule in many dimensions, and the optimization machinery (convexity, Lagrange multipliers, gradient descent) that is the learning step in every ML model.

Concepts, resources and problems

Concepts — Limits, derivatives, the derivative as best linear approximation; Taylor series · Multivariable calculus: partial derivatives, the gradient, directional derivatives · The Jacobian and Hessian; what they tell you about a surface · The multivariable chain rule: the mathematical heart of backpropagation · Gradient descent, learning rates, momentum; why we follow the negative gradient · Convex sets and convex functions; why convexity makes every local minimum a global one · Constrained optimization: Lagrange multipliers and the geometry of tangent constraints · KKT conditions and Lagrangian duality (a first taste) · Vector and matrix calculus: differentiating scalar-by-vector and vector-by-vector (the 'matrix cookbook' skills) · Automatic differentiation: forward vs reverse mode, and why reverse mode = backprop
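
As a small worked instance of the Lagrange-multiplier item above, here is the maximum-entropy distribution on $n$ outcomes with only the normalization constraint. The symbols ($p_i$, $\lambda$, $H$) are chosen for this sketch.

```latex
Maximize $H(p) = -\sum_{i=1}^{n} p_i \log p_i$ subject to $\sum_{i=1}^{n} p_i = 1$.
Form the Lagrangian
\[
\mathcal{L}(p, \lambda) = -\sum_{i=1}^{n} p_i \log p_i + \lambda \Bigl( \sum_{i=1}^{n} p_i - 1 \Bigr).
\]
Stationarity in each coordinate gives
\[
\frac{\partial \mathcal{L}}{\partial p_i} = -\log p_i - 1 + \lambda = 0
\quad\Longrightarrow\quad p_i = e^{\lambda - 1},
\]
so all $p_i$ are equal, and the constraint forces $p_i = 1/n$.
Because $H$ is concave, this stationary point is the global maximum, with $H = \log n$.
```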

Read — Mathematics for Machine Learning — Ch. 5 (Vector Calculus), Ch. 7 (Continuous Optimization) · Convex Optimization — Boyd & Vandenberghe (free PDF, Stanford) · The Matrix Cookbook — Petersen & Pedersen (official DTU PDF) · Khan Academy — Multivariable Calculus (free)

Watch — 3Blue1Brown — Essence of Calculus (full playlist) · 3Blue1Brown — Neural Networks (backpropagation episodes) · Stanford EE364A — Convex Optimization I (Stephen Boyd, full lectures)

Problems

hard Derive backpropagation for a 2-layer MLP entirely by hand, then verify with finite differences — The rite of passage. Hand-derive every gradient via the multivariable chain rule, code it, and check each gradient numerically. If your analytic gradients agree with central-difference estimates to a relative error of roughly 1e-7 in double precision, you understand backprop to the bone.
hard Boyd & Vandenberghe end-of-chapter exercises + EE364A additional problems (Ch. 2–5) — Proof-heavy: prove a function is convex, derive a dual, characterize optimality with KKT. The EE364A course site posts additional exercises and solutions. These are genuinely demanding.
brutal Lagrange-multiplier & inequality brain teasers (max-entropy with constraints, largest inscribed box, AM-GM/Jensen problems) — Putnam analysis problems force elegant use of Lagrange multipliers, AM-GM, Cauchy-Schwarz, and convexity inequalities — the exact toolkit you'll reuse in ML loss design and quant. The inequality problems in particular are pure 'find the slick argument' training.
brutal Project Euler analysis/optimization & numeric-precision problems (#307, #587, geometry-meets-calculus tier) — These need real analytical setup before any code — you must do the calculus to know what to compute. Brutal but exactly the 'math then code' loop that defines this path.
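
The finite-difference check from the first problem above can be sketched in plain Python. Everything specific here is an assumption made for illustration: a tanh hidden layer, a scalar linear output, squared-error loss, the sizes `D` and `H`, and the step `eps`. Your own derivation may use different activations and losses.

```python
import math
import random

D, H = 3, 4  # input and hidden sizes (arbitrary choices for this sketch)

def unpack(theta):
    """Split the flat parameter list into W1 (H x D), b1 (H), w2 (H), b2."""
    W1 = [theta[i * D:(i + 1) * D] for i in range(H)]
    b1 = theta[H * D:H * D + H]
    w2 = theta[H * D + H:H * D + 2 * H]
    return W1, b1, w2, theta[-1]

def forward(theta, x):
    W1, b1, w2, b2 = unpack(theta)
    h = [math.tanh(sum(w * xj for w, xj in zip(row, x)) + b)
         for row, b in zip(W1, b1)]
    y = sum(w * hi for w, hi in zip(w2, h)) + b2
    return h, y

def loss(theta, x, t):
    _, y = forward(theta, x)
    return 0.5 * (y - t) ** 2

def analytic_grad(theta, x, t):
    """Hand-derived gradient, in the same flat order as theta."""
    _, _, w2, _ = unpack(theta)
    h, y = forward(theta, x)
    dy = y - t                                               # dL/dy
    dz = [dy * w * (1.0 - hi * hi) for w, hi in zip(w2, h)]  # back through tanh
    dW1 = [dzi * xj for dzi in dz for xj in x]               # dL/dW1[i][j] = dz[i] * x[j]
    dw2 = [dy * hi for hi in h]                              # dL/dw2[i] = dy * h[i]
    return dW1 + dz + dw2 + [dy]                             # b1 gradient is dz, b2 is dy

def numeric_grad(theta, x, t, eps=1e-5):
    """Central-difference estimate of each partial derivative."""
    grad = []
    for i in range(len(theta)):
        up, down = theta[:], theta[:]
        up[i] += eps
        down[i] -= eps
        grad.append((loss(up, x, t) - loss(down, x, t)) / (2.0 * eps))
    return grad

rng = random.Random(0)
theta = [rng.uniform(-1.0, 1.0) for _ in range(H * D + 2 * H + 1)]
x, t = [0.5, -1.2, 2.0], 0.7
max_abs_err = max(abs(a - n) for a, n in
                  zip(analytic_grad(theta, x, t), numeric_grad(theta, x, t)))
```

A deliberately wrong sign or a dropped `1 - h^2` factor in `analytic_grad` should make `max_abs_err` jump by many orders of magnitude, which is a useful sanity test of the checker itself.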

Done when — Build a tiny autograd engine from scratch (à la micrograd): a scalar Value class that records operations and backpropagates gradients through a small neural net you then train on real data. Separately, implement gradient descent AND Newton's method on a convex function and visualize their convergence. You can prove a given function is convex and set up a Lagrangian for a constrained problem without reference.
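
For orientation only, the core of such a scalar engine can be sketched as below. This is not micrograd's actual code: the attribute names (`_parents`, `_backward`) and the choice of supporting only `+`, `*`, and `tanh` are assumptions made for this illustration, and a usable engine would need more operations.

```python
import math

class Value:
    """Scalar that records how it was computed, for reverse-mode autodiff."""

    def __init__(self, data, parents=()):
        self.data = float(data)
        self.grad = 0.0
        self._parents = parents
        self._backward = lambda: None   # pushes this node's grad to its parents

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data + other.data, (self, other))
        def _backward():
            self.grad += out.grad
            other.grad += out.grad
        out._backward = _backward
        return out

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data * other.data, (self, other))
        def _backward():
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward
        return out

    __radd__ = __add__
    __rmul__ = __mul__

    def tanh(self):
        t = math.tanh(self.data)
        out = Value(t, (self,))
        def _backward():
            self.grad += (1.0 - t * t) * out.grad
        out._backward = _backward
        return out

    def backward(self):
        # Topological order guarantees a node's grad is complete before it is used.
        order, seen = [], set()
        def visit(node):
            if node not in seen:
                seen.add(node)
                for parent in node._parents:
                    visit(parent)
                order.append(node)
        visit(self)
        self.grad = 1.0                 # d(self)/d(self)
        for node in reversed(order):
            node._backward()

# One neuron: y = tanh(w * x + b)
x, w, b = Value(0.5), Value(-2.0), Value(1.0)
y = (w * x + b).tanh()
y.backward()
```

Gradients accumulate with `+=` so that a value used more than once (for example `a * a + a`) receives the sum of the contributions from every path, which is the multivariable chain rule in action.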

Stage 3 — Probability: Reasoning Under Uncertainty · 10–12 weeks
Develop deep probabilistic intuition — the kind that cracks brain teasers instantly. Master combinatorics, random variables, expectation, the key distributions, conditioning, Bayes, and concentration inequalities.

Concepts, resources and problems

Concepts — Counting and combinatorics: the multiplication rule, permutations, combinations, stars-and-bars, inclusion-exclusion · Probability axioms, sample spaces, events; the naive vs general definition · Conditional probability, independence, the law of total probability, Bayes' rule (and Bayesian thinking) · Random variables: PMFs, PDFs, CDFs; expectation and variance, including linearity of expectation and how variance behaves under scaling and sums · Key distributions and when each arises: Bernoulli, Binomial, Geometric, Poisson, Uniform, Exponential, Normal, Beta, Gamma · Joint, marginal, conditional distributions; covariance and correlation · Expectation via indicator variables and the 'story' proofs (the Stat 110 superpower) · Conditional expectation E[X|Y] as a random variable; the tower property / law of iterated expectation · Moment generating functions; sums of random variables; convolution · Concentration & limit theorems: Markov, Chebyshev, the Law of Large Numbers, the Central Limit Theorem, a first look at Chernoff/Hoeffding bounds · Gambler's ruin, the matching/birthday/Monty Hall classics, and why they trip up intuition

Read — Introduction to Probability — Blitzstein & Hwang (free full PDF) · Stat 110 — Strategic Practice & Homework Problems (with solutions) · Mathematics for Machine Learning — Ch. 6 (Probability & Distributions) · Fifty Challenging Problems in Probability — Mosteller (Dover)

Watch — Harvard Stat 110 — Joe Blitzstein full lecture series · 3Blue1Brown — Probability series (Bayes, binomial, Central Limit Theorem) · MIT 6.041SC — Probabilistic Systems Analysis (Tsitsiklis)

Problems

hard Stat 110 Strategic Practice — all sets (conditioning, expectation via indicators, MGFs, conditional expectation) — The core grind of this stage. Many problems are genuinely hard and reward the 'story proof' technique. Do them without peeking; the solutions teach a style of thinking, not just answers.
brutal All 50 of Mosteller's Fifty Challenging Problems — Each is a compact brain teaser at exactly Jane-Street-interview difficulty. The 'Prisoner's Dilemma', 'Twin Knights', and ballot-problem entries are legendary. Solving all 50 unaided is a concrete, brutal milestone.
brutal Putnam probability/combinatorics problems (expectation, pigeonhole, generating functions, symmetry) — Putnam problems that hinge on a clever expectation argument or symmetry are the purest form of 'aha' probability. Brutal, but the payoff in intuition is enormous. Filter for the combinatorics/probability-flavored ones.
hard Project Euler probability-heavy problems (#84 Monopoly, #121, #213, #227) — Forces you to set up Markov-chain / expected-value computations correctly AND compute them. #84 (Monopoly odds) is a perfect bridge into Stage 4's Markov chains.
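
The indicator-variable technique that the Strategic Practice sets reward can be seen in one short derivation, the expected number of fixed points of a random permutation. The notation ($\pi$, $I_j$, $X$) is chosen for this sketch.

```latex
Let $\pi$ be a uniformly random permutation of $\{1, \dots, n\}$ and let
$I_j = \mathbf{1}\{\pi(j) = j\}$, so that $X = \sum_{j=1}^{n} I_j$ counts the fixed points.
By linearity of expectation, which requires no independence between the $I_j$,
\[
\mathbb{E}[X] = \sum_{j=1}^{n} \mathbb{E}[I_j]
             = \sum_{j=1}^{n} \Pr\bigl(\pi(j) = j\bigr)
             = n \cdot \frac{1}{n} = 1 .
\]
The answer does not depend on $n$, even though the $I_j$ are dependent.
```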

Done when — Build a small Bayesian inference engine from scratch: implement conjugate-prior updating (Beta-Binomial and Normal-Normal) and a basic Metropolis-Hastings sampler, then use them to recover the parameters of a distribution you generated. Separately: solve at least 40 of Mosteller's 50 unaided, and write up three 'story proofs' (e.g. derive E[X] for the geometric and the expected number of fixed points of a random permutation) in your own words.
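
The two ingredients of this milestone can be sketched in a few lines of standard-library Python. The function names, the random-walk Gaussian proposal, the step size, and the counts (60 successes, 140 failures) are all assumptions made for the illustration. Running the sampler on a Beta posterior is only a self-check, since conjugacy already gives that posterior in closed form.

```python
import math
import random

def beta_binomial_update(alpha, beta, successes, failures):
    """Conjugate update: Beta(alpha, beta) prior + binomial data -> Beta posterior."""
    return alpha + successes, beta + failures

def log_beta_unnormalized(a, b):
    """Log-density of Beta(a, b) up to an additive constant."""
    def log_target(p):
        if not 0.0 < p < 1.0:
            return float("-inf")          # zero density outside (0, 1)
        return (a - 1) * math.log(p) + (b - 1) * math.log(1.0 - p)
    return log_target

def metropolis_hastings(log_target, x0, n_samples, step=0.1, seed=0):
    """Random-walk Metropolis with a symmetric Gaussian proposal.

    Assumes x0 lies inside the support of the target.
    """
    rng = random.Random(seed)
    x, logp = x0, log_target(x0)
    samples = []
    for _ in range(n_samples):
        proposal = x + rng.gauss(0.0, step)
        logp_new = log_target(proposal)
        # Accept with probability min(1, target(proposal) / target(x)).
        if rng.random() < math.exp(min(0.0, logp_new - logp)):
            x, logp = proposal, logp_new
        samples.append(x)
    return samples

# Uniform Beta(1, 1) prior, then 60 successes and 140 failures observed.
a, b = beta_binomial_update(1, 1, 60, 140)
draws = metropolis_hastings(log_beta_unnormalized(a, b), 0.5, 20_000, seed=1)[2_000:]
mcmc_mean = sum(draws) / len(draws)       # discards the first 2,000 draws as burn-in
exact_mean = a / (a + b)                  # closed-form posterior mean
```

Comparing `mcmc_mean` with `exact_mean` on a conjugate model is a cheap correctness test before pointing the sampler at a target with no closed form.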

Stage 4 — Statistics & Stochastic Processes: From Data to Dynamics · 12–14 weeks
Move from static probability to inference and processes that evolve in time: estimation, hypothesis testing, Markov chains, martingales, and a first taste of measure-theoretic rigor — the exact toolkit of quant modeling and statistical ML.

Concepts, resources and problems

Concepts — Estimation: method of moments, maximum likelihood (MLE), bias/variance, consistency, efficiency · The Cramér–Rao bound and Fisher information; what 'best estimator' means · Confidence intervals and the bootstrap; the sampling distribution · Hypothesis testing: null/alternative, p-values, type I/II error, power, the Neyman–Pearson lemma · Bayesian vs frequentist inference; priors, posteriors, credible vs confidence intervals · Markov chains: transition matrices, stationary distributions, ergodicity, mixing; PageRank as an eigenvector · Poisson processes and continuous-time Markov chains · Martingales: the defining property, the optional stopping theorem, and why they model fair games / no-arbitrage · Random walks, gambler's ruin revisited, hitting times · Brownian motion as the limit of random walks; a first look at Itô calculus and the role of stochastic calculus in pricing · A taste of measure-theoretic probability: why we need σ-algebras, the difference between 'almost surely' and 'surely', and what a measurable function is

Read — Mathematics for Machine Learning — MLE/MAP & statistics sections (Ch. 6, 8–9) · MIT 6.262 — Discrete Stochastic Processes (Gallager, full materials) · All of Statistics — Larry Wasserman (free CMU PDF) · Probability: Theory and Examples — Rick Durrett (free PDF, Duke) · Stochastic Calculus for Finance II: Continuous-Time Models — Steven Shreve

Watch — MIT 6.262 — Discrete Stochastic Processes lectures (Robert Gallager) · Steve Brunton — Eigenvalues/eigenvectors & PageRank (Markov-chain connection) · Ben Lambert — Maximum Likelihood & hypothesis-testing econometrics playlist

Problems

brutal MIT 6.262 problem sets (Markov chains, Poisson processes, martingales) — Graduate-level and genuinely hard. Computing hitting times, proving recurrence, and applying optional stopping to martingales are exactly the problems quant interviews escalate to.
brutal Optional Stopping Theorem applications — gambler's ruin, ABRACADABRA, ballot problems — The 'expected time until the monkey types ABRACADABRA' problem solved via martingales is a thing of beauty and a famous interview filter. Master a handful of these and you can solve a huge class of 'expected number of steps' teasers instantly.
hard Derive the Neyman–Pearson lemma and the Cramér–Rao bound from scratch — Proof projects that force you to actually understand what 'most powerful test' and 'minimum-variance unbiased estimator' mean, rather than reciting them. Core statistical rigor.
brutal Jane Street monthly puzzles + Project Euler Markov problems (#84, #227) — Many Jane Street puzzles reduce to a cleverly-set-up Markov chain or expected-value/martingale argument. This is where everything in the path converges on the exact problem style you asked to be challenged by. Start with older months (solutions posted).
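
As one example of the optional-stopping pattern named above, here is the fair gambler's ruin solved with two martingales. The notation ($S_n$, $T$, $k$, $N$) is chosen for this sketch, and the justification that optional stopping applies is only stated, not proved.

```latex
Let $S_n$ be a simple symmetric random walk with $S_0 = k$, $0 < k < N$, and let
$T = \min\{ n : S_n \in \{0, N\} \}$. The increments are bounded and $\mathbb{E}[T] < \infty$,
so the optional stopping theorem applies to the martingales $S_n$ and $S_n^2 - n$.

From $S_n$, writing $p = \Pr(S_T = N)$:
\[
k = \mathbb{E}[S_0] = \mathbb{E}[S_T] = N p
\quad\Longrightarrow\quad p = \frac{k}{N} .
\]
From $S_n^2 - n$:
\[
k^2 = \mathbb{E}[S_T^2] - \mathbb{E}[T] = N^2 \cdot \frac{k}{N} - \mathbb{E}[T]
\quad\Longrightarrow\quad \mathbb{E}[T] = k (N - k) .
\]
```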

Done when — Build a Markov-chain Monte Carlo simulator AND a Markov-chain text generator from scratch: train transition probabilities on a real corpus to generate text, and use MCMC (Metropolis-Hastings) to estimate a quantity (e.g. a Bayesian posterior mean; estimating π needs only plain Monte Carlo). Then, for a small chain, verify that your simulated stationary distribution matches the analytic one, i.e. the dominant left eigenvector of the transition matrix (eigenvalue 1). Separately: implement MLE for a model from scratch and confirm it against the analytic estimator, and solve at least one full Jane Street puzzle end-to-end.
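
The stationary-distribution check in this milestone can be sketched on a toy chain. The 3-state transition matrix `P` below is invented for the illustration, as are the function names. Power iteration on the row vector stands in for a full eigen-solver, and it converges here because this chain is irreducible and aperiodic.

```python
import random

# Hypothetical 3-state chain; each row sums to 1.
P = [[0.9, 0.1, 0.0],
     [0.2, 0.7, 0.1],
     [0.1, 0.3, 0.6]]

def stationary_by_power_iteration(P, iters=1000):
    """Iterate pi <- pi P; converges to the left eigenvector for eigenvalue 1."""
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(iters):
        pi = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
        total = sum(pi)
        pi = [p / total for p in pi]      # renormalize to guard against rounding drift
    return pi

def stationary_by_simulation(P, steps=200_000, seed=0):
    """Run the chain and report the fraction of time spent in each state."""
    rng = random.Random(seed)
    n = len(P)
    counts = [0] * n
    state = 0
    for _ in range(steps):
        state = rng.choices(range(n), weights=P[state])[0]
        counts[state] += 1
    return [c / steps for c in counts]

pi_power = stationary_by_power_iteration(P)
pi_sim = stationary_by_simulation(P)
```

For this particular matrix, solving pi P = pi by hand gives (9/14, 4/14, 1/14), so both estimates can be checked against an exact answer as well as against each other.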

Projects

PCA / SVD from scratch (the 'see the math' project) — Implement PCA and the SVD from first principles in NumPy — no np.linalg.svd, no sklearn. Compute the eigendecomposition of the covariance matrix yourself (power iteration or the QR algorithm), then use it to (a) compress an image to k singular values and plot reconstruction error vs k, and (b) reduce a real dataset (e.g. MNIST or Iris) to 2D and visualize it. Validate your output against the library functions to within numerical tolerance.
micrograd-style autograd engine + tiny neural net — Build a scalar-valued automatic differentiation engine: a Value class that builds a computation graph and backpropagates gradients via the multivariable chain rule. Then build a small MLP on top of it, train it on a real classification task, and verify every gradient against finite differences. Use Karpathy's micrograd as a reference only after you've attempted it yourself.
Bayesian inference engine — Build a small library for Bayesian inference from scratch: conjugate-prior updates (Beta-Binomial, Normal-Normal, Gamma-Poisson) with closed-form posteriors, PLUS a general-purpose Metropolis-Hastings sampler for when there's no conjugacy. Demonstrate it by recovering parameters of distributions you generated, and by running a Bayesian A/B test on simulated data. Visualize how the posterior tightens as data arrives.
Markov-chain & Monte-Carlo simulator suite — A combined project tying Stages 3+4 together: (1) a Markov-chain text generator trained on a real corpus; (2) a Monte-Carlo engine that estimates hard-to-compute quantities (π, the value of a probability brain teaser, a Project Euler #84-style game); (3) a Markov-chain analyzer that computes stationary distributions both by simulation and by finding the dominant eigenvector, proving they agree. Reproduce the Monopoly-odds and gambler's-ruin classics numerically.
FLAGSHIP: Reproduce a foundational paper's math end-to-end — Pick one mathematically rich, reproducible paper and rebuild it from the equations up, using only the tools you've built and the math you've internalized. Strong candidates: the original PageRank paper (linear algebra + Markov chains), the Eckart–Young / latent-semantic-analysis low-rank theory, a Gaussian-process regression from scratch (linear algebra + probability + optimization), or a Black-Scholes derivation with Monte-Carlo validation. Write a clean technical report deriving every equation and showing your implementation matches the paper's results.
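
For the Black-Scholes candidate in the flagship project, the "Monte-Carlo validation" step might look like the sketch below. The function names and the parameter values (spot 100, strike 100, rate 5%, volatility 20%, one year) are assumptions chosen for illustration. It prices only a European call under geometric Brownian motion and says nothing about the derivation itself.

```python
import math
import random

def norm_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def bs_call(S0, K, r, sigma, T):
    """Black-Scholes closed-form price of a European call."""
    d1 = (math.log(S0 / K) + (r + 0.5 * sigma**2) * T) / (sigma * math.sqrt(T))
    d2 = d1 - sigma * math.sqrt(T)
    return S0 * norm_cdf(d1) - K * math.exp(-r * T) * norm_cdf(d2)

def mc_call(S0, K, r, sigma, T, n_paths=200_000, seed=0):
    """Monte-Carlo price: simulate terminal prices under the risk-neutral measure."""
    rng = random.Random(seed)
    drift = (r - 0.5 * sigma**2) * T
    vol = sigma * math.sqrt(T)
    total = 0.0
    for _ in range(n_paths):
        ST = S0 * math.exp(drift + vol * rng.gauss(0.0, 1.0))  # terminal price
        total += max(ST - K, 0.0)                              # call payoff
    return math.exp(-r * T) * total / n_paths                  # discounted average

exact = bs_call(100.0, 100.0, 0.05, 0.2, 1.0)
estimate = mc_call(100.0, 100.0, 0.05, 0.2, 1.0)
```

The Monte-Carlo standard error shrinks like one over the square root of `n_paths`, so the two numbers should agree to within a few cents at this sample size, not to machine precision.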

Going harder

Hard problem arena — 7 brutal problems

brutal Jane Street Monthly Puzzles — The defining 'puzzle energy' you asked for. Monthly problems from a top trading firm that almost always reduce to a clever probability, expected-value, optimization, or Markov-chain argument with a computational twist. The archive holds years of them with solutions. Start with older months (solutions available), then race the current month live before the deadline.
legendary The Putnam Competition Archive (Kedlaya) — The hardest undergraduate math competition in North America. Every year has 12 problems; the median score is famously near zero (often 0–1 out of 120). The probability, combinatorics, linear-algebra, and inequality/optimization problems are directly relevant — and solving even a few unaided is a real badge. Full solutions and commentary by Bhargava, Kedlaya & Ng included.
brutal Project Euler — hard tier (difficulty 70%+) — 900+ problems where the hardest demand combining several advanced ideas AND figuring out what NOT to compute. The 70%+ difficulty problems often need number theory, generating functions, clever DP, or Markov-chain setups, all under a ~1-minute CPU budget. The 'find the pattern, prove it, then compute it efficiently' loop is pure ML/quant training.
hard Fifty Challenging Problems in Probability (Mosteller) — Fifty compact, brutal probability brain teasers that recur in quant interviews to this day. No computer needed — pure insight. Solving all fifty unaided is a classic informal qualification for aspiring quants.
hard Quant interview problem books — the 'Green Book' (Zhou) and 'Heard on the Street' (Crack) — The two canonical quant-interview brain-teaser collections. The Green Book's brain-teaser, probability, and stochastic-process sections (200+ real problems) are the gold standard for Jane Street / Citadel / Optiver / SIG / HRT-style questions. If the goal is to be tested like a quant, these are the proving grounds.
hard Brilliant / 'Quant Guide' interview banks + xkcd-style estimathons — Curated, tagged, difficulty-rated banks of the exact probability / brain-teaser / market-making questions trading firms ask, with a leaderboard. Brainstellar (brainstellar.com) is a free, beloved companion. Grind the 'hard' and 'very hard' tiers under a timer to simulate live interview pressure.
legendary Paper reproductions (the research-grade arena) — The ultimate hard problem isn't a puzzle with a known answer — it's taking a dense paper (a Gaussian-process regression, a diffusion model's math, an optimal-transport result) and reproducing every equation and result from scratch. This is where 'I learned the math' becomes 'I can do the work that produces the math.' The hardest and most rewarding tier.

Keep curious

Blogs, people, communities, rabbit holes

BLOGS & WRITERS — Terence Tao's blog (terrytao.wordpress.com): one of the most celebrated living mathematicians thinking out loud, including his 'career advice' and problem-solving posts. Gregory Gundersen (gregorygundersen.com/blog): exquisitely clear from-scratch derivations of ML math (PCA, Gaussian processes, the reparameterization trick, OLS, Black-Scholes). Cosma Shalizi's notebooks (bactra.org/notebooks): an opinionated, deep map of statistics and stochastic processes. Count Bayesie (countbayesie.com) for probability intuition. Christopher Olah (colah.github.io) for the geometry of deep learning.
NEWSLETTERS & FEEDS — For quant: Wilmott (wilmott.com) for practitioner-level discussion, and the QuantStart blog (quantstart.com) for hands-on derivations. For ML math: the Distill.pub archive (distill.pub) and The Gradient (thegradient.pub). Track new work via arxiv-sanity-lite (arxiv-sanity-lite.com) filtered to stat.ML and q-fin, and read Import AI and The Batch (the deeplearning.ai newsletter) for weekly orientation.
YOUTUBE TO LIVE IN — 3Blue1Brown (intuition), Steve Brunton / 'Eigensteve' (data-driven math), MIT OCW (Strang, Tsitsiklis, Gallager), Mathologer (deep dives), and Numberphile (rabbit-hole fuel). For the proof-writing and Putnam mindset, Michael Penn grinds olympiad/Putnam problems almost daily, and blackpenredpen is great for hard integrals/calculus.
COMMUNITIES — r/math and r/learnmath for general math; r/quant and r/quantfinance for the trading side; QuantNet (quantnet.com) is THE forum for aspiring quants with program reviews and interview threads. The Art of Problem Solving community (artofproblemsolving.com/community) is where competition-math culture lives — invaluable for Putnam prep. Math Stack Exchange (math.stackexchange.com) and Cross Validated (stats.stackexchange.com) for getting unstuck with rigor.
COMPETITIONS & RECURRING CHALLENGES — Jane Street puzzles (monthly) and their Estimathon / electronic-trading games; the Putnam (first Saturday of December; it is open to enrolled undergraduates, so register through your own college or university); Project Euler (self-paced, lifelong); QuantGuide.io and Brainstellar for tagged interview problems; Kaggle for applied ML. Many firms (Jane Street, Optiver, IMC, SIG) run public trading-game and puzzle events; watch their sites.
CONFERENCES & TALKS — NeurIPS, ICML, and ICLR (watch the free recorded talks and tutorials even without attending) for ML; the Joint Mathematics Meetings for pure-math culture. For quant: QuantMinds International and the Quant Conference, plus practitioner talks. Most post recordings online — the tutorial sessions are gold for seeing math applied at the frontier.
FRONTIER RABBIT HOLES (if this clicks, go deeper here) — Optimal transport and Wasserstein distances (the math behind Wasserstein GANs, with close ties to flow-based and diffusion generative models); information geometry (treating distributions as a curved manifold, the natural gradient); the neural tangent kernel and why overparameterized nets generalize; rough-path theory and stochastic calculus for high-frequency finance; random matrix theory (the spectral statistics that govern both deep nets and portfolio covariance: Marchenko–Pastur, Tracy–Widom). Each is a multi-month obsession that pays off in both ML and quant.
BOOKS FOR THE LONG GAME — after this path: 'The Elements of Statistical Learning' (Hastie/Tibshirani/Friedman, free PDF at hastie.su.domains/ElemStatLearn) for statistical ML; 'Pattern Recognition and Machine Learning' (Bishop, free PDF from Microsoft Research) for the Bayesian view; 'Stochastic Calculus for Finance I & II' (Shreve) for the quant track; 'Probability: Theory and Examples' (Durrett, free) for full measure-theoretic rigor; 'Numerical Linear Algebra' (Trefethen & Bau) and 'Convex Optimization' (Boyd & Vandenberghe, free) to become genuinely dangerous with matrices and optimizers.
WHAT TO DO AFTER — fork the path: for ML, go into deep-learning theory, Gaussian processes, and optimization; for quant, into stochastic calculus, derivatives pricing, and time-series. Either way, the move that compounds is teaching: write up your derivations as blog posts. Explaining the SVD, the optional stopping theorem, or the Black-Scholes derivation to strangers is the fastest way to discover the gaps in your own understanding — and to build a public track record that opens doors.

How you'll know you've actually got it

You can derive backpropagation for an arbitrary feedforward network on a whiteboard, from the multivariable chain rule, with no reference — and explain why reverse-mode autodiff is the efficient way to do it.
Given any matrix, you can immediately describe its four fundamental subspaces, sketch what its SVD does geometrically, and explain why the truncated SVD is the optimal low-rank approximation (the Eckart–Young theorem) without looking it up.
You read a paper's equations — a posterior update, a loss derivation, a pricing formula — and can re-derive and implement them yourself, then validate your code against the paper's numbers.
You solve probability brain teasers (Mosteller-tier, Jane-Street-tier) by recognizing the right framing — indicator variables, conditioning, a martingale, a symmetry argument — within minutes rather than grinding casework.
You can set up and solve a constrained optimization with Lagrange multipliers / KKT conditions, and prove whether a given function is convex, as a reflex.
You instinctively reach for the right distribution given a 'story' (this is Poisson because…, this is a Beta because it's the conjugate prior for…) and can derive its expectation and variance from scratch.
You've solved at least one full Jane Street puzzle end-to-end and a handful of Putnam problems unaided — and you can write up your solution as a clean, rigorous proof, not just an answer.
You can model a real process as a Markov chain or martingale, compute its stationary distribution or expected hitting time, and apply the optional stopping theorem correctly.
Your from-scratch implementations (PCA, autograd, Bayesian sampler, MCMC, Markov-chain simulator) match the canonical libraries to numerical tolerance — proving you understand the math beneath the API, not just the API.
You can teach any of these topics to another engineer clearly, and you catch your own errors because you understand why each step holds, not just that it produces the expected number.
