TorchedUp

Distributed Training & Memory Math

Before you can train anything large, you have to know whether it fits. This track combines napkin math (memory budgets, throughput estimates) with the actual algorithms (DDP, ZeRO, FSDP) that make training scale beyond a single GPU. The interview question "can we train a 70B model on 8 H100s?" stops being intimidating once you've worked through these.
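To see why the "70B on 8 H100s" question has a quick napkin answer, here is a minimal sketch of the memory math, assuming mixed-precision Adam (2 bytes fp16 weights + 2 bytes fp16 grads + 12 bytes fp32 optimizer state per parameter) and 80 GB per H100. The byte counts are standard rules of thumb, not exact figures for any particular stack:

```python
# Napkin math: can a 70B-parameter model train on 8x H100 (80 GB each)?
# Assumed breakdown per parameter (mixed-precision Adam):
#   2 B fp16 weights + 2 B fp16 grads
#   + 4 B fp32 master weights + 4 B momentum + 4 B variance = 16 B/param
params = 70e9
bytes_per_param = 2 + 2 + 4 + 4 + 4
needed_gb = params * bytes_per_param / 1e9   # training states only, no activations
cluster_gb = 8 * 80
print(f"need ~{needed_gb:.0f} GB of state, have {cluster_gb} GB")

# Training states alone exceed the cluster, so plain DDP (which replicates
# all 16 B/param on every rank) cannot work. ZeRO-3 shards those bytes
# across ranks instead:
per_gpu_zero3 = needed_gb / 8
print(f"ZeRO-3: ~{per_gpu_zero3:.0f} GB/GPU")  # tight but plausible on 80 GB
```

The same arithmetic, with different bytes-per-parameter, answers the inference versions of the question (problems #110 and #112 above).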

8 problems · suggested order

  1. #109 Transformer Parameter Count (easy)
  2. #110 Weight Memory by dtype (easy)
  3. #112 Total Inference Memory (medium)
  4. #115 Activation Memory (Transformer) (medium)
  5. #24 Data Parallelism: Gradient Averaging (easy)
  6. #78 ZeRO Stage 1: Optimizer State Sharding (medium)
  7. #98 ZeRO Stage 3: Parameter Sharding (hard)
  8. #108 PyTorch: Simulated Data Parallel Gradient Averaging (hard)
Tracks are curated by hand. The order above is the suggested learning progression — feel free to skip around if you already know a topic.
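The core move in data parallelism (problems #24 and #108 above) is that averaging per-rank gradients over equal-sized shards reproduces the full-batch gradient. A minimal pure-Python sketch, using a hypothetical scalar model `y = w * x` with squared loss rather than the real `torch.distributed` API:

```python
# Simulated data-parallel gradient averaging: each "rank" computes the
# gradient of mean squared error on its own data shard; the mean of the
# per-rank gradients (the all-reduce step in real DDP) equals the
# gradient a single worker would compute on the full batch.

def grad(w, xs, ys):
    # d/dw of mean((w*x - y)^2) over one shard
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

world_size = 2
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
w = 0.0

# Round-robin sharding of the global batch across ranks.
shards = [(xs[r::world_size], ys[r::world_size]) for r in range(world_size)]
per_rank = [grad(w, sx, sy) for sx, sy in shards]
avg = sum(per_rank) / world_size   # simulated all-reduce (mean)
full = grad(w, xs, ys)             # single-worker reference
print(avg, full)  # -30.0 -30.0 — identical because shards are equal-sized
```

With unequal shard sizes the plain mean of per-rank gradients is no longer exactly the full-batch gradient, which is why real loaders pad or drop the last partial batch.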

© 2026 TorchedUp. All rights reserved.
