TorchedUp
Learn (Beta) | Problems | System Design (Soon) | Premium

From softmax to LLM internals

252 hands-on coding problems on the math, transformer internals, inference systems, and distributed training that run today's frontier models: Flash Attention, paged KV caches, MoE routing, ZeRO, Megatron parallelism, and more. Build the ML engineering depth that compounds across a career, not just for the next interview.

Start practicing →

Go Premium

252 coding problems

53 hard-difficulty

14 topic areas

Numerically Stable Softmax

Easy

Try it now, no signup required

Implement a softmax that handles large input values without overflow. Hint: subtract max(x) from every element before applying exp; softmax is invariant to this shift, so the output is unchanged.
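
One way the hint might be sketched, using only the Python standard library rather than NumPy so it runs anywhere. The function name `stable_softmax` and the flat-list input are assumptions made for illustration; the problem's actual signature and test cases are not shown on this page.

```python
import math

def stable_softmax(xs):
    """Softmax over a flat list of floats, shifted by max(xs) for stability.

    Hypothetical helper for illustration; not the site's reference solution.
    """
    if not xs:
        return []
    m = max(xs)
    # After the shift every exponent is <= 0, so exp() stays in (0, 1].
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)  # >= 1, because the max element contributes exp(0) = 1
    return [e / total for e in exps]

# Calling math.exp(1000.0) directly raises OverflowError;
# the shifted version handles the same inputs without trouble.
large = stable_softmax([1000.0, 1001.0, 1002.0])
small = stable_softmax([0.0, 1.0, 2.0])
```

Both inputs differ only by a constant offset, so `large` and `small` should agree, which is the shift invariance the hint relies on.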


Solved this? 250+ more problems waiting →

What you'll practice

∑

NumPy fundamentals

Softmax, attention, backprop, Adam — the building blocks every MLE must know.

⚡

Transformer internals

RoPE, Flash Attention, GQA, MoE — implement the architectures powering GPT-4, LLaMA, and Gemini.

🔥

LLM inference & serving

KV cache, PagedAttention, speculative decoding, continuous batching — vLLM internals as coding problems.

⚙

Distributed systems

ZeRO, FSDP, Megatron parallelism, pipeline parallelism — what runs at the largest scale.

Coming soon

In development
🏗

ML System Design

Walk through the full design of recommendation, ranking, search, fraud, and ads pipelines: feature stores, training infrastructure, deployment, and monitoring. Lessons plus interactive case studies.

⚡

Inference System Design

How to design and scale LLM serving: vLLM internals, paged attention, continuous batching, prefix caching, multi-LoRA, tensor-parallel deployment, and capacity planning.

Why TorchedUp

Math first, code second

Every problem starts from the formal math (softmax, attention, backprop, KL divergence) and asks you to derive it in code. Memorizing PyTorch APIs won't help; understanding the equations will.

Verified correctness

Property tests check mathematical invariants (the output sums to 1, the gradient matches a finite-difference estimate), not just one numeric value. You can't fake your way through.
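
To make the two invariants concrete, here is a rough sketch of what such property checks could look like for softmax. The helper names, the step size `eps`, and the sample input are all assumptions chosen for illustration; the site's actual test harness is not shown here.

```python
import math

def softmax(xs):
    """Max-shifted softmax over a flat list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_jacobian(xs):
    """Analytic Jacobian: J[i][j] = p_i * (delta_ij - p_j)."""
    p = softmax(xs)
    n = len(p)
    return [[p[i] * ((1.0 if i == j else 0.0) - p[j]) for j in range(n)]
            for i in range(n)]

def finite_difference_jacobian(f, xs, eps=1e-6):
    """Central-difference estimate of d f_i / d x_j."""
    n = len(xs)
    columns = []
    for j in range(n):
        plus, minus = list(xs), list(xs)
        plus[j] += eps
        minus[j] -= eps
        fp, fm = f(plus), f(minus)
        columns.append([(a - b) / (2 * eps) for a, b in zip(fp, fm)])
    # columns[j][i] holds d f_i / d x_j; transpose into row-major J[i][j].
    return [[columns[j][i] for j in range(n)] for i in range(n)]

def max_abs_diff(a, b):
    """Largest element-wise absolute difference between two matrices."""
    return max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))

x = [0.5, -1.2, 3.0, 0.0]  # arbitrary sample input
sum_error = abs(sum(softmax(x)) - 1.0)
grad_error = max_abs_diff(softmax_jacobian(x),
                          finite_difference_jacobian(softmax, x))
```

A check of this kind would pass when both `sum_error` and `grad_error` fall below a small tolerance, whatever the specific input values are.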

Foundations to systems

NumPy basics → attention variants → distributed training → inference optimization. The whole stack a senior MLE needs to know, in one place.

Ready to TorchedUp?

Free to start. Premium unlocks solutions, hints, and unlimited runs.

Browse all 252 problems →

© 2026 TorchedUp. All rights reserved.

Changelog | Contact Us | Terms of Service | Privacy Policy | Refund Policy