TorchedUp

Debug: Gradient Accumulation (Medium)

Gradient accumulation simulates a large effective batch size by accumulating gradients over K micro-batches before updating weights. There is a common normalization bug in how the final loss is computed.

Signature: def buggy_grad_accumulation(losses, accumulation_steps)

  • losses: flat list of per-sample losses, length = accumulation_steps * batch_size
  • accumulation_steps: K, the number of micro-batches to accumulate
  • Returns: scalar — the correctly normalized mean loss over all samples

The bug: The code averages each micro-batch independently, then sums those averages — but never divides by K. This gives a result K times larger than the correct global mean, which inflates the effective learning rate by a factor of K.
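A minimal sketch of the bug described above; the exact buggy code is not shown on this page, so the body below is a hypothetical reconstruction that reproduces the stated behavior:

```python
import numpy as np

def buggy_grad_accumulation(losses, accumulation_steps):
    """Hypothetical reconstruction of the described bug."""
    losses = np.asarray(losses, dtype=float)
    # Split the flat loss list into K equal micro-batches.
    micro_batches = losses.reshape(accumulation_steps, -1)
    total = 0.0
    for mb in micro_batches:
        total += mb.mean()  # average each micro-batch independently...
    return total            # ...but never divide by K

# With losses [1, 2, 3, 4, 5, 6] and K = 3, the micro-batch means are
# 1.5, 3.5, 5.5, so this returns 10.5 -- K times the true mean of 3.5.
```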

The fix: After accumulating the K micro-batch means, divide by accumulation_steps to get the true mean over all K * batch_size samples.
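A sketch of the corrected version; the function name `fixed_grad_accumulation` is illustrative, not from the problem statement:

```python
import numpy as np

def fixed_grad_accumulation(losses, accumulation_steps):
    """Accumulate K micro-batch means, then divide by K."""
    losses = np.asarray(losses, dtype=float)
    micro_batches = losses.reshape(accumulation_steps, -1)
    accumulated = sum(mb.mean() for mb in micro_batches)
    # Dividing by K yields the true mean over all K * batch_size
    # samples (micro-batches here are equal-sized, so this equals
    # np.mean(losses)).
    return accumulated / accumulation_steps
```

With equal-sized micro-batches, averaging the K means is equivalent to a single global mean, which is why the K = 1 case is a useful sanity check.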

Test Results

○6 losses, 3 accumulation steps (buggy returns 10.5, correct is 3.5)
○4 losses, 2 accumulation steps
○3 losses, 1 accumulation step (sum and mean are equal — sanity check)