Gradient accumulation simulates a large effective batch size by accumulating gradients over K micro-batches before updating weights. There is a common off-by-one bug in how the final loss is normalized.
Signature: def buggy_grad_accumulation(losses, accumulation_steps)
- losses: flat list of per-sample losses, length = accumulation_steps * batch_size
- accumulation_steps: K, the number of micro-batches to accumulate

The bug: The code averages each micro-batch independently, then sums those averages, but never divides by K. This gives a result K times larger than the correct global mean, which inflates the effective learning rate by a factor of K.
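A minimal sketch of the buggy behavior, assuming the flat loss list is split into K contiguous, equal-sized micro-batches (the splitting scheme is an assumption; only the function name and parameters come from the signature above):

```python
def buggy_grad_accumulation(losses, accumulation_steps):
    # Assumed layout: K contiguous micro-batches of equal size.
    batch_size = len(losses) // accumulation_steps
    total = 0.0
    for k in range(accumulation_steps):
        micro = losses[k * batch_size:(k + 1) * batch_size]
        total += sum(micro) / len(micro)  # mean of this micro-batch
    # BUG: sums the K micro-batch means without dividing by K,
    # so the result is K times the true global mean.
    return total
```

For example, with losses = [1.0, 2.0, 3.0, 4.0] and K = 2, the micro-batch means are 1.5 and 3.5; the buggy function returns 5.0, while the true mean over all four samples is 2.5.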
The fix: After accumulating the K micro-batch means, divide by accumulation_steps to get the true mean over all K * batch_size samples.