Gradient accumulation simulates a large effective batch size by accumulating gradients over K micro-batches before updating weights. There is a common off-by-one bug in how the final loss is normalized.
Signature: def buggy_grad_accumulation(losses, accumulation_steps)
- losses: flat list of per-sample losses, length = accumulation_steps * batch_size
- accumulation_steps: K, the number of micro-batches to accumulate

The bug: The code averages each micro-batch independently, then sums those averages, but never divides by K. This gives a result K times larger than the correct global mean, which inflates the effective learning rate by a factor of K.
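A minimal sketch of the buggy behavior, assuming the flat loss list is split into K contiguous, equal-sized micro-batches (the splitting scheme is an assumption; only the function name and parameters come from the signature above):

```python
def buggy_grad_accumulation(losses, accumulation_steps):
    # Assumed layout: K contiguous micro-batches of equal size.
    batch_size = len(losses) // accumulation_steps
    total = 0.0
    for k in range(accumulation_steps):
        micro = losses[k * batch_size:(k + 1) * batch_size]
        total += sum(micro) / len(micro)  # mean of this micro-batch
    # BUG: sums the K micro-batch means without dividing by K,
    # so the result is K times the true global mean.
    return total
```

For example, with losses = [1.0, 2.0, 3.0, 4.0] and K = 2, the micro-batch means are 1.5 and 3.5; the buggy function returns 5.0, while the true mean over all four samples is 2.5.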
The fix: After accumulating the K micro-batch means, divide by accumulation_steps to get the true mean over all K * batch_size samples.