When a batch is too large for memory, you split it into accumulation_steps micro-batches, accumulate their gradients, and apply a single optimizer step. To match the gradient of one big batch you must average, not merely sum, the per-step gradients, since the big-batch gradient is the mean over all its examples.
Signature: def accumulated_gradient(grads: list, accumulation_steps: int) -> list
grads is a list of accumulation_steps numpy arrays of identical shape. Return the elementwise mean as a numpy array (or nested list).
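One straightforward implementation, assuming numpy is available: stack the micro-batch gradients along a new leading axis and take the mean over it.

```python
import numpy as np

def accumulated_gradient(grads: list, accumulation_steps: int) -> list:
    # Stack the per-micro-batch gradients into one array of shape
    # (accumulation_steps, *grad_shape), then average elementwise.
    stacked = np.stack(grads)
    return np.mean(stacked, axis=0)
```

For example, two micro-batch gradients `[1.0, 2.0]` and `[3.0, 4.0]` average to `[2.0, 3.0]`, exactly the gradient one combined batch would produce.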