When a batch is too large for memory, you split it into accumulation_steps micro-batches, accumulate their gradients, and apply a single optimizer step. To match the gradient of one big batch you must average, not merely sum, the per-step gradients, since the big-batch gradient is the mean over all its examples.
Signature: def accumulated_gradient(grads: list, accumulation_steps: int) -> list
grads is a list of accumulation_steps numpy arrays of identical shape. Return the elementwise mean as a numpy array (or nested list).
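One straightforward implementation, assuming numpy is available: stack the micro-batch gradients along a new leading axis and take the mean over it.

```python
import numpy as np

def accumulated_gradient(grads: list, accumulation_steps: int) -> list:
    # Stack the per-micro-batch gradients into one array of shape
    # (accumulation_steps, *grad_shape), then average elementwise.
    stacked = np.stack(grads)
    return np.mean(stacked, axis=0)
```

For example, two micro-batch gradients `[1.0, 2.0]` and `[3.0, 4.0]` average to `[2.0, 3.0]`, exactly the gradient one combined batch would produce.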