Ring All-Reduce is the standard algorithm for synchronizing gradients across GPUs in data-parallel training. N workers arrange themselves in a logical ring, and each worker's D-element gradient vector is split into N chunks. The algorithm runs in two phases:
Reduce-scatter phase (N-1 steps): In each step, every worker sends one chunk to its right neighbour and receives and accumulates a chunk from its left neighbour. After N-1 steps, every worker holds the fully-reduced sum for exactly one chunk.
All-gather phase (N-1 steps): Each worker circulates its fully-reduced chunk around the ring so that every worker ends up with the complete reduced vector.
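A minimal single-process sketch of both phases, matching the `ring_all_reduce(gradients)` signature given below; row i of a NumPy array plays worker i, and the chunk schedule (worker i sends chunk (i - s) mod N in step s) is one common convention, assumed here rather than mandated by the problem:

```python
import numpy as np

def ring_all_reduce(gradients):
    """Simulate ring all-reduce; row i of `gradients` plays worker i."""
    grads = np.asarray(gradients, dtype=float).copy()
    n, d = grads.shape

    # Partition each row into n chunks (np.array_split tolerates d % n != 0).
    sizes = [len(part) for part in np.array_split(np.arange(d), n)]
    bounds = np.concatenate(([0], np.cumsum(sizes)))
    chunk = [slice(bounds[k], bounds[k + 1]) for k in range(n)]

    # Reduce-scatter: in step s, worker i sends chunk (i - s) mod n to its
    # right neighbour, which accumulates it. Snapshot the outgoing chunks
    # first so the sequential loop mimics simultaneous sends.
    for s in range(n - 1):
        outgoing = [grads[i, chunk[(i - s) % n]].copy() for i in range(n)]
        for i in range(n):
            grads[(i + 1) % n, chunk[(i - s) % n]] += outgoing[i]

    # All-gather: worker i now owns the fully-reduced chunk (i + 1) mod n
    # and circulates it; received chunks overwrite rather than accumulate.
    for s in range(n - 1):
        outgoing = [grads[i, chunk[(i + 1 - s) % n]].copy() for i in range(n)]
        for i in range(n):
            grads[(i + 1) % n, chunk[(i + 1 - s) % n]] = outgoing[i]

    return grads
```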
The key insight: each worker transmits 2(N-1)·D/N ≈ 2D elements in total, essentially independent of N, which makes ring all-reduce bandwidth-optimal.
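To see why, count the traffic: each of the 2(N-1) steps moves one chunk of D/N elements per worker. A quick numerical check (the sizes below are arbitrary examples):

```python
# Per-worker traffic 2(N-1)*D/N approaches 2D as the ring grows.
D = 1_000_000
for N in (2, 4, 8, 64):
    sent = 2 * (N - 1) * D // N
    print(f"N={N:>2}: {sent:,} elements sent per worker (2D = {2 * D:,})")
```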
Signature: `def ring_all_reduce(gradients)`
gradients: 2-D array of shape `(num_workers, vector_size)`; each row is one worker's local gradient vector.
Returns: 2-D array of shape `(num_workers, vector_size)`; every row contains the element-wise sum of all input rows.
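A small check of that contract using the sketch above (the seed and shapes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
gradients = rng.standard_normal((4, 10))  # 4 workers, 10-element vectors
reduced = ring_all_reduce(gradients)

# Every output row must equal the element-wise sum of all input rows.
assert np.allclose(reduced, gradients.sum(axis=0))
```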