Ring All-Reduce (Medium)

Ring All-Reduce is how GPUs synchronize gradients in data-parallel training. N workers arrange themselves in a logical ring, and each worker's length-D gradient vector is split into N chunks. The algorithm runs in two phases:

Reduce-scatter phase (N-1 steps): At each step, every worker sends one chunk to its right neighbour and receives one chunk from its left neighbour, accumulating the received chunk into its own copy. After N-1 steps, every worker holds the fully reduced sum for exactly one chunk.

All-gather phase (N-1 steps): Each worker circulates the fully reduced chunks around the ring, forwarding the chunk it received most recently, so that after another N-1 steps every worker holds the complete reduced vector.

The key insight: each worker transmits only D/N elements per step across 2(N-1) total steps, so per-worker traffic is 2(N-1)·D/N ≈ 2D, essentially independent of N. That makes the ring schedule bandwidth-optimal.
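
A quick numeric check of that formula (an illustrative script, not part of the problem; D here is an assumed gradient size):

D = 1_000_000  # assumed gradient size, for illustration only
for N in (2, 4, 8, 64, 512):
    sent = 2 * (N - 1) * D / N  # elements each worker sends in total
    print(N, sent / D)          # ratio approaches 2 as N grows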

Signature: def ring_all_reduce(gradients)

  • gradients: 2-D array of shape (num_workers, vector_size) — each row is one worker's local gradient vector
  • Returns: 2-D array of shape (num_workers, vector_size) — every row contains the element-wise sum of all input rows
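
Below is a minimal single-process sketch of one possible solution: the "workers" are rows of a numpy array and "sending" is just indexed copying, but the chunk schedule follows the two phases above. The scheduling convention (worker r sends chunk (r - t) mod N at reduce-scatter step t) is one common choice, not something the problem mandates.

import numpy as np

def ring_all_reduce(gradients):
    grads = np.asarray(gradients, dtype=float).copy()
    n, d = grads.shape
    if n == 1:
        return grads  # single worker: its gradient is already the full sum

    # Chunk c spans columns bounds[c]:bounds[c+1]; linspace keeps the chunks
    # roughly equal even when n does not divide d (some chunks may be empty).
    bounds = np.linspace(0, d, n + 1).astype(int)

    def chunk(w, c):
        # View of worker w's c-th chunk, so in-place updates modify grads.
        return grads[w, bounds[c]:bounds[c + 1]]

    # Phase 1, reduce-scatter: at step t, worker r sends chunk (r - t) mod n
    # to its right neighbour, which accumulates it. After n - 1 steps,
    # worker r holds the fully reduced chunk (r + 1) mod n.
    for t in range(n - 1):
        for r in range(n):
            c = (r - t) % n
            dst = chunk((r + 1) % n, c)
            dst += chunk(r, c)

    # Phase 2, all-gather: at step t, worker r forwards its freshest fully
    # reduced chunk, (r + 1 - t) mod n, and the receiver overwrites its copy.
    for t in range(n - 1):
        for r in range(n):
            c = (r + 1 - t) % n
            dst = chunk((r + 1) % n, c)
            dst[:] = chunk(r, c)

    return grads

On the first sample test (3 workers, vector size 2), ring_all_reduce([[1, 2], [3, 4], [5, 6]]) should return three identical rows equal to [9., 12.].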

Tags: Math

Language: Python (numpy)

Test Results

  • 3 workers, vector size 2
  • 4 workers, uniform values
  • single worker is a no-op (Premium)