
Data Parallelism: Gradient Averaging (Easy)

In Distributed Data Parallel (DDP) training, each worker processes a different micro-batch, computes local gradients independently, and then averages those gradients across all workers before the optimizer step. This is mathematically equivalent to computing gradients on the full concatenated batch.
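For intuition, a short sketch of why the equivalence holds, assuming equal micro-batch sizes and a mean-reduced loss (the usual DDP setup): if L_k is worker k's mean loss over its micro-batch, then

  \bar{g} \;=\; \frac{1}{N}\sum_{k=1}^{N} \nabla_\theta L_k(\theta) \;=\; \nabla_\theta\!\left(\frac{1}{N}\sum_{k=1}^{N} L_k(\theta)\right), \qquad \theta \;\leftarrow\; \theta - \eta\,\bar{g}

and the right-hand side is exactly the gradient of the mean loss over the concatenated batch, followed by a plain SGD step with learning rate \eta.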

Implement one DDP-style SGD step:

  1. Average the per-worker gradients.
  2. Update parameters with vanilla SGD: params = params - lr * avg_grad.

Signature: def data_parallel_step(params, worker_grads, lr)

  • params: 1-D array of shape (param_size,) — current model parameters
  • worker_grads: 2-D array of shape (num_workers, param_size) — each row is one worker's gradient
  • lr: float — learning rate
  • Returns: 1-D array of shape (param_size,) — updated parameters
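A minimal NumPy sketch matching the signature above. The mean over axis 0 stands in for DDP's gradient all-reduce; the demo values at the bottom are illustrative, not the hidden test data.

import numpy as np

def data_parallel_step(params, worker_grads, lr):
    # Average gradients across workers (the all-reduce step in real DDP).
    avg_grad = worker_grads.mean(axis=0)          # shape: (param_size,)
    # Vanilla SGD update on the averaged gradient.
    return params - lr * avg_grad

if __name__ == "__main__":
    # Illustrative check with 2 workers and 3 params (made-up values).
    params = np.array([1.0, 2.0, 3.0])
    worker_grads = np.array([[0.2, 0.0, -0.4],
                             [0.0, 0.4,  0.4]])
    print(data_parallel_step(params, worker_grads, lr=0.1))
    # avg_grad = [0.1, 0.2, 0.0] -> prints [0.99 1.98 3.  ]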

Topic: Math

Language: Python (NumPy)

Test Results

○ 2 workers, 3 params
○ 4 workers, cancelling gradients
○ single worker equals plain SGD (Premium)