In Distributed Data Parallel (DDP) training, each worker processes a different micro-batch, computes local gradients independently, then averages those gradients across all workers before the optimizer step. When every micro-batch is the same size, this is mathematically equivalent to computing gradients on the full concatenated batch.
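In symbols (a restatement of the step described above; the notation, with W workers, per-worker gradients g_i, and learning rate η, is mine rather than the source's):

```latex
\bar{g} = \frac{1}{W} \sum_{i=1}^{W} g_i,
\qquad
\theta \leftarrow \theta - \eta \, \bar{g}
```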
Implement one DDP-style SGD step:
params = params - lr * avg_grad

Signature: def data_parallel_step(params, worker_grads, lr)

Parameters:
- params: 1-D array of shape (param_size,) — current model parameters
- worker_grads: 2-D array of shape (num_workers, param_size) — each row is one worker's gradient
- lr: float — learning rate

Returns: 1-D array of shape (param_size,) — updated parameters
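A minimal sketch of one possible solution, assuming NumPy arrays as inputs (this is an illustrative implementation, not a reference answer from the source):

```python
import numpy as np

def data_parallel_step(params, worker_grads, lr):
    """One DDP-style SGD step: average per-worker gradients, then update params."""
    avg_grad = np.mean(worker_grads, axis=0)  # (param_size,): mean over the worker axis
    return params - lr * avg_grad             # plain SGD update with the averaged gradient

# Example usage with 2 workers and 3 parameters (values chosen for illustration)
params = np.array([1.0, 2.0, 3.0])
worker_grads = np.array([[0.2, 0.0, 0.4],
                         [0.0, 0.2, 0.0]])
print(data_parallel_step(params, worker_grads, lr=0.1))
# -> [0.99 1.99 2.98]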