Simulate the core gradient-averaging step of Distributed Data Parallel (DDP) using PyTorch.
Signature: def simulate_data_parallel(model_weights, data_shards, lr)
Parameters:
- model_weights: list of floats (n_weights,), the shared initial weights
- data_shards: list of lists, each (shard_size, n_weights), the input data per worker
- lr: learning rate (float)

Algorithm:
1. For each worker, create a local copy of the weights with requires_grad=True.
2. Compute out = (x @ w_local).sum() (the sum of dot products over the batch).
3. Call .backward() to compute w_local.grad.
4. Average the per-worker gradients: avg_grad = mean([g0, g1, ..., gK]).
5. Apply one SGD step: w_new = w - lr * avg_grad.

Why? In DDP, each GPU processes a different shard of data but holds identical model parameters. After each backward pass, gradients are averaged (all-reduce) across GPUs before the optimizer step, ensuring all replicas stay in sync.
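A minimal sketch of one way to implement this, assuming the inputs are plain Python lists as described above (the float64 dtype is an assumption, chosen for numerical clarity):

```python
import torch

def simulate_data_parallel(model_weights, data_shards, lr):
    grads = []
    for shard in data_shards:
        # Each "worker" gets its own leaf tensor so gradients don't mix.
        w_local = torch.tensor(model_weights, dtype=torch.float64, requires_grad=True)
        x = torch.tensor(shard, dtype=torch.float64)  # (shard_size, n_weights)
        out = (x @ w_local).sum()                     # sum of dot products over the batch
        out.backward()                                # populates w_local.grad
        grads.append(w_local.grad)

    # Simulated all-reduce: average gradients across workers, then one SGD step.
    avg_grad = torch.stack(grads).mean(dim=0)
    w = torch.tensor(model_weights, dtype=torch.float64)
    w_new = w - lr * avg_grad
    return w_new.tolist()
```

A quick check with two workers: for out = (x @ w).sum(), the gradient with respect to w is x.sum(dim=0), so the shards below give per-worker gradients [1, 1] and [2, 2], an averaged gradient of [1.5, 1.5], and an updated weight of [0.85, -2.15] at lr=0.1.

```python
weights = [1.0, -2.0]
shards = [[[1.0, 0.0], [0.0, 1.0]],  # worker 0: two samples
          [[2.0, 2.0]]]              # worker 1: one sample
print(simulate_data_parallel(weights, shards, lr=0.1))  # [0.85, -2.15]
```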