Pipeline Parallelism (GPipe)

Pipeline parallelism splits a model's layers across devices, so that each device holds one stage. GPipe's approach: split the mini-batch into M micro-batches and pipeline them through the stages. Each stage processes one micro-batch at a time, so while stage k works on micro-batch m, stage k+1 works on micro-batch m-1.
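The overlap described above can be made concrete by listing which stage handles which micro-batch at each clock tick. The following is a minimal sketch, assuming every stage takes exactly one tick per micro-batch; the function name gpipe_schedule and the 0-based indices are illustrative and not part of the problem statement.

```python
def gpipe_schedule(num_stages, num_micro_batches):
    """Return, for each clock tick, the active (stage, micro_batch) pairs.

    Assumption: each stage needs one tick per micro-batch, so stage k
    handles micro-batch m at tick m + k (all indices are 0-based).
    """
    ticks = []
    for t in range(num_stages + num_micro_batches - 1):
        active = [
            (k, t - k)
            for k in range(num_stages)
            if 0 <= t - k < num_micro_batches
        ]
        ticks.append(active)
    return ticks


schedule = gpipe_schedule(num_stages=3, num_micro_batches=4)
for t, active in enumerate(schedule):
    print(t, active)
```

Under this assumption the forward pass takes M + K - 1 ticks, and stages sit idle only while the pipeline fills and drains. With K = 3 and M = 4 that is 6 ticks, of which only ticks 2 and 3 keep all three stages busy.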

Simulate the GPipe forward pass: given K pipeline stages (each a single linear + tanh layer), pass each of the M micro-batches through all K stages in order.

Signature: def pipeline_forward(micro_batches, stage_weights, stage_biases)

micro_batches: (M, batch_size, d) — M micro-batches of data
stage_weights: (K, d, d) — K stage weight matrices
stage_biases: (K, d) — K stage biases
Returns: (M, batch_size, d) — output for each micro-batch after all K stages

Each stage applies: h = tanh(h @ W_k.T + b_k)

Math

h_k = \tanh\left(h_{k-1} W_k^{\top} + b_k\right), \quad k = 1, \dots, K

Here h_0 is the input micro-batch and h_K is the output after the final stage.
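One possible way to simulate this is sketched below with NumPy. It is an illustration under stated assumptions, not a reference solution: stages are 0-indexed in the code (the math above uses k = 1, ..., K), each stage is assumed to take one clock tick per micro-batch, and reference_forward is a hypothetical helper added only to check the result.

```python
import numpy as np


def pipeline_forward(micro_batches, stage_weights, stage_biases):
    """Simulate a GPipe-style forward pass on a single process.

    micro_batches: (M, batch_size, d), stage_weights: (K, d, d),
    stage_biases: (K, d). Returns an array of shape (M, batch_size, d).
    """
    acts = np.array(micro_batches, dtype=float)  # copy; acts[m] is updated in place
    weights = np.asarray(stage_weights, dtype=float)
    biases = np.asarray(stage_biases, dtype=float)
    M, K = acts.shape[0], weights.shape[0]

    # Assumed schedule: at tick t, stage k handles micro-batch m = t - k.
    # Within one tick every stage touches a different micro-batch, so the
    # inner loop models work that real devices would do concurrently.
    for t in range(M + K - 1):
        for k in range(K):
            m = t - k
            if 0 <= m < M:
                acts[m] = np.tanh(acts[m] @ weights[k].T + biases[k])
    return acts


def reference_forward(micro_batches, stage_weights, stage_biases):
    """Hypothetical check: apply the stages in order to all micro-batches at once."""
    h = np.asarray(micro_batches, dtype=float)
    for W, b in zip(stage_weights, stage_biases):
        h = np.tanh(h @ W.T + b)
    return h


rng = np.random.default_rng(0)
M, B, d, K = 4, 2, 3, 3
x = rng.normal(size=(M, B, d))
W = rng.normal(size=(K, d, d))
b = rng.normal(size=(K, d))
out = pipeline_forward(x, W, b)
print(out.shape)
print(np.allclose(out, reference_forward(x, W, b)))
```

Because micro-batches do not interact in the forward pass, the pipelined schedule and the plain stage-by-stage loop should give the same numbers; the schedule only changes the order in which the work is done.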
