Tensor Parallelism (Megatron-LM)

Megatron-LM splits individual linear layers across GPUs so that each GPU only stores and computes a fraction of the weight matrix. For a two-layer FFN block out = W2 · GeLU(W1 · x + b1) + b2:

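Column-parallel W1: split along the output (row) dimension. With N = num_gpus and chunk = d_ff / N, GPU i holds rows [i·chunk, (i+1)·chunk) of W1 and the matching slice of b1. (The "column" in the name follows Megatron-LM's Y = XA convention, where the transposed weight is split by columns.) Because GeLU is elementwise, each GPU applies it locally and produces its own slice of the hidden vector, of size d_ff / N, with no communication.
Row-parallel W2: split along the input (column) dimension. GPU i holds columns [i·chunk, (i+1)·chunk) of W2 and multiplies them by its hidden slice. Each GPU's result is a full-size (d_model,) partial sum; an all-reduce adds them together.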

The bias b2 is added only once, after the all-reduce; adding it on every GPU before the reduction would count it N times.
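
The split described above can be written out as a block decomposition. This is a sketch for illustration: N stands for num_gpus, and the superscripted blocks W1^(i), b1^(i), W2^(i) and the hidden slice h^(i) are notation introduced here, not part of the problem statement.

```latex
\[
W_1 = \begin{bmatrix} W_1^{(0)} \\ \vdots \\ W_1^{(N-1)} \end{bmatrix},\quad
b_1 = \begin{bmatrix} b_1^{(0)} \\ \vdots \\ b_1^{(N-1)} \end{bmatrix},\quad
W_2 = \begin{bmatrix} W_2^{(0)} & \cdots & W_2^{(N-1)} \end{bmatrix}
\]
\[
h^{(i)} = \mathrm{GeLU}\!\left(W_1^{(i)} x + b_1^{(i)}\right), \qquad
\mathrm{out} = \sum_{i=0}^{N-1} W_2^{(i)} h^{(i)} + b_2
\]
```

The sum over i is what the all-reduce computes, and b2 sits outside it, which is why it is added a single time.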

Signature: def tensor_parallel_linear(x, W1, b1, W2, b2, num_gpus=2)

x: 1-D array (d_model,)
W1: 2-D array (d_ff, d_model), b1: 1-D (d_ff,)
W2: 2-D array (d_model, d_ff), b2: 1-D (d_model,)
num_gpus: int — number of simulated GPUs (d_ff must be divisible by num_gpus)
Returns: 1-D array (d_model,)

GeLU definition: gelu(x) = 0.5 * x * (1 + erf(x / sqrt(2)))
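
One possible way to simulate the scheme on a single machine is sketched below. It assumes NumPy is available, loops over the simulated GPUs sequentially, and stands in for the all-reduce with a plain sum; the helper name gelu and the variable names are choices made for this sketch, not a reference solution.

```python
import math
import numpy as np


def gelu(z):
    """Exact GeLU, 0.5 * z * (1 + erf(z / sqrt(2))), applied elementwise."""
    z = np.asarray(z, dtype=float)
    # NumPy has no erf of its own, so wrap the standard-library one.
    erf = np.vectorize(math.erf, otypes=[float])
    return 0.5 * z * (1.0 + erf(z / math.sqrt(2.0)))


def tensor_parallel_linear(x, W1, b1, W2, b2, num_gpus=2):
    x, W1, b1, W2, b2 = (np.asarray(a, dtype=float) for a in (x, W1, b1, W2, b2))
    d_ff = W1.shape[0]
    if d_ff % num_gpus != 0:
        raise ValueError("d_ff must be divisible by num_gpus")
    chunk = d_ff // num_gpus

    partial_outputs = []
    for i in range(num_gpus):  # each iteration plays the role of one GPU
        lo, hi = i * chunk, (i + 1) * chunk
        # Column-parallel W1: rows [lo, hi) of W1 and the matching slice of b1.
        h_i = gelu(W1[lo:hi] @ x + b1[lo:hi])       # shape (chunk,)
        # Row-parallel W2: columns [lo, hi) of W2 times the local hidden slice.
        partial_outputs.append(W2[:, lo:hi] @ h_i)  # shape (d_model,)

    # Simulated all-reduce: sum the per-GPU partial outputs.
    out = np.sum(partial_outputs, axis=0)
    # b2 is added once, after the reduction.
    return out + b2
```

A quick sanity check is to compare the result against the unsplit computation W2 @ gelu(W1 @ x + b1) + b2 for several values of num_gpus; the two should agree up to floating-point rounding.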
