Megatron-LM splits individual linear layers across GPUs so that each GPU stores, and computes with, only a fraction of each weight matrix. For a two-layer FFN block out = W2 · GeLU(W1 · x + b1) + b2:
- GPU i holds rows [i·chunk, (i+1)·chunk) of W1 (and the matching slice of b1). Each GPU produces its own slice of the hidden vector, of size d_ff / N, and applies GeLU to it locally.
- GPU i holds columns [i·chunk, (i+1)·chunk) of W2. Each GPU computes a partial output of size d_model; an all-reduce sums the partial outputs.
- The bias b2 is added only once, after the all-reduce.
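To make the split concrete, here is a small sketch of which slices each simulated GPU would own. The shapes (d_model = 8, d_ff = 32, N = 4) are arbitrary placeholders, not part of the problem.

```python
import numpy as np

# Illustrative shapes only; any d_ff divisible by N works.
d_model, d_ff, N = 8, 32, 4
chunk = d_ff // N

W1 = np.zeros((d_ff, d_model))   # first linear layer, split by rows
W2 = np.zeros((d_model, d_ff))   # second linear layer, split by columns

# "GPU i" owns rows [i*chunk, (i+1)*chunk) of W1 ...
W1_shards = [W1[i * chunk:(i + 1) * chunk, :] for i in range(N)]   # each (d_ff/N, d_model)
# ... and columns [i*chunk, (i+1)*chunk) of W2.
W2_shards = [W2[:, i * chunk:(i + 1) * chunk] for i in range(N)]   # each (d_model, d_ff/N)
```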
Signature: def tensor_parallel_linear(x, W1, b1, W2, b2, num_gpus=2)
- x: 1-D array (d_model,)
- W1: 2-D array (d_ff, d_model); b1: 1-D array (d_ff,)
- W2: 2-D array (d_model, d_ff); b2: 1-D array (d_model,)
- num_gpus: int — number of simulated GPUs (d_ff must be divisible by num_gpus)
- Returns: 1-D array (d_model,)
- GeLU definition: gelu(x) = 0.5 * x * (1 + erf(x / sqrt(2)))
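A minimal single-process sketch of one possible solution follows, simulating the GPUs with a Python loop; the helper gelu and the loop structure are assumptions, not the only valid approach.

```python
import math
import numpy as np

_erf = np.vectorize(math.erf)  # elementwise erf without a SciPy dependency

def gelu(x):
    # GeLU exactly as defined above: 0.5 * x * (1 + erf(x / sqrt(2)))
    return 0.5 * x * (1.0 + _erf(x / math.sqrt(2.0)))

def tensor_parallel_linear(x, W1, b1, W2, b2, num_gpus=2):
    d_ff = W1.shape[0]
    assert d_ff % num_gpus == 0, "d_ff must be divisible by num_gpus"
    chunk = d_ff // num_gpus

    partial_outputs = []
    for i in range(num_gpus):                          # each iteration simulates one GPU
        W1_i = W1[i * chunk:(i + 1) * chunk, :]        # rows of W1 -> (chunk, d_model)
        b1_i = b1[i * chunk:(i + 1) * chunk]           # matching slice of b1 -> (chunk,)
        W2_i = W2[:, i * chunk:(i + 1) * chunk]        # columns of W2 -> (d_model, chunk)

        # Local hidden slice of size d_ff / num_gpus; GeLU needs no communication
        # because the hidden dimension is sharded, not partially summed.
        h_i = gelu(W1_i @ x + b1_i)                    # (chunk,)
        partial_outputs.append(W2_i @ h_i)             # partial output, (d_model,)

    # Simulated all-reduce: sum the partial outputs, then add b2 exactly once.
    return np.sum(partial_outputs, axis=0) + b2
```

Whatever implementation you write, it should match the unsharded forward pass for any num_gpus that divides d_ff:

```python
rng = np.random.default_rng(0)
d_model, d_ff = 8, 32
x = rng.standard_normal(d_model)
W1, b1 = rng.standard_normal((d_ff, d_model)), rng.standard_normal(d_ff)
W2, b2 = rng.standard_normal((d_model, d_ff)), rng.standard_normal(d_model)

reference = W2 @ gelu(W1 @ x + b1) + b2
assert np.allclose(tensor_parallel_linear(x, W1, b1, W2, b2, num_gpus=4), reference)
```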