
Tensor Parallelism (Megatron-LM)
Difficulty: Hard

Megatron-LM splits individual linear layers across GPUs so that each GPU stores only a fraction of each weight matrix and performs only its share of the computation. For a two-layer FFN block out = W2 · GeLU(W1 · x + b1) + b2:

  • Column-parallel W1 — split along the output (row) dimension: GPU i holds rows [i·chunk, (i+1)·chunk) of W1, where chunk = d_ff / N for N GPUs. Each GPU produces its own shard of the hidden vector, of size d_ff / N; because GeLU is applied elementwise, no communication is needed at this point.
  • Row-parallel W2 — split along the input (column) dimension: GPU i holds columns [i·chunk, (i+1)·chunk) of W2. Each GPU computes a partial output of size d_model; an all-reduce sums them.

The bias b2 is added exactly once, after the all-reduce; adding it on every GPU before the reduction would count it num_gpus times.
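As a sanity check on this scheme, here is a minimal numpy sketch (the variable names are illustrative only) verifying the two identities it relies on: concatenating the column-parallel shards of W1 · x reproduces the full product, and summing the row-parallel partial products of W2 reproduces W2 · h:

  import numpy as np

  rng = np.random.default_rng(0)
  d_model, d_ff, N = 4, 8, 2
  chunk = d_ff // N

  x = rng.standard_normal(d_model)
  W1 = rng.standard_normal((d_ff, d_model))
  W2 = rng.standard_normal((d_model, d_ff))

  # Column-parallel: each GPU's row-block of W1 yields a contiguous shard
  # of the hidden vector.
  h_shards = [W1[i * chunk:(i + 1) * chunk] @ x for i in range(N)]
  assert np.allclose(np.concatenate(h_shards), W1 @ x)

  # Row-parallel: each GPU's column-block of W2 yields a partial output;
  # summing the partials plays the role of the all-reduce.
  h = W1 @ x
  partials = [W2[:, i * chunk:(i + 1) * chunk] @ h[i * chunk:(i + 1) * chunk]
              for i in range(N)]
  assert np.allclose(sum(partials), W2 @ h)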

Signature: def tensor_parallel_linear(x, W1, b1, W2, b2, num_gpus=2)

  • x: 1-D array (d_model,)
  • W1: 2-D array (d_ff, d_model), b1: 1-D (d_ff,)
  • W2: 2-D array (d_model, d_ff), b2: 1-D (d_model,)
  • num_gpus: int — number of simulated GPUs (d_ff must be divisible by num_gpus)
  • Returns: 1-D array (d_model,)

GeLU definition: gelu(x) = 0.5 * x * (1 + erf(x / sqrt(2)))
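
For concreteness, here is one possible reference sketch of the simulated kernel. It assumes the erf-based GeLU above; scipy.special.erf is used for erf, which is an implementation choice, not a requirement of the problem:

  import numpy as np
  from scipy.special import erf

  def gelu(x):
      # Exact GeLU: 0.5 * x * (1 + erf(x / sqrt(2)))
      return 0.5 * x * (1.0 + erf(x / np.sqrt(2.0)))

  def tensor_parallel_linear(x, W1, b1, W2, b2, num_gpus=2):
      d_ff = W1.shape[0]
      assert d_ff % num_gpus == 0, "d_ff must be divisible by num_gpus"
      chunk = d_ff // num_gpus

      partials = []
      for i in range(num_gpus):  # each iteration simulates one GPU
          rows = slice(i * chunk, (i + 1) * chunk)
          # Column-parallel W1: this GPU owns rows [i*chunk, (i+1)*chunk) of W1
          # and the matching slice of b1; its shard of the hidden vector is
          # complete, so GeLU can be applied locally.
          h_i = gelu(W1[rows] @ x + b1[rows])
          # Row-parallel W2: this GPU owns the matching columns of W2 and
          # produces a partial output of size d_model.
          partials.append(W2[:, rows] @ h_i)

      # Simulated all-reduce: sum the partial outputs, then add b2 exactly once.
      return sum(partials) + b2

Because GeLU is elementwise, applying it shard-by-shard is exact; this is why Megatron pairs a column-parallel layer with a row-parallel one, so no communication is needed between the two matmuls.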

Language: Python (numpy)
Test Cases

  • d_model=4, d_ff=8, num_gpus=2 (seed 42)
  • d_model=8, d_ff=16, num_gpus=4 (seed 7)
  • d_model=4, d_ff=8, zero bias, num_gpus=2 (seed 0)
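
The harness's input-generation scheme is not shown on this page. Assuming each case draws x, W1, W2, and the biases from np.random.default_rng(seed) (an assumption; the site may seed differently), a local self-check against the unsharded computation could look like this, reusing the gelu and tensor_parallel_linear sketches above:

  def reference(x, W1, b1, W2, b2):
      # Single-GPU ground truth.
      return W2 @ gelu(W1 @ x + b1) + b2

  for d_model, d_ff, num_gpus, seed, zero_bias in [
      (4, 8, 2, 42, False),
      (8, 16, 4, 7, False),
      (4, 8, 2, 0, True),
  ]:
      rng = np.random.default_rng(seed)  # hypothetical seeding, see note above
      x = rng.standard_normal(d_model)
      W1 = rng.standard_normal((d_ff, d_model))
      W2 = rng.standard_normal((d_model, d_ff))
      b1 = np.zeros(d_ff) if zero_bias else rng.standard_normal(d_ff)
      b2 = np.zeros(d_model) if zero_bias else rng.standard_normal(d_model)
      out = tensor_parallel_linear(x, W1, b1, W2, b2, num_gpus=num_gpus)
      assert np.allclose(out, reference(x, W1, b1, W2, b2))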