TorchedUp

25. Tensor Parallelism (Megatron-LM)

Hard

Megatron-LM splits individual linear layers across N GPUs so that each GPU stores and computes only a fraction of each weight matrix. For a two-layer FFN block out = W2 · GeLU(W1 · x + b1) + b2:

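  • Column-parallel W1: split along the output dimension, which is the row axis of W1 in the (d_ff, d_model) layout used here. GPU i holds rows [i·chunk, (i+1)·chunk) of W1 and the matching slice of b1, where chunk = d_ff / N. Each GPU produces a partial hidden vector of size d_ff / N and applies GeLU to it locally, since GeLU is elementwise. (The name follows the Megatron-LM convention Y = XA, where the stored weight is the transpose of W1.)
  • Row-parallel W2: split along the input dimension, which is the column axis of W2 in the (d_model, d_ff) layout used here. GPU i holds columns [i·chunk, (i+1)·chunk) of W2. Each GPU computes a partial output of size d_model; an all-reduce sums the N partial outputs.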

The bias b2 is added only once, after the all-reduce; adding it on every GPU before the reduction would count it N times.
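The sharding described above can be sketched with plain NumPy slicing. This is only an illustration: the sizes, the random weights, and the names `W1_shards`, `b1_shards`, `W2_shards`, and `shard_shapes` are assumptions chosen for the example, not part of the problem statement.

```python
import numpy as np

# Hypothetical sizes, chosen only for illustration.
d_model, d_ff, num_gpus = 3, 8, 2
chunk = d_ff // num_gpus  # hidden units owned by each simulated GPU

rng = np.random.default_rng(0)
W1 = rng.standard_normal((d_ff, d_model))
b1 = rng.standard_normal(d_ff)
W2 = rng.standard_normal((d_model, d_ff))

# W1 and b1 are sliced along the d_ff axis (the rows of W1).
W1_shards = [W1[i * chunk:(i + 1) * chunk, :] for i in range(num_gpus)]
b1_shards = [b1[i * chunk:(i + 1) * chunk] for i in range(num_gpus)]

# W2 is sliced along the same d_ff axis (the columns of W2).
W2_shards = [W2[:, i * chunk:(i + 1) * chunk] for i in range(num_gpus)]

# Per-GPU shapes: (chunk, d_model), (chunk,), (d_model, chunk).
shard_shapes = [
    (w1.shape, b.shape, w2.shape)
    for w1, b, w2 in zip(W1_shards, b1_shards, W2_shards)
]
```

Both matrices are cut along the same d_ff axis, so the hidden slice produced on GPU i is exactly the slice that its W2 shard consumes, and no communication is needed between the two layers.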

Signature: def tensor_parallel_linear(x, W1, b1, W2, b2, num_gpus=2)

  • x: 1-D array (d_model,)
  • W1: 2-D array (d_ff, d_model), b1: 1-D (d_ff,)
  • W2: 2-D array (d_model, d_ff), b2: 1-D (d_model,)
  • num_gpus: int — number of simulated GPUs (d_ff must be divisible by num_gpus)
  • Returns: 1-D array (d_model,)

GeLU definition: gelu(x) = 0.5 * x * (1 + erf(x / sqrt(2)))
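NumPy itself does not provide `erf`, so one possible way to implement this definition is to vectorize `math.erf` from the standard library. The helper below is a minimal sketch under that assumption; `scipy.special.erf` would be an alternative where SciPy is available.

```python
import math
import numpy as np

# Elementwise erf built from the standard library (an assumption of this
# sketch; NumPy has no erf of its own).
_erf = np.vectorize(math.erf, otypes=[float])

def gelu(x):
    """Exact GeLU: 0.5 * x * (1 + erf(x / sqrt(2))), applied elementwise."""
    x = np.asarray(x, dtype=float)
    return 0.5 * x * (1.0 + _erf(x / math.sqrt(2.0)))
```

Because the function is elementwise, applying it to a slice of the hidden vector gives the same values as slicing the result of applying it to the whole vector, which is what allows each GPU to run GeLU on its own shard.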

Math

$$\text{out} = \underbrace{\sum_{i=0}^{N-1} W_2^{(i)} \cdot \mathrm{GeLU}\left(W_1^{(i)} x + b_1^{(i)}\right)}_{\text{all-reduce sum}} + b_2$$
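The formula above can be simulated on a single machine by looping over shards and summing the partial outputs in place of a real all-reduce. The following is one possible sketch, not a reference solution: the `gelu` helper built on `math.erf`, the random test data, and the names `reference` and `sharded` are assumptions made for the example.

```python
import math
import numpy as np

_erf = np.vectorize(math.erf, otypes=[float])  # NumPy has no built-in erf

def gelu(x):
    return 0.5 * x * (1.0 + _erf(x / math.sqrt(2.0)))

def tensor_parallel_linear(x, W1, b1, W2, b2, num_gpus=2):
    x, W1, b1, W2, b2 = (np.asarray(a, dtype=float) for a in (x, W1, b1, W2, b2))
    d_ff = W1.shape[0]
    if d_ff % num_gpus != 0:
        raise ValueError("d_ff must be divisible by num_gpus")
    chunk = d_ff // num_gpus

    partials = []
    for i in range(num_gpus):
        sl = slice(i * chunk, (i + 1) * chunk)
        # Column-parallel W1: rows sl of W1 and the matching slice of b1.
        h_i = gelu(W1[sl, :] @ x + b1[sl])   # shape (chunk,)
        # Row-parallel W2: columns sl of W2 give a partial output.
        partials.append(W2[:, sl] @ h_i)     # shape (d_model,)

    out = np.sum(partials, axis=0)  # stands in for the all-reduce
    return out + b2                 # b2 is added once, after the reduction

# Compare against the unsharded FFN on hypothetical random inputs.
rng = np.random.default_rng(0)
d_model, d_ff = 4, 8
x = rng.standard_normal(d_model)
W1, b1 = rng.standard_normal((d_ff, d_model)), rng.standard_normal(d_ff)
W2, b2 = rng.standard_normal((d_model, d_ff)), rng.standard_normal(d_model)

reference = W2 @ gelu(W1 @ x + b1) + b2
sharded = tensor_parallel_linear(x, W1, b1, W2, b2, num_gpus=4)
```

The sharded and unsharded results should agree up to floating-point rounding, since the only difference is the order in which the d_ff terms of the second matrix product are summed.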

NumPy

import numpy as np

def tensor_parallel_linear(x, W1, b1, W2, b2, num_gpus=2):
    pass
