Megatron Sequence Parallelism (Hard)

Implement sequence parallelism from Megatron-LM v3: shard the sequence dimension across ranks for non-tensor-parallel regions (LayerNorm, Dropout), reducing activation memory.

Signature: def sequence_parallel_layernorm(x_shards, gamma, beta)

  • x_shards: list of world_size arrays, each (S//world_size, d) — sequence shards
  • gamma: (d,) — LayerNorm scale
  • beta: (d,) — LayerNorm bias
  • Returns: list of world_size arrays — normalized shards (same structure as input)
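
For concreteness, here is a minimal sketch (not part of the required signature; shapes and values are illustrative) of how the input shards might be produced from a full (S, d) activation:

import numpy as np

S, d, world_size = 8, 4, 2
x = np.random.randn(S, d)                      # full activation over the whole sequence
x_shards = np.split(x, world_size, axis=0)     # world_size arrays, each (S//world_size, d)
gamma, beta = np.ones(d), np.zeros(d)          # LayerNorm scale and bias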

Algorithm

Each rank independently applies LayerNorm to its sequence shard. Since LayerNorm normalizes over the feature dimension (not the sequence dimension), no communication is needed.

import numpy as np

def sequence_parallel_layernorm(x_shards, gamma, beta):
    out_shards = []
    for x_shard in x_shards:                         # each shard: (S//world_size, d)
        mu = x_shard.mean(axis=-1, keepdims=True)    # per-token mean over features
        var = x_shard.var(axis=-1, keepdims=True)    # per-token variance over features
        out_shards.append(gamma * (x_shard - mu) / np.sqrt(var + 1e-5) + beta)
    return out_shards                                # same shard structure as the input
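
As a quick sanity check (a sketch, assuming the function above): because normalization is purely per-token, concatenating the shard outputs should match ordinary LayerNorm applied to the un-sharded sequence.

import numpy as np

rng = np.random.default_rng(0)
x_full = rng.standard_normal((8, 4))                          # full (S, d) activation
shards = np.split(x_full, 2, axis=0)                          # 2 "ranks", 4 tokens each
g, b = np.ones(4), np.zeros(4)

mu = x_full.mean(axis=-1, keepdims=True)
var = x_full.var(axis=-1, keepdims=True)
expected = g * (x_full - mu) / np.sqrt(var + 1e-5) + b        # ordinary (un-sharded) LayerNorm

out = np.concatenate(sequence_parallel_layernorm(shards, g, b), axis=0)
assert np.allclose(out, expected)                             # shard-local result matches, no communication used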

Sequence Parallel + Tensor Parallel Integration

In Megatron's full setup:

  • Tensor parallel regions (attention/FFN): features split across ranks
  • Sequence parallel regions (LayerNorm/Dropout): sequence split across ranks
  • They alternate with all-gather / reduce-scatter to switch modes

This problem focuses on the LayerNorm portion of sequence parallelism.
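
To illustrate the mode switch at the region boundaries, here is a hypothetical single-process sketch (helper names are assumptions, and the collectives are simulated with plain arrays): entering a tensor-parallel region requires an all-gather along the sequence dimension, and leaving it requires a reduce-scatter that both sums the partial outputs and re-shards the sequence.

import numpy as np

def all_gather_sequence(x_shards):
    # SP -> TP boundary: every rank ends up with the full (S, d) activation.
    full = np.concatenate(x_shards, axis=0)
    return [full.copy() for _ in x_shards]

def reduce_scatter_sequence(partial_outputs):
    # TP -> SP boundary: sum the per-rank partial results, then hand each
    # rank back only its own slice of the sequence dimension.
    summed = np.sum(partial_outputs, axis=0)                  # (S, d)
    return list(np.split(summed, len(partial_outputs), axis=0))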

Language

Python (numpy)

Test Results

  • 2 ranks, 2 tokens each, d=4, gamma=ones, beta=zeros
  • custom gamma/beta: gamma scales, beta shifts output
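
A rough reproduction of the second check, assuming the sketch implementation above (the concrete values here are illustrative, not the grader's actual fixtures): with non-trivial gamma and beta, each output shard is the standardized shard scaled by gamma and shifted by beta.

import numpy as np

x_shards = [np.arange(8, dtype=float).reshape(2, 4),          # rank 0: 2 tokens, d=4
            np.arange(8, 16, dtype=float).reshape(2, 4)]      # rank 1: 2 tokens, d=4
gamma, beta = np.full(4, 2.0), np.full(4, 1.0)                # custom scale and shift

base = sequence_parallel_layernorm(x_shards, np.ones(4), np.zeros(4))
out  = sequence_parallel_layernorm(x_shards, gamma, beta)
for b_shard, o_shard in zip(base, out):
    assert np.allclose(o_shard, 2.0 * b_shard + 1.0)          # gamma scales, beta shifts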