TorchedUp

195. Megatron Sequence Parallelism

Hard

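Implement sequence parallelism from Megatron-LM v3: shard the sequence dimension across ranks in the regions that tensor parallelism does not split (LayerNorm, Dropout), so each rank holds activations for only S/world_size tokens in those layers instead of the full sequence.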

Signature: def sequence_parallel_layernorm(x_shards, gamma, beta)

  • x_shards: list of world_size arrays, each (S//world_size, d) — sequence shards
  • gamma: (d,) — LayerNorm scale
  • beta: (d,) — LayerNorm bias
  • Returns: list of world_size arrays — normalized shards (same structure as input)
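
To make the input layout concrete, here is a small sketch of how x_shards could be built from a full (S, d) activation. The helper name make_sequence_shards and the toy sizes (S=8, d=4, world_size=4) are illustrative assumptions, not part of the problem statement.

```python
import numpy as np

def make_sequence_shards(x, world_size):
    """Hypothetical helper: split a full (S, d) array into world_size sequence shards."""
    S = x.shape[0]
    if S % world_size != 0:
        raise ValueError("S must be divisible by world_size")
    # np.split along axis 0 cuts the sequence dimension into equal chunks.
    return np.split(x, world_size, axis=0)

S, d, world_size = 8, 4, 4            # assumed toy sizes
x = np.arange(S * d, dtype=np.float64).reshape(S, d)
x_shards = make_sequence_shards(x, world_size)
gamma = np.ones(d)                    # LayerNorm scale, shape (d,)
beta = np.zeros(d)                    # LayerNorm bias, shape (d,)

shard_shapes = [s.shape for s in x_shards]   # each shard is (S // world_size, d)
elements_per_rank = x_shards[0].size         # S * d / world_size values held per rank
```

With these sizes each rank holds a (2, 4) shard, a quarter of the elements in the full array, which is where the activation-memory saving comes from.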

Algorithm

Each rank independently applies LayerNorm to its sequence shard. Since LayerNorm normalizes over the feature dimension (not sequence), no communication is needed.

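import numpy as np

world_size = len(x_shards)
out_shards = []
for rank in range(world_size):
    x_shard = x_shards[rank]                      # (S//world_size, d)
    mu = x_shard.mean(axis=-1, keepdims=True)     # per-token mean
    var = x_shard.var(axis=-1, keepdims=True)     # per-token (biased) variance
    # Normalize over features, then scale and shift; collect this rank's result
    out_shards.append(gamma * (x_shard - mu) / np.sqrt(var + 1e-5) + beta)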
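
The claim that no communication is needed can be checked numerically: normalizing each shard separately and concatenating the results should match LayerNorm applied to the whole sequence. The following is a minimal sketch under assumptions, not a reference solution. The eps keyword argument, the full_layernorm helper, and the random test sizes are illustrative additions that are not in the problem statement.

```python
import numpy as np

def sequence_parallel_layernorm(x_shards, gamma, beta, eps=1e-5):
    """Apply LayerNorm to each sequence shard independently (no communication)."""
    out_shards = []
    for x_shard in x_shards:
        mu = x_shard.mean(axis=-1, keepdims=True)    # per-token mean over features
        var = x_shard.var(axis=-1, keepdims=True)    # per-token biased variance
        out_shards.append(gamma * (x_shard - mu) / np.sqrt(var + eps) + beta)
    return out_shards

def full_layernorm(x, gamma, beta, eps=1e-5):
    """Hypothetical unsharded baseline used only for comparison."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

rng = np.random.default_rng(0)
S, d, world_size = 8, 16, 4                          # assumed toy sizes
x = rng.standard_normal((S, d))
gamma = rng.standard_normal(d)
beta = rng.standard_normal(d)

sharded = sequence_parallel_layernorm(np.split(x, world_size, axis=0), gamma, beta)
reference = full_layernorm(x, gamma, beta)
matches = np.allclose(np.concatenate(sharded, axis=0), reference)
```

Because every statistic is computed along the last axis, each token's output depends only on that token's own features, so the way the sequence is split does not affect the result.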

Sequence Parallel + Tensor Parallel Integration

In Megatron's full setup:

  • Tensor parallel regions (attention/FFN): features split across ranks
  • Sequence parallel regions (LayerNorm/Dropout): sequence split across ranks
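  • They alternate via collectives along the sequence dimension: in the forward pass, an all-gather when entering a tensor-parallel region and a reduce-scatter when returning to a sequence-parallel region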

This problem focuses on the LayerNorm portion of sequence parallelism.
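
For context, the mode switches listed above can be simulated in a single process with NumPy. This is a rough sketch under assumptions, not Megatron's implementation. The helpers all_gather_sequence and reduce_scatter_sequence are hypothetical stand-ins for the real collectives, the two-layer MLP uses ReLU for simplicity, and all sizes are made up.

```python
import numpy as np

def all_gather_sequence(x_shards):
    """Simulated all-gather: every rank ends up with the full (S, d) sequence."""
    full = np.concatenate(x_shards, axis=0)
    return [full.copy() for _ in x_shards]

def reduce_scatter_sequence(partials):
    """Simulated reduce-scatter: sum the per-rank (S, d) partial outputs,
    then give each rank its own (S // world_size, d) slice of the sum."""
    total = np.sum(partials, axis=0)
    return np.split(total, len(partials), axis=0)

rng = np.random.default_rng(0)
S, d, h, world_size = 8, 4, 12, 4                 # assumed toy sizes
x = rng.standard_normal((S, d))
W1 = rng.standard_normal((d, h))                  # split by columns across ranks
W2 = rng.standard_normal((h, d))                  # split by rows across ranks

x_shards = np.split(x, world_size, axis=0)        # sequence-parallel layout
gathered = all_gather_sequence(x_shards)          # enter the tensor-parallel region
W1_cols = np.split(W1, world_size, axis=1)
W2_rows = np.split(W2, world_size, axis=0)
# Each rank computes a partial (S, d) output from its slice of the weights.
partials = [
    np.maximum(gathered[r] @ W1_cols[r], 0.0) @ W2_rows[r]
    for r in range(world_size)
]
out_shards = reduce_scatter_sequence(partials)    # back to the sequence-parallel layout

reference = np.maximum(x @ W1, 0.0) @ W2          # unsharded baseline
matches = np.allclose(np.concatenate(out_shards, axis=0), reference)
```

The reduce-scatter does the summation that a plain tensor-parallel all-reduce would do, but leaves each rank with only its sequence slice, ready for the next LayerNorm or Dropout.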


NumPy

import numpy as np

def sequence_parallel_layernorm(x_shards, gamma, beta):
    pass
