Implement Expert Parallelism for MoE models: each rank hosts a subset of experts, and tokens are dispatched to the correct rank via all-to-all communication.
Signature: def expert_parallel_dispatch(tokens, routing, expert_Ws, rank, world_size)
Inputs:
tokens: (N, d), all tokens (available on all ranks before dispatch)
routing: (N,), expert index assigned to each token (0 to n_experts-1)
expert_Ws: list of weight matrices for experts on this rank, each (d, d)
rank: this rank's index (0-based)
world_size: total number of ranks (each rank hosts n_experts//world_size experts)
Returns:
(N, d), output tokens for all tokens processed by this rank's experts (others are zero)
Each rank owns a contiguous slice of experts: rank r owns global expert indices [r · experts_per_rank, (r+1) · experts_per_rank) where experts_per_rank = n_experts // world_size. Note that expert_Ws is the local-only list (length experts_per_rank), so when a token is routed to a global expert this rank owns, you must offset the global index into the local list.
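The global-to-local index mapping described above can be sketched as follows. The helper names owning_rank and local_expert_index are hypothetical and not part of the problem's API; the 64-expert, 8-rank numbers are only an illustration.

```python
def owning_rank(global_expert: int, experts_per_rank: int) -> int:
    """Rank whose contiguous slice contains this global expert index."""
    return global_expert // experts_per_rank


def local_expert_index(global_expert: int, rank: int, experts_per_rank: int) -> int:
    """Offset a global expert index into the rank's local expert_Ws list."""
    start = rank * experts_per_rank
    if not (start <= global_expert < start + experts_per_rank):
        raise ValueError(f"expert {global_expert} is not owned by rank {rank}")
    return global_expert - start


# Illustrative numbers: 64 experts over 8 ranks gives 8 experts per rank.
n_experts, world_size = 64, 8
experts_per_rank = n_experts // world_size

# Global expert 19 lies in rank 2's slice [16, 24), at local position 3.
rank_of_19 = owning_rank(19, experts_per_rank)
local_of_19 = local_expert_index(19, rank_of_19, experts_per_rank)
```

Since the function signature does not pass n_experts, experts_per_rank can also be read off as len(expert_Ws).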
Initialize the output to zeros. For every token, look up its routed global expert index; if that index falls in this rank's slice, run the token through the corresponding local expert (a single linear layer) and write the result to that token's row. Tokens routed to experts owned by other ranks contribute nothing to this rank's output (they would be handled by their owning rank in a real system).
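A minimal sketch of the procedure above, not the site's reference solution. It assumes the linear layer is applied as token @ W (the statement does not say whether the weight is applied on the left or the right) and infers experts_per_rank from len(expert_Ws).

```python
import numpy as np


def expert_parallel_dispatch(tokens, routing, expert_Ws, rank, world_size):
    tokens = np.asarray(tokens, dtype=float)
    routing = np.asarray(routing)
    if not 0 <= rank < world_size:
        raise ValueError("rank must be in [0, world_size)")

    experts_per_rank = len(expert_Ws)   # expert_Ws holds only local experts
    start = rank * experts_per_rank     # first global expert index owned here

    out = np.zeros_like(tokens)         # rows for non-local tokens stay zero
    for i, global_expert in enumerate(routing):
        local = int(global_expert) - start
        if 0 <= local < experts_per_rank:
            # Assumption: the expert computes y = x @ W.
            out[i] = tokens[i] @ expert_Ws[local]
    return out


# Usage: 4 experts over 2 ranks; rank 1 hosts global experts 2 and 3.
tokens = np.ones((4, 2))
routing = np.array([0, 2, 3, 1])
local_Ws = [2.0 * np.eye(2), 3.0 * np.eye(2)]
out = expert_parallel_dispatch(tokens, routing, local_Ws, rank=1, world_size=2)
# Rows 1 and 2 are processed by local experts 0 and 1; rows 0 and 3 stay zero.
```

A per-token loop mirrors the description directly; grouping tokens by expert with a boolean mask would do the same work with one matrix multiply per expert.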
In a real system, an all-to-all collective sends each token to the rank that owns its expert. Here we simulate that step: every rank already holds all tokens and processes only those assigned to its own experts.
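One way to check the simulation is to run every rank in a single process and sum the per-rank outputs: each token is handled by exactly one rank, so the sum should match a dense per-token computation. The sketch below uses a hypothetical vectorised helper rank_output and random data, and assumes the expert is applied as tokens @ W.

```python
import numpy as np


def rank_output(tokens, routing, expert_Ws, rank):
    """Hypothetical helper: output of one rank, grouping tokens by expert."""
    per_rank = len(expert_Ws)
    out = np.zeros_like(tokens)
    for local, W in enumerate(expert_Ws):
        mask = routing == rank * per_rank + local  # tokens for this expert
        out[mask] = tokens[mask] @ W               # assumption: y = x @ W
    return out


rng = np.random.default_rng(0)
N, d, n_experts, world_size = 16, 4, 8, 4
per_rank = n_experts // world_size

tokens = rng.normal(size=(N, d))
routing = rng.integers(0, n_experts, size=N)
all_Ws = [rng.normal(size=(d, d)) for _ in range(n_experts)]

# Each rank sees only its contiguous slice of the expert weights.
partials = [
    rank_output(tokens, routing, all_Ws[r * per_rank:(r + 1) * per_rank], r)
    for r in range(world_size)
]
combined = np.sum(partials, axis=0)

# Dense reference: every token through its routed expert, no sharding.
reference = np.stack([tokens[i] @ all_Ws[routing[i]] for i in range(N)])
matches = bool(np.allclose(combined, reference))
```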
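With 64 experts spread across 8 GPUs (8 experts per GPU), each GPU stores only 1/8 of the expert parameters. Communication is confined to the MoE layer: an all-to-all sends tokens to the ranks that own their experts, and a full implementation uses a second all-to-all to return the results. This lets the total number of experts grow with the number of GPUs while per-GPU expert memory stays fixed.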
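The memory arithmetic can be written out as a quick check. The hidden size d = 1024 is an illustrative assumption; the expert shape (d, d) follows the problem statement.

```python
n_experts, world_size = 64, 8
d = 1024  # illustrative hidden size, not given in the problem

experts_per_rank = n_experts // world_size           # 8 experts per GPU
params_per_expert = d * d                            # one (d, d) matrix
total_params = n_experts * params_per_expert         # all experts together
params_per_gpu = experts_per_rank * params_per_expert
fraction_per_gpu = params_per_gpu / total_params     # 1/8, independent of d
```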
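import numpy as np

def expert_parallel_dispatch(tokens, routing, expert_Ws, rank, world_size):
    # Starter stub: return an (N, d) array with this rank's expert outputs.
    pass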