Implement MoE routing with capacity factor: each expert can process at most capacity tokens, and overflow tokens are dropped (zero output).
Signature: def moe_capacity_routing(gate_probs, top_k, capacity_factor)
gate_probs: (N, n_experts) array of softmax gating probabilities
top_k: number of experts per token
capacity_factor: float, multiplier on the ideal per-expert load
Returns: (N, n_experts) binary assignment matrix (1 = token processed by expert)
import math
import numpy as np

def moe_capacity_routing(gate_probs, top_k, capacity_factor):
    N, n_experts = gate_probs.shape
    # Per-expert budget: ceil of capacity_factor times the ideal load N / n_experts.
    capacity = math.ceil(capacity_factor * N / n_experts)

    assignment = np.zeros((N, n_experts), dtype=np.int64)
    expert_counts = np.zeros(n_experts, dtype=np.int64)

    # For each token, try to assign it to its top-k experts, highest probability first.
    # If an expert is full (count >= capacity), the token is dropped for that expert.
    for i in range(N):
        preferred = np.argsort(gate_probs[i])[-top_k:]  # top-k, highest prob last
        for e in preferred[::-1]:                       # so visit highest prob first
            if expert_counts[e] < capacity:
                assignment[i, e] = 1
                expert_counts[e] += 1
    return assignment
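A quick sanity check of the routing behavior, using the function above on a hypothetical 4-token, 2-expert batch with top_k=1 and capacity_factor=1.0, so capacity = ceil(1.0 * 4 / 2) = 2 (the probability values are made up for illustration):

import numpy as np

gate_probs = np.array([
    [0.9, 0.1],   # prefers expert 0
    [0.8, 0.2],   # prefers expert 0 -> expert 0 now at capacity (2)
    [0.7, 0.3],   # prefers expert 0, but it is full; with top_k=1 there is no fallback
    [0.4, 0.6],   # prefers expert 1
])
print(moe_capacity_routing(gate_probs, top_k=1, capacity_factor=1.0))
# [[1 0]
#  [1 0]
#  [0 0]   <- token 2 is dropped entirely (zero output)
#  [0 1]]

Note that token 2 is dropped even though expert 1 has free slots: with top_k=1 its only preferred expert is full, which is exactly the imbalance cost the capacity factor trades against.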
capacity_factor = 1.0: each expert processes exactly N/E tokens on average. Tight; many tokens are dropped under imbalance.
capacity_factor = 1.25: 25% slack. Common in Mixtral/Switch Transformer.
capacity_factor = 2.0: generous; rarely drops tokens but wastes compute.

Token dropping introduces bias during training (dropped tokens get zero gradient for that expert) but enables the fixed-size expert batches needed for efficient GPU kernels.
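To make the slack concrete, a small sketch of how per-expert capacity scales with the factor, assuming a hypothetical batch of N = 1024 tokens and E = 8 experts:

import math

N, E = 1024, 8  # hypothetical batch size and expert count, for illustration only
for cf in (1.0, 1.25, 2.0):
    print(f"capacity_factor={cf}: capacity={math.ceil(cf * N / E)}")
# capacity_factor=1.0:  capacity=128  (exactly N/E)
# capacity_factor=1.25: capacity=160  (32 slots of slack per expert)
# capacity_factor=2.0:  capacity=256  (double the average load)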