Implement MoE routing with capacity factor: each expert can process at most capacity tokens, and overflow tokens are dropped (zero output).
Signature: def moe_capacity_routing(gate_probs, top_k, capacity_factor)
gate_probs: (N, n_experts) array of softmax gating probabilities
top_k: number of experts per token
capacity_factor: float, multiplier on the ideal per-expert load
Returns: (N, n_experts) binary assignment matrix (1 = token processed by expert)
import math
import numpy as np

def moe_capacity_routing(gate_probs, top_k, capacity_factor):
    N, n_experts = gate_probs.shape
    # Per-expert budget: ceil of capacity_factor times the ideal load N / n_experts.
    capacity = math.ceil(capacity_factor * N / n_experts)

    assignment = np.zeros((N, n_experts), dtype=np.int64)
    expert_counts = np.zeros(n_experts, dtype=np.int64)

    # For each token, try to assign it to its top-k experts, highest probability first.
    # If an expert is full (count >= capacity), the token is dropped for that expert.
    for i in range(N):
        preferred = np.argsort(gate_probs[i])[-top_k:]  # top-k, highest prob last
        for e in preferred[::-1]:                       # so visit highest prob first
            if expert_counts[e] < capacity:
                assignment[i, e] = 1
                expert_counts[e] += 1
    return assignment
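A quick sanity check of the routing behavior, using the function above on a hypothetical 4-token, 2-expert batch with top_k=1 and capacity_factor=1.0, so capacity = ceil(1.0 * 4 / 2) = 2 (the probability values are made up for illustration):

import numpy as np

gate_probs = np.array([
    [0.9, 0.1],   # prefers expert 0
    [0.8, 0.2],   # prefers expert 0 -> expert 0 now at capacity (2)
    [0.7, 0.3],   # prefers expert 0, but it is full; with top_k=1 there is no fallback
    [0.4, 0.6],   # prefers expert 1
])
print(moe_capacity_routing(gate_probs, top_k=1, capacity_factor=1.0))
# [[1 0]
#  [1 0]
#  [0 0]   <- token 2 is dropped entirely (zero output)
#  [0 1]]

Note that token 2 is dropped even though expert 1 has free slots: with top_k=1 its only preferred expert is full, which is exactly the imbalance cost the capacity factor trades against.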
capacity_factor = 1.0: each expert processes exactly N/E tokens on average. Tight; many tokens are dropped under imbalance.
capacity_factor = 1.25: 25% slack. Common in Mixtral/Switch Transformer.
capacity_factor = 2.0: generous; rarely drops tokens but wastes compute.

Token dropping introduces bias during training (dropped tokens get zero gradient for that expert) but enables the fixed-size expert batches needed for efficient GPU kernels.
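To make the slack concrete, a small sketch of how per-expert capacity scales with the factor, assuming a hypothetical batch of N = 1024 tokens and E = 8 experts:

import math

N, E = 1024, 8  # hypothetical batch size and expert count, for illustration only
for cf in (1.0, 1.25, 2.0):
    print(f"capacity_factor={cf}: capacity={math.ceil(cf * N / E)}")
# capacity_factor=1.0:  capacity=128  (exactly N/E)
# capacity_factor=1.25: capacity=160  (32 slots of slack per expert)
# capacity_factor=2.0:  capacity=256  (double the average load)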