MoE with Capacity Factor (Hard)

Implement MoE routing with a capacity factor: each expert can process at most capacity tokens, and overflow tokens are dropped (they receive zero output from that expert).

Signature: def moe_capacity_routing(gate_probs, top_k, capacity_factor)

  • gate_probs: (N, n_experts) — softmax gating probabilities
  • top_k: number of experts per token
  • capacity_factor: float, multiplier for ideal load
  • Returns: (N, n_experts) binary assignment matrix (1 = token processed by expert)

Algorithm

import math
import numpy as np

def moe_capacity_routing(gate_probs, top_k, capacity_factor):
    N, n_experts = gate_probs.shape
    capacity = math.ceil(capacity_factor * N / n_experts)

    # For each token, try to assign to its top-k experts (highest prob first).
    # If an expert is full (count >= capacity), skip it: the token is dropped for that expert.
    assignment = np.zeros((N, n_experts))
    expert_counts = np.zeros(n_experts, dtype=int)

    for i in range(N):
        preferred_experts = np.argsort(gate_probs[i])[-top_k:]  # top-k, highest prob last
        for e in reversed(preferred_experts):                   # process highest prob first
            if expert_counts[e] < capacity:
                assignment[i, e] = 1
                expert_counts[e] += 1

    return assignment

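A quick sanity check of the routine above, using a hand-made gating matrix (the values below are illustrative, not taken from the test cases):

import numpy as np

# 4 tokens, 2 experts; gating values chosen only for this demo
gate_probs = np.array([
    [0.9, 0.1],
    [0.8, 0.2],
    [0.7, 0.3],
    [0.2, 0.8],
])

assignment = moe_capacity_routing(gate_probs, top_k=1, capacity_factor=1.0)
# capacity = ceil(1.0 * 4 / 2) = 2: tokens 0 and 1 fill expert 0; token 2 also
# prefers expert 0 but it is full, and with top_k=1 it has no fallback, so its
# row stays all zeros (dropped); token 3 goes to expert 1.
print(assignment)
assert (assignment.sum(axis=0) <= 2).all()  # no expert exceeds its capacity
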
Capacity Factor Tradeoffs

  • capacity_factor = 1.0: capacity equals the average load of N/E tokens per expert. Tight: many tokens are dropped whenever routing is imbalanced.
  • capacity_factor = 1.25: 25% slack. A common setting in Switch Transformer-style routing.
  • capacity_factor = 2.0: generous. Tokens are rarely dropped, but the padded expert buffers waste compute.

Token dropping introduces bias during training (a dropped token receives no gradient through that expert), but it allows fixed-size expert buffers, which efficient GPU kernels rely on.
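
As a rough illustration of these tradeoffs, the sketch below (synthetic, deliberately imbalanced gating values) counts how many tokens end up with no expert at all for a few capacity factors:

import numpy as np

# 6 tokens, 2 experts; most tokens prefer expert 0 (synthetic imbalance)
gate_probs = np.array([
    [0.95, 0.05],
    [0.90, 0.10],
    [0.85, 0.15],
    [0.80, 0.20],
    [0.75, 0.25],
    [0.10, 0.90],
])

for cf in (1.0, 1.25, 2.0):
    assignment = moe_capacity_routing(gate_probs, top_k=1, capacity_factor=cf)
    dropped = int((assignment.sum(axis=1) == 0).sum())  # tokens routed to no expert
    print(f"capacity_factor={cf}: dropped {dropped} of 6 tokens")
# Capacities work out to 3, 4, and 6, so the overloaded expert 0 drops 2, 1, and 0 tokens.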

Language: Python (numpy)

Test Results

  • seed=42 probs, N=4, 2 experts, top_k=1, capacity_factor=1.0 (capacity=2)
  • capacity_factor=0.5: capacity=1, third token to expert 0 is dropped
  • top_k=2, capacity_factor=2.0: each token gets 2 experts (capacity=2)
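
The seeded probabilities behind these cases aren't reproduced here, so the sketch below only mimics the second scenario with hand-crafted values; it is an assumption about the setup, not the actual test data:

import numpy as np

# Hypothetical stand-in for the seeded probs: N=4 tokens, 2 experts
gate_probs = np.array([
    [0.7, 0.3],   # -> expert 0 (kept; fills its capacity of 1)
    [0.4, 0.6],   # -> expert 1 (kept)
    [0.8, 0.2],   # -> expert 0 (already full, so this token is dropped)
    [0.1, 0.9],   # -> expert 1 (already full, dropped)
])

# capacity_factor=0.5 with N=4, n_experts=2 gives capacity = ceil(0.5 * 4 / 2) = 1
assignment = moe_capacity_routing(gate_probs, top_k=1, capacity_factor=0.5)
print(assignment)   # rows 2 and 3 are all zeros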