
MoE Expert Routing (Medium)

Mixture of Experts (MoE) enables sparse computation: instead of sending every token through one large dense FFN, each token is routed to only the top-K most relevant of several expert networks. This scales model capacity without proportionally scaling compute: with 8 experts and top-2 routing, for example, the layer holds 8x the FFN parameters while each token pays only 2x the per-expert FFN cost.

Signature: def moe_routing(x, W_gate, top_k)

  • x: numpy array of shape (num_tokens, d_model)
  • W_gate: numpy array of shape (num_experts, d_model) — gating weight matrix
  • top_k: int — number of experts to select per token
  • Returns: tuple (expert_indices, expert_weights) both of shape (num_tokens, top_k)
    • expert_indices: int array of selected expert indices (sorted descending by weight)
    • expert_weights: float array of routing weights, normalized to sum to 1 per token
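
To make the return contract concrete, here is a single-token walkthrough with made-up gate logits (the numbers are illustrative, not taken from any test case): softmax over all four experts, keep the two largest weights, then renormalize them.

  import numpy as np

  # Hypothetical gate logits for one token across 4 experts
  logits = np.array([2.0, 1.0, 0.5, -1.0])

  # Softmax over the expert dimension (max-shifted for stability)
  probs = np.exp(logits - logits.max())
  probs /= probs.sum()                       # [0.609, 0.224, 0.136, 0.030]

  # Top-2 indices, sorted descending by weight
  top2 = np.argsort(probs)[::-1][:2]         # [0, 1]

  # Renormalize the selected weights to sum to 1
  weights = probs[top2] / probs[top2].sum()  # [0.731, 0.269]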

Algorithm:

  1. Compute gate logits: gate_logits = x @ W_gate.T
  2. Apply softmax over the expert dimension
  3. Select the top-K experts per token
  4. Renormalize the selected weights to sum to 1
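
A minimal numpy sketch of these four steps (one reasonable implementation of the spec above, with ties broken by argsort order):

  import numpy as np

  def moe_routing(x, W_gate, top_k):
      # 1. Gate logits: (num_tokens, num_experts)
      gate_logits = x @ W_gate.T

      # 2. Softmax over the expert dimension, max-shifted for stability
      shifted = gate_logits - gate_logits.max(axis=-1, keepdims=True)
      probs = np.exp(shifted)
      probs /= probs.sum(axis=-1, keepdims=True)

      # 3. Top-K experts per token, sorted descending by weight
      expert_indices = np.argsort(probs, axis=-1)[:, ::-1][:, :top_k]
      expert_weights = np.take_along_axis(probs, expert_indices, axis=-1)

      # 4. Renormalize the selected weights to sum to 1 per token
      expert_weights /= expert_weights.sum(axis=-1, keepdims=True)

      return expert_indices, expert_weights

Note that np.argsort performs a full sort per token; np.argpartition would be cheaper for large expert counts, but a full sort keeps the required descending order trivial.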

Test Results

  • 3 tokens, 4 experts, top-2
  • top-1 greedy routing — weights all 1.0
  • Uniform input — equal weights per selected expert
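
The top-1 case is a quick sanity check on the renormalization step: whatever the gate produces, a single selected weight always renormalizes to exactly 1.0. A small demo using the sketch above, with made-up shapes and random inputs (d_model = 8 is an arbitrary choice):

  import numpy as np

  rng = np.random.default_rng(0)
  x = rng.normal(size=(3, 8))        # 3 tokens, d_model = 8
  W_gate = rng.normal(size=(4, 8))   # 4 experts

  idx, w = moe_routing(x, W_gate, top_k=1)
  assert w.shape == (3, 1) and np.allclose(w, 1.0)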