TorchedUp

31. MoE Expert Routing

Medium

Mixture of Experts (MoE) enables sparse computation: instead of passing every token through every expert feed-forward network (FFN), a gating function routes each token to only its top-K highest-scoring experts. This grows the model's parameter count without a proportional increase in per-token compute.

Signature: def moe_routing(x, W_gate, top_k)

  • x: numpy array of shape (num_tokens, d_model)
  • W_gate: numpy array of shape (num_experts, d_model) — gating weight matrix
  • top_k: int — number of experts to select per token
  • Returns: tuple (expert_indices, expert_weights) both of shape (num_tokens, top_k)
    • expert_indices: int array of selected expert indices (sorted descending by weight)
    • expert_weights: float array of routing weights, normalized to sum to 1 per token
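
To make the role of the two return values concrete, here is a minimal sketch of how a downstream layer might consume them. The helper name `combine_expert_outputs` and the `experts` list of callables are hypothetical and not part of the problem statement; the sketch only illustrates that each token runs through its top_k selected experts and nothing else.

```python
import numpy as np

def combine_expert_outputs(x, expert_indices, expert_weights, experts):
    """Weighted sum of the outputs of each token's selected experts.

    `experts` is a hypothetical list of callables, one per expert, each
    mapping a (d_model,) vector to a (d_model,) vector.
    """
    out = np.zeros_like(x, dtype=float)
    for t in range(x.shape[0]):
        for idx, w in zip(expert_indices[t], expert_weights[t]):
            # Only the top_k selected experts run for this token.
            out[t] += w * experts[idx](x[t])
    return out

# Toy experts that just scale their input by 1, 2, and 3 (illustration only).
experts = [lambda v, s=s: s * v for s in (1.0, 2.0, 3.0)]
x = np.ones((1, 2))
y = combine_expert_outputs(x, np.array([[2, 0]]), np.array([[0.75, 0.25]]), experts)
```

With these toy values the single token is sent to experts 2 and 0 only, and expert 1 is never evaluated.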

Algorithm:

  1. Project each token through W_gate to produce one logit per expert.
  2. Apply a (numerically stable) softmax over the expert dimension to get a per-token distribution.
  3. For each token, pick the top-K experts by probability and emit them sorted descending by weight.
  4. Renormalize the K selected weights so they sum to 1 per token, so that the MoE layer can combine the selected experts' outputs as a weighted average.
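
The four steps above can be sketched in NumPy as follows. This is one possible implementation under stated assumptions, not the site's reference solution: the function name `moe_routing_sketch` is chosen here for illustration, and ties between equal probabilities are broken in favor of the lower expert index, which the problem statement does not specify.

```python
import numpy as np

def moe_routing_sketch(x, W_gate, top_k):
    # Step 1: one logit per (token, expert). W_gate is (num_experts, d_model).
    logits = x @ W_gate.T
    # Step 2: numerically stable softmax over the expert dimension.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    probs = exp / exp.sum(axis=-1, keepdims=True)
    # Step 3: indices of the top_k experts, sorted descending by probability.
    # A stable sort on the negated values keeps the lower index first on ties
    # (an assumption, see the lead-in).
    order = np.argsort(-probs, axis=-1, kind="stable")
    expert_indices = order[:, :top_k]
    top_probs = np.take_along_axis(probs, expert_indices, axis=-1)
    # Step 4: renormalize the selected weights to sum to 1 per token.
    expert_weights = top_probs / top_probs.sum(axis=-1, keepdims=True)
    return expert_indices, expert_weights

# Two tokens, three experts, top-2 routing.
x = np.array([[1.0, 0.0], [0.0, 1.0]])
W_gate = np.array([[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]])
idx, w = moe_routing_sketch(x, W_gate, top_k=2)
```

Here the first token's logits are (2, 0, 1), so it is routed to experts 0 and 2; the second token's logits are (0, 2, 1), so it is routed to experts 1 and 2.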

Math

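$$g_i = \operatorname{softmax}(x W_g^{\top})_i$$

$$\operatorname{TopK}(g) = \{\, i : g_i \ge k\text{-th largest}(g) \,\}$$

$$\tilde{g}_i = \frac{g_i}{\sum_{j \in \operatorname{TopK}(g)} g_j} \cdot \mathbb{1}[i \in \operatorname{TopK}(g)]$$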
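
As a small worked instance of the formulas above, the following sketch applies them to a single token whose softmax output is assumed to be g = (0.5, 0.3, 0.2) with K = 2. The numbers are made up for illustration. Note that the set definition uses a threshold, so exact ties at the k-th largest value could select more than K experts; this example has no ties.

```python
import numpy as np

g = np.array([0.5, 0.3, 0.2])   # assumed softmax output for one token
k = 2

threshold = np.sort(g)[-k]       # k-th largest value of g (0.3 here)
mask = g >= threshold            # indicator 1[i in TopK(g)]
# Renormalize selected entries; everything outside TopK becomes 0.
g_tilde = np.where(mask, g, 0.0) / g[mask].sum()
```

The selected probabilities 0.5 and 0.3 sum to 0.8, so the renormalized weights are 0.5 / 0.8 = 0.625 and 0.3 / 0.8 = 0.375, and the third expert gets weight 0.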

NumPy

import numpy as np

def moe_routing(x, W_gate, top_k):
    # Return (expert_indices, expert_weights), each of shape (num_tokens, top_k).
    pass
