TorchedUp

91. MoE Forward Pass (Sparse Dispatch)

Hard

Implement the forward pass of a Mixture of Experts (MoE) layer of the kind used in Mixtral and DeepSeek-V3 (and reportedly GPT-4). Each token is routed to its top-k experts, and their outputs are combined as a weighted sum.

Signature: def moe_forward(x, gate_W, expert_Ws, top_k=2)

  • x: (N, d) — N tokens, each of dimension d
  • gate_W: (n_experts, d) — gating network weights
  • expert_Ws: list of n_experts weight matrices, each (d, d) — each expert is a single linear layer
  • top_k: number of experts per token
  • Returns: (N, d) — output tokens
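To make the shape conventions concrete, the sketch below builds random inputs with the shapes listed above. The helper name `make_moe_inputs` and the sizes (N=4, d=8, n_experts=3) are illustrative assumptions, not part of the problem.

```python
import numpy as np

def make_moe_inputs(N=4, d=8, n_experts=3, seed=0):
    """Hypothetical helper: random inputs matching the documented shapes."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((N, d))               # (N, d) tokens
    gate_W = rng.standard_normal((n_experts, d))  # (n_experts, d) gating weights
    # One (d, d) weight matrix per expert.
    expert_Ws = [rng.standard_normal((d, d)) for _ in range(n_experts)]
    return x, gate_W, expert_Ws

x, gate_W, expert_Ws = make_moe_inputs()
print(x.shape, gate_W.shape, len(expert_Ws), expert_Ws[0].shape)
# prints: (4, 8) (3, 8) 3 (8, 8)
```

Because gate_W is stored as (n_experts, d), the gate logits come from `x @ gate_W.T`, which has shape (N, n_experts).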

Algorithm

The MoE layer consists of four conceptual steps:

  1. Gate scoring. Project each token through gate_W (logits = x @ gate_W.T, shape (N, n_experts)) and apply a softmax over the expert axis to get a per-token distribution over n_experts.
  2. Top-k selection. For each token, pick the top_k experts with highest gate probability.
  3. Renormalize. Divide the selected probabilities by their sum so the chosen top_k weights sum to 1 per token.
  4. Sparse dispatch + weighted combine. Run each chosen expert on its assigned tokens (each expert here is a single linear layer — see below) and combine the per-expert outputs using the renormalized weights to produce the per-token output.
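The four steps above can be sketched in NumPy as follows. This is one possible implementation under stated assumptions, not the site's reference solution: ties in the gate probabilities are broken arbitrarily, the output is accumulated in float64, and the loop over experts is chosen for readability rather than speed.

```python
import numpy as np

def moe_forward(x, gate_W, expert_Ws, top_k=2):
    """Sketch of a sparse MoE forward pass with single-linear-layer experts."""
    N, d = x.shape
    n_experts = gate_W.shape[0]

    # 1. Gate scoring: logits are (N, n_experts); subtract the row max
    #    before exponentiating for a numerically stable softmax.
    logits = x @ gate_W.T
    logits = logits - logits.max(axis=1, keepdims=True)
    probs = np.exp(logits)
    probs = probs / probs.sum(axis=1, keepdims=True)

    # 2. Top-k selection: argpartition places the top_k largest probabilities
    #    in the first top_k columns (unsorted, which is fine because the
    #    weighted sum does not depend on order).
    top_idx = np.argpartition(-probs, top_k - 1, axis=1)[:, :top_k]  # (N, top_k)
    top_p = np.take_along_axis(probs, top_idx, axis=1)               # (N, top_k)

    # 3. Renormalize so the selected weights sum to 1 for each token.
    weights = top_p / top_p.sum(axis=1, keepdims=True)

    # 4. Sparse dispatch + weighted combine: each expert only processes
    #    the tokens that selected it.
    out = np.zeros((N, d))
    for e in range(n_experts):
        token_ids, slot = np.nonzero(top_idx == e)
        if token_ids.size == 0:
            continue  # no token was routed to this expert
        expert_out = x[token_ids] @ expert_Ws[e].T  # (n_assigned, d)
        # A token picks a given expert at most once, so token_ids has no
        # duplicates and the in-place fancy-indexed add is safe.
        out[token_ids] += weights[token_ids, slot][:, None] * expert_out
    return out
```

With top_k=1 the renormalized weight is exactly 1, so each token's output is simply the output of its highest-probability expert.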

Expert Linear Layer

Each expert in this problem is a simple linear transformation with no bias term: expert_out = x[i] @ expert_W.T, where expert_W is that expert's entry in expert_Ws.
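As a small illustration of the expert formula, the sketch below checks that applying it token by token gives the same result as one batched matrix multiply over all tokens routed to an expert. The sizes and the `token_ids` array are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((5, 3))         # 5 tokens of dimension 3 (illustrative sizes)
expert_W = rng.standard_normal((3, 3))  # one expert's (d, d) weight matrix
token_ids = np.array([0, 2, 4])         # hypothetical tokens routed to this expert

# Per-token form, as written in the problem statement.
per_token = np.stack([x[i] @ expert_W.T for i in token_ids])

# Batched form: a single matmul over every token assigned to the expert.
batched = x[token_ids] @ expert_W.T     # one output row per routed token

print(np.allclose(per_token, batched))  # prints: True
```

The batched form is what makes sparse dispatch worthwhile: each expert does one matrix multiply over only the tokens assigned to it.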

Real MoE experts (for example, in Mixtral) are feed-forward networks (FFNs) rather than a single linear layer, but the dispatch/routing logic is the same.


NumPy

import numpy as np

def moe_forward(x, gate_W, expert_Ws, top_k=2):
    pass
