Implement the forward pass of a Mixture of Experts (MoE) layer, as used in Mixtral and DeepSeek-V3 (and reportedly GPT-4). Each token is routed to its top-k experts, and their outputs are combined in a weighted sum.
Signature: def moe_forward(x, gate_W, expert_Ws, top_k=2)
Inputs:
x: (N, d), N tokens, each of dimension d
gate_W: (n_experts, d), gating network weights
expert_Ws: list of n_experts weight matrices, each (d, d); each expert is a single linear layer
top_k: number of experts per token
Returns:
(N, d), output tokens
The MoE layer is four conceptual steps:
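1. Gate: multiply x by gate_W and softmax over the expert axis to get a per-token distribution over n_experts.
2. Select: for each token, keep the top_k experts with highest gate probability.
3. Renormalize: rescale the selected gate probabilities so the top_k weights sum to 1 per token.
4. Combine: run each token through its selected experts and sum their outputs, weighted by the renormalized gate weights.
Each expert in this problem is a simple linear transformation: expert_out = x[i] @ expert_W.T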
Real MoE experts (e.g. in Mixtral) are feed-forward networks (MLPs) rather than single linear layers, but the dispatch/routing logic is identical.
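The four steps can be sketched in NumPy as follows. This is one possible approach, not the site's reference solution; the function name moe_forward_sketch and the per-expert loop are illustrative choices, and tie-breaking among equal gate probabilities is left to np.argsort.

```python
import numpy as np

def moe_forward_sketch(x, gate_W, expert_Ws, top_k=2):
    """Illustrative MoE forward pass; name and structure are assumptions."""
    N, d = x.shape

    # Step 1: gate logits (N, n_experts), then a numerically stable softmax
    # over the expert axis.
    logits = x @ gate_W.T
    logits = logits - logits.max(axis=1, keepdims=True)
    probs = np.exp(logits)
    probs = probs / probs.sum(axis=1, keepdims=True)

    # Step 2: indices of the top_k experts per token (descending probability).
    top_idx = np.argsort(-probs, axis=1)[:, :top_k]
    top_p = np.take_along_axis(probs, top_idx, axis=1)

    # Step 3: renormalize so the selected weights sum to 1 per token.
    top_w = top_p / top_p.sum(axis=1, keepdims=True)

    # Step 4: dispatch tokens to each expert and accumulate weighted outputs.
    out = np.zeros((N, d))
    for e, W in enumerate(expert_Ws):
        # Rows are tokens that selected expert e; slot is its rank in top_k.
        token_ids, slot = np.nonzero(top_idx == e)
        if token_ids.size == 0:
            continue
        expert_out = x[token_ids] @ W.T
        out[token_ids] += top_w[token_ids, slot][:, None] * expert_out
    return out
```

Looping over experts rather than tokens means each expert performs a single batched matrix multiply on the tokens routed to it, which mirrors how dispatch is usually organized in practice. Because a token selects any given expert at most once, the indexed in-place addition never hits the same row twice within one iteration.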
import numpy as np

def moe_forward(x, gate_W, expert_Ws, top_k=2):
    # Your implementation goes here.
    pass