Mixture of Experts (MoE) enables sparse computation: instead of passing every token through every FFN layer, each token is routed to the top-K most relevant expert networks. This scales model capacity without proportionally scaling compute.
Signature: def moe_routing(x, W_gate, top_k)
Parameters:
- x: numpy array of shape (num_tokens, d_model)
- W_gate: numpy array of shape (num_experts, d_model), the gating weight matrix
- top_k: int, number of experts to select per token

Returns: (expert_indices, expert_weights), both of shape (num_tokens, top_k)
- expert_indices: int array of selected expert indices (sorted descending by weight)
- expert_weights: float array of routing weights, normalized to sum to 1 per token

Algorithm:
1. Compute the gating logits: gate_logits = x @ W_gate.T, giving each token's affinity for every expert.
2. Select the top_k experts per token by logit.
3. Normalize the selected logits (e.g., with a softmax) so each token's weights sum to 1.
4. Sort each token's selected experts in descending order of weight.
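Below is a minimal NumPy sketch that follows the signature and return spec above. It assumes the weights are produced by a softmax over only the selected top_k logits; that is one common choice and satisfies the "sum to 1 per token" requirement, but renormalizing a full softmax over all experts would also work.

```python
import numpy as np

def moe_routing(x, W_gate, top_k):
    """Route each token to its top_k experts.

    Returns (expert_indices, expert_weights), both of shape
    (num_tokens, top_k), sorted per token in descending order of weight.
    """
    # (num_tokens, num_experts): affinity of every token for every expert
    gate_logits = x @ W_gate.T

    # Indices of the top_k highest-scoring experts per token (unordered)
    top_idx = np.argpartition(gate_logits, -top_k, axis=-1)[:, -top_k:]
    top_logits = np.take_along_axis(gate_logits, top_idx, axis=-1)

    # Sort the selected experts by logit, descending (softmax is monotonic,
    # so this is also descending order of weight)
    order = np.argsort(-top_logits, axis=-1)
    top_idx = np.take_along_axis(top_idx, order, axis=-1)
    top_logits = np.take_along_axis(top_logits, order, axis=-1)

    # Softmax over the selected logits only, so weights sum to 1 per token
    top_logits = top_logits - top_logits.max(axis=-1, keepdims=True)  # stability
    exp_logits = np.exp(top_logits)
    expert_weights = exp_logits / exp_logits.sum(axis=-1, keepdims=True)

    return top_idx, expert_weights
```

A quick check of the shapes and normalization (hypothetical sizes, not from the problem statement):

```python
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 16))       # 4 tokens, d_model = 16
W_gate = rng.standard_normal((8, 16))  # 8 experts
idx, w = moe_routing(x, W_gate, top_k=2)
print(idx.shape, w.shape)              # (4, 2) (4, 2)
print(w.sum(axis=-1))                  # all ~1.0
```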