Mixture of Experts (MoE) enables sparse computation: instead of passing every token through every FFN layer, each token is routed to the top-K most relevant expert networks. This scales model capacity without proportionally scaling compute.
Signature: def moe_routing(x, W_gate, top_k)
Parameters:
- x: numpy array of shape (num_tokens, d_model)
- W_gate: numpy array of shape (num_experts, d_model), the gating weight matrix
- top_k: int, number of experts to select per token

Returns: (expert_indices, expert_weights), both of shape (num_tokens, top_k)
- expert_indices: int array of selected expert indices (sorted descending by weight)
- expert_weights: float array of routing weights, normalized to sum to 1 per token

Algorithm:
1. Compute the gating logits: gate_logits = x @ W_gate.T, giving each token's affinity for every expert.
2. Select the top_k experts per token by logit.
3. Normalize the selected logits (e.g., with a softmax) so each token's weights sum to 1.
4. Sort each token's selected experts in descending order of weight.
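Below is a minimal NumPy sketch that follows the signature and return spec above. It assumes the weights are produced by a softmax over only the selected top_k logits; that is one common choice and satisfies the "sum to 1 per token" requirement, but renormalizing a full softmax over all experts would also work.

```python
import numpy as np

def moe_routing(x, W_gate, top_k):
    """Route each token to its top_k experts.

    Returns (expert_indices, expert_weights), both of shape
    (num_tokens, top_k), sorted per token in descending order of weight.
    """
    # (num_tokens, num_experts): affinity of every token for every expert
    gate_logits = x @ W_gate.T

    # Indices of the top_k highest-scoring experts per token (unordered)
    top_idx = np.argpartition(gate_logits, -top_k, axis=-1)[:, -top_k:]
    top_logits = np.take_along_axis(gate_logits, top_idx, axis=-1)

    # Sort the selected experts by logit, descending (softmax is monotonic,
    # so this is also descending order of weight)
    order = np.argsort(-top_logits, axis=-1)
    top_idx = np.take_along_axis(top_idx, order, axis=-1)
    top_logits = np.take_along_axis(top_logits, order, axis=-1)

    # Softmax over the selected logits only, so weights sum to 1 per token
    top_logits = top_logits - top_logits.max(axis=-1, keepdims=True)  # stability
    exp_logits = np.exp(top_logits)
    expert_weights = exp_logits / exp_logits.sum(axis=-1, keepdims=True)

    return top_idx, expert_weights
```

A quick check of the shapes and normalization (hypothetical sizes, not from the problem statement):

```python
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 16))       # 4 tokens, d_model = 16
W_gate = rng.standard_normal((8, 16))  # 8 experts
idx, w = moe_routing(x, W_gate, top_k=2)
print(idx.shape, w.shape)              # (4, 2) (4, 2)
print(w.sum(axis=-1))                  # all ~1.0
```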