SwiGLU is the feed-forward network (FFN) activation used in LLaMA, PaLM, and Mistral. It replaces the standard FFN:
FFN(x) = W2 * ReLU(W1 * x)
with a gated variant:
SwiGLU-FFN(x) = W2 * (SiLU(W1 * x) ⊙ (W3 * x))
where ⊙ is elementwise multiplication and W3 is an extra learned projection (the "up" projection), so that SiLU(W1 * x) acts as a gate on W3 * x.
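As a concrete sketch of the two variants (a reading of the formulas above, not reference code; weight shapes W1, W3: (d_ff, d_model) and W2: (d_model, d_ff) are assumed, biases omitted):

```python
import numpy as np

def ffn_relu(x, W1, W2):
    # Standard two-matrix FFN: W2 @ ReLU(W1 @ x)
    return W2 @ np.maximum(W1 @ x, 0.0)

def ffn_swiglu(x, W1, W2, W3):
    # Gated variant: W2 @ (SiLU(W1 @ x) * (W3 @ x))
    h = W1 @ x                        # gate branch, shape (d_ff,)
    silu_h = h / (1.0 + np.exp(-h))   # SiLU(z) = z * sigmoid(z)
    return W2 @ (silu_h * (W3 @ x))
```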
In this problem you implement only the gating operation (without the output projection W2):
swiglu(x, W1, W3) = SiLU(W1 @ x) * (W3 @ x)
where SiLU(z) = z * sigmoid(z) (also known as Swish).
Signature: def swiglu(x: np.ndarray, W1: np.ndarray, W3: np.ndarray) -> np.ndarray
Parameters:
- x: (d_model,) — input vector
- W1: (d_ff, d_model) — gate projection weight
- W3: (d_ff, d_model) — up projection weight

Returns:
- (d_ff,) — gated intermediate activations
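A minimal NumPy sketch of this operation could look like the following; the `sigmoid` helper is an assumption of this sketch, not part of the required signature:

```python
import numpy as np

def sigmoid(z: np.ndarray) -> np.ndarray:
    # Logistic sigmoid, evaluated separately for positive and negative
    # inputs so that np.exp never overflows.
    out = np.empty_like(z, dtype=np.float64)
    pos = z >= 0
    out[pos] = 1.0 / (1.0 + np.exp(-z[pos]))
    exp_z = np.exp(z[~pos])
    out[~pos] = exp_z / (1.0 + exp_z)
    return out

def swiglu(x: np.ndarray, W1: np.ndarray, W3: np.ndarray) -> np.ndarray:
    # SiLU(W1 @ x) ⊙ (W3 @ x); returns the (d_ff,) gated activations.
    gate = W1 @ x   # gate branch, shape (d_ff,)
    up = W3 @ x     # up-projection branch, shape (d_ff,)
    return gate * sigmoid(gate) * up
```

With x of shape (d_model,) and W1, W3 of shape (d_ff, d_model), the result has shape (d_ff,); applying an output projection W2 of shape (d_model, d_ff) to it would complete the full SwiGLU FFN described above.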