SwiGLU is the feed-forward network (FFN) activation used in LLaMA, PaLM, and Mistral. It replaces the standard FFN:
FFN(x) = W2 * ReLU(W1 * x)
with a gated variant:
SwiGLU-FFN(x) = W2 * (SiLU(W1 * x) ⊙ (W3 * x))
where ⊙ is elementwise multiplication and W3 is an extra learned projection (the "up" projection), so that SiLU(W1 * x) acts as a gate on W3 * x.
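As a concrete sketch of the two variants (a reading of the formulas above, not reference code; weight shapes W1, W3: (d_ff, d_model) and W2: (d_model, d_ff) are assumed, biases omitted):

```python
import numpy as np

def ffn_relu(x, W1, W2):
    # Standard two-matrix FFN: W2 @ ReLU(W1 @ x)
    return W2 @ np.maximum(W1 @ x, 0.0)

def ffn_swiglu(x, W1, W2, W3):
    # Gated variant: W2 @ (SiLU(W1 @ x) * (W3 @ x))
    h = W1 @ x                        # gate branch, shape (d_ff,)
    silu_h = h / (1.0 + np.exp(-h))   # SiLU(z) = z * sigmoid(z)
    return W2 @ (silu_h * (W3 @ x))
```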
In this problem you implement only the gating operation (without the output projection W2):
swiglu(x, W1, W3) = SiLU(W1 @ x) * (W3 @ x)
where SiLU(z) = z * sigmoid(z) (also known as Swish).
Signature: def swiglu(x: np.ndarray, W1: np.ndarray, W3: np.ndarray) -> np.ndarray
Parameters:
- x: (d_model,) — input vector
- W1: (d_ff, d_model) — gate projection weight
- W3: (d_ff, d_model) — up projection weight

Returns:
- (d_ff,) — gated intermediate activations
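A minimal NumPy sketch of this operation could look like the following; the `sigmoid` helper is an assumption of this sketch, not part of the required signature:

```python
import numpy as np

def sigmoid(z: np.ndarray) -> np.ndarray:
    # Logistic sigmoid, evaluated separately for positive and negative
    # inputs so that np.exp never overflows.
    out = np.empty_like(z, dtype=np.float64)
    pos = z >= 0
    out[pos] = 1.0 / (1.0 + np.exp(-z[pos]))
    exp_z = np.exp(z[~pos])
    out[~pos] = exp_z / (1.0 + exp_z)
    return out

def swiglu(x: np.ndarray, W1: np.ndarray, W3: np.ndarray) -> np.ndarray:
    # SiLU(W1 @ x) ⊙ (W3 @ x); returns the (d_ff,) gated activations.
    gate = W1 @ x   # gate branch, shape (d_ff,)
    up = W3 @ x     # up-projection branch, shape (d_ff,)
    return gate * sigmoid(gate) * up
```

With x of shape (d_model,) and W1, W3 of shape (d_ff, d_model), the result has shape (d_ff,); applying an output projection W2 of shape (d_model, d_ff) to it would complete the full SwiGLU FFN described above.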