Implement the FFN sublayer used in every Transformer block, in PyTorch with primitive tensor ops only.
FFN(x) = LayerNorm(x + (W2 · GELU(W1 · x + b1) + b2))
Signature: def transformer_mlp(x, W1, b1, W2, b2, gamma, beta) -> torch.Tensor
- x: (..., d_model)
- W1: (d_ff, d_model), b1: (d_ff,)
- W2: (d_model, d_ff), b2: (d_model,)
- gamma, beta: (d_model,)
- eps = 1e-5

The rule: you may NOT call nn.LayerNorm, F.layer_norm, or F.gelu. Hand-roll both LayerNorm (mean/variance) and GELU (the exact erf form below).
Use the exact GELU: 0.5 * h * (1 + erf(h / sqrt(2))). Do not use the tanh approximation — expected outputs are computed with exact GELU.
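A quick sanity check of the erf form against PyTorch's built-ins (fine to use F.gelu for verification, just not inside your solution) shows why the approximation mode matters:

```python
import torch
import torch.nn.functional as F

h = torch.linspace(-3, 3, 7)

# Hand-rolled exact GELU: 0.5 * h * (1 + erf(h / sqrt(2)))
gelu_exact = 0.5 * h * (1 + torch.erf(h / 2.0**0.5))

# Matches the exact built-in to floating-point precision...
exact_builtin = F.gelu(h, approximate='none')

# ...while the tanh approximation drifts by ~1e-4 for |h| near 2-3,
# which is enough to fail tight output comparisons.
tanh_builtin = F.gelu(h, approximate='tanh')
```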
PyTorch idioms:
- torch.erf(h / 2.0**0.5) gives the exact-form GELU. F.gelu(h, approximate='none') matches it; F.gelu(h, approximate='tanh') does not.
- x.var(dim=-1, keepdim=True, unbiased=False) computes the population variance LayerNorm needs.
- @ broadcasts naturally over leading dims, so the same code handles (d,), (B, d), and (B, T, d).
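The variance idiom is the one that usually trips people up: torch defaults to the unbiased (sample) estimator, but LayerNorm uses the biased (population) one. A minimal check, using F.layer_norm purely as an oracle (the actual solution must not call it):

```python
import torch
import torch.nn.functional as F

x = torch.randn(2, 5, 8)
gamma, beta = torch.ones(8), torch.zeros(8)

# Hand-rolled LayerNorm over the last dim.
mu = x.mean(dim=-1, keepdim=True)
var = x.var(dim=-1, keepdim=True, unbiased=False)  # population variance!
ln = (x - mu) / torch.sqrt(var + 1e-5) * gamma + beta

# Oracle for comparison only.
ref = F.layer_norm(x, (8,), gamma, beta, eps=1e-5)
```

With `unbiased=True` (the default) the normalized values would be off by a factor of sqrt((n-1)/n) and the comparison would fail.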
import torch

def transformer_mlp(x, W1, b1, W2, b2, gamma, beta):
    pass
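One possible sketch of a full solution, under the stated conventions (W1 · x with W1 of shape (d_ff, d_model) becomes x @ W1.T for batched inputs); this is an illustration, not the site's hidden reference solution:

```python
import torch

def transformer_mlp(x, W1, b1, W2, b2, gamma, beta):
    """LayerNorm(x + (W2 · GELU(W1 · x + b1) + b2)), primitive ops only."""
    eps = 1e-5
    # Position-wise feed-forward: up-project to d_ff, exact GELU, down-project.
    h = x @ W1.T + b1                             # (..., d_ff)
    h = 0.5 * h * (1 + torch.erf(h / 2.0**0.5))   # exact (erf) GELU
    y = x + (h @ W2.T + b2)                       # residual add, (..., d_model)
    # Hand-rolled LayerNorm over the last dim, population variance.
    mu = y.mean(dim=-1, keepdim=True)
    var = y.var(dim=-1, keepdim=True, unbiased=False)
    return (y - mu) / torch.sqrt(var + eps) * gamma + beta
```

Because every op broadcasts over leading dims, the same body handles (d,), (B, d), and (B, T, d) inputs unchanged.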