Implement the FFN (feed-forward network) sublayer used in every Transformer layer.
It applies two linear layers with a GELU activation between them, adds a residual connection to the input, and normalizes the result with LayerNorm:
FFN(x) = LayerNorm(x + W2 * GELU(W1 * x + b1) + b2)
Signature: def transformer_mlp(x, W1, b1, W2, b2, gamma, beta)
x: (d_model,) — input vector
W1: (d_ff, d_model), b1: (d_ff,) — first linear layer
W2: (d_model, d_ff), b2: (d_model,) — second linear layer
gamma, beta: (d_model,) — LayerNorm scale and shift
Output: (d_model,)
Use the exact GELU formula: 0.5 * h * (1 + erf(h / sqrt(2))).
LayerNorm epsilon: 1e-5.
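A minimal NumPy sketch of one possible solution, assuming all arguments are NumPy arrays with the shapes listed above; the use of scipy.special.erf (rather than an elementwise math.erf) is an assumption, not part of the problem statement:

```python
import numpy as np
from scipy.special import erf  # vectorized erf for the exact GELU formula (assumed dependency)

def transformer_mlp(x, W1, b1, W2, b2, gamma, beta):
    # First linear layer: (d_ff, d_model) @ (d_model,) + (d_ff,) -> (d_ff,)
    h = W1 @ x + b1
    # Exact GELU: 0.5 * h * (1 + erf(h / sqrt(2)))
    h = 0.5 * h * (1.0 + erf(h / np.sqrt(2.0)))
    # Second linear layer: (d_model, d_ff) @ (d_ff,) + (d_model,) -> (d_model,)
    h = W2 @ h + b2
    # Residual connection back to the input
    r = x + h
    # LayerNorm over the feature dimension with eps = 1e-5
    mean = r.mean()
    var = r.var()
    return gamma * (r - mean) / np.sqrt(var + 1e-5) + beta
```

Because the input here is a single un-batched (d_model,) vector, taking the mean and variance over the whole array is the same as normalizing over the feature dimension; a batched version would pass axis=-1 (and keepdims=True) to mean and var.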