Implement the FFN (feed-forward network) sublayer used in every Transformer layer.
It applies two linear layers with a GELU activation between them, adds a residual connection to the input, and normalizes the result with LayerNorm:
FFN(x) = LayerNorm(x + W2 * GELU(W1 * x + b1) + b2)
Signature: def transformer_mlp(x, W1, b1, W2, b2, gamma, beta)
x: (d_model,) — input vector
W1: (d_ff, d_model), b1: (d_ff,) — first linear layer
W2: (d_model, d_ff), b2: (d_model,) — second linear layer
gamma, beta: (d_model,) — LayerNorm scale and shift
Output: (d_model,)
Use the exact GELU formula: 0.5 * h * (1 + erf(h / sqrt(2))).
LayerNorm epsilon: 1e-5.
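A minimal NumPy sketch of one possible solution, assuming all arguments are NumPy arrays with the shapes listed above; the use of scipy.special.erf (rather than an elementwise math.erf) is an assumption, not part of the problem statement:

```python
import numpy as np
from scipy.special import erf  # vectorized erf for the exact GELU formula (assumed dependency)

def transformer_mlp(x, W1, b1, W2, b2, gamma, beta):
    # First linear layer: (d_ff, d_model) @ (d_model,) + (d_ff,) -> (d_ff,)
    h = W1 @ x + b1
    # Exact GELU: 0.5 * h * (1 + erf(h / sqrt(2)))
    h = 0.5 * h * (1.0 + erf(h / np.sqrt(2.0)))
    # Second linear layer: (d_model, d_ff) @ (d_ff,) + (d_model,) -> (d_model,)
    h = W2 @ h + b2
    # Residual connection back to the input
    r = x + h
    # LayerNorm over the feature dimension with eps = 1e-5
    mean = r.mean()
    var = r.var()
    return gamma * (r - mean) / np.sqrt(var + 1e-5) + beta
```

Because the input here is a single un-batched (d_model,) vector, taking the mean and variance over the whole array is the same as normalizing over the feature dimension; a batched version would pass axis=-1 (and keepdims=True) to mean and var.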