Transformer Encoder Block

A full Transformer encoder block combines self-attention + FFN with Pre-LN residuals. Each sublayer is wrapped as:

x = x + sublayer(LayerNorm(x))

The block:

Signature: def transformer_encoder_block(x, Wq, Wk, Wv, Wo, W1, b1, W2, b2, gamma1, beta1, gamma2, beta2)

Use standard (non-causal) self-attention. GELU: 0.5 * x * (1 + erf(x / sqrt(2))). LayerNorm epsilon: 1e-5.

Math

Asked at

A full Transformer encoder block combines self-attention + FFN with Pre-LN residuals. Each sublayer is wrapped as:

x = x + sublayer(LayerNorm(x))

The block:

Signature: def transformer_encoder_block(x, Wq, Wk, Wv, Wo, W1, b1, W2, b2, gamma1, beta1, gamma2, beta2)

Use standard (non-causal) self-attention. GELU: 0.5 * x * (1 + erf(x / sqrt(2))). LayerNorm epsilon: 1e-5.

Math

Asked at