A GPT-style Transformer decoder block has two sublayers (Pre-LayerNorm): causal (masked) self-attention and a position-wise feed-forward network (FFN).
Both sublayers use Pre-LN: LayerNorm is applied to the residual before the sublayer, and the sublayer output is added back (residual connection).
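In code, the Pre-LN residual pattern for both sublayers reduces to a single line; `norm` and `sublayer` below are illustrative placeholders for the LayerNorm and the attention/FFN operations defined by the signature that follows.

```python
# Pre-LN residual pattern used by both sublayers (illustrative names):
#   out = x + sublayer(LayerNorm(x))
def pre_ln_residual(x, sublayer, norm):
    return x + sublayer(norm(x))
```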
Signature: def transformer_decoder_block(x, Wq, Wk, Wv, Wo, W1, b1, W2, b2, gamma1, beta1, gamma2, beta2)
x: (seq, d) — input sequence
Wq, Wk, Wv, Wo: (d, d) — query/key/value/output projection weights
W1: (d_ff, d), b1: (d_ff,) — FFN first layer
W2: (d, d_ff), b2: (d,) — FFN second layer
gamma1, beta1: (d,) — LN params for self-attention
gamma2, beta2: (d,) — LN params for FFN
Returns: (seq, d)
Causal mask: position i can attend to positions 0..i only (upper triangle of the score matrix set to -1e9); see the sketch below.
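A minimal sketch of the causal mask, assuming NumPy; `seq` and `scores` here are placeholder values for illustration.

```python
import numpy as np

seq = 4
scores = np.zeros((seq, seq))                          # attention scores (seq, seq)
mask = np.triu(np.ones((seq, seq), dtype=bool), k=1)   # True strictly above the diagonal
masked_scores = np.where(mask, -1e9, scores)           # row i can attend to columns 0..i only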
GeLU: 0.5 * x * (1 + erf(x / sqrt(2)))
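Putting the pieces together, here is a minimal NumPy sketch of the full block, assuming single-head attention with the shapes listed above; `gelu`, `layer_norm`, and `softmax` are helpers defined here for illustration rather than part of the stated interface.

```python
import numpy as np
from scipy.special import erf

def gelu(x):
    # GeLU: 0.5 * x * (1 + erf(x / sqrt(2)))
    return 0.5 * x * (1.0 + erf(x / np.sqrt(2.0)))

def layer_norm(x, gamma, beta, eps=1e-5):
    # Normalize each token over the feature dimension, then scale and shift.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def transformer_decoder_block(x, Wq, Wk, Wv, Wo, W1, b1, W2, b2,
                              gamma1, beta1, gamma2, beta2):
    seq, d = x.shape

    # Sublayer 1: Pre-LN -> causal self-attention -> residual
    h = layer_norm(x, gamma1, beta1)
    Q, K, V = h @ Wq, h @ Wk, h @ Wv
    scores = Q @ K.T / np.sqrt(d)                           # (seq, seq)
    mask = np.triu(np.ones((seq, seq), dtype=bool), k=1)    # upper triangle
    scores = np.where(mask, -1e9, scores)                   # position i attends to 0..i
    attn = softmax(scores, axis=-1) @ V                     # (seq, d)
    x = x + attn @ Wo                                       # residual connection

    # Sublayer 2: Pre-LN -> position-wise FFN with GeLU -> residual
    h = layer_norm(x, gamma2, beta2)
    ffn = gelu(h @ W1.T + b1) @ W2.T + b2                   # W1: (d_ff, d), W2: (d, d_ff)
    return x + ffn                                          # (seq, d)
```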