A GPT-style Transformer decoder block has two sublayers (Pre-LayerNorm): causal (masked) self-attention and a position-wise feed-forward network (FFN).
Both sublayers use Pre-LN: LayerNorm is applied to the residual before the sublayer, and the sublayer output is added back (residual connection).
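In code, the Pre-LN residual pattern for both sublayers reduces to a single line; `norm` and `sublayer` below are illustrative placeholders for the LayerNorm and the attention/FFN operations defined by the signature that follows.

```python
# Pre-LN residual pattern used by both sublayers (illustrative names):
#   out = x + sublayer(LayerNorm(x))
def pre_ln_residual(x, sublayer, norm):
    return x + sublayer(norm(x))
```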
Signature: def transformer_decoder_block(x, Wq, Wk, Wv, Wo, W1, b1, W2, b2, gamma1, beta1, gamma2, beta2)
x: (seq, d) — input sequence
Wq, Wk, Wv, Wo: (d, d) — query/key/value/output projection weights
W1: (d_ff, d), b1: (d_ff,) — FFN first layer
W2: (d, d_ff), b2: (d,) — FFN second layer
gamma1, beta1: (d,) — LN params for self-attention
gamma2, beta2: (d,) — LN params for FFN
Returns: (seq, d)
Causal mask: position i can attend to positions 0..i only (upper triangle of the score matrix set to -1e9); see the sketch below.
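A minimal sketch of the causal mask, assuming NumPy; `seq` and `scores` here are placeholder values for illustration.

```python
import numpy as np

seq = 4
scores = np.zeros((seq, seq))                          # attention scores (seq, seq)
mask = np.triu(np.ones((seq, seq), dtype=bool), k=1)   # True strictly above the diagonal
masked_scores = np.where(mask, -1e9, scores)           # row i can attend to columns 0..i only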
GeLU: 0.5 * x * (1 + erf(x / sqrt(2)))
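Putting the pieces together, here is a minimal NumPy sketch of the full block, assuming single-head attention with the shapes listed above; `gelu`, `layer_norm`, and `softmax` are helpers defined here for illustration rather than part of the stated interface.

```python
import numpy as np
from scipy.special import erf

def gelu(x):
    # GeLU: 0.5 * x * (1 + erf(x / sqrt(2)))
    return 0.5 * x * (1.0 + erf(x / np.sqrt(2.0)))

def layer_norm(x, gamma, beta, eps=1e-5):
    # Normalize each token over the feature dimension, then scale and shift.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def transformer_decoder_block(x, Wq, Wk, Wv, Wo, W1, b1, W2, b2,
                              gamma1, beta1, gamma2, beta2):
    seq, d = x.shape

    # Sublayer 1: Pre-LN -> causal self-attention -> residual
    h = layer_norm(x, gamma1, beta1)
    Q, K, V = h @ Wq, h @ Wk, h @ Wv
    scores = Q @ K.T / np.sqrt(d)                           # (seq, seq)
    mask = np.triu(np.ones((seq, seq), dtype=bool), k=1)    # upper triangle
    scores = np.where(mask, -1e9, scores)                   # position i attends to 0..i
    attn = softmax(scores, axis=-1) @ V                     # (seq, d)
    x = x + attn @ Wo                                       # residual connection

    # Sublayer 2: Pre-LN -> position-wise FFN with GeLU -> residual
    h = layer_norm(x, gamma2, beta2)
    ffn = gelu(h @ W1.T + b1) @ W2.T + b2                   # W1: (d_ff, d), W2: (d, d_ff)
    return x + ffn                                          # (seq, d)
```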