
Transformer Decoder Block (Hard)

A GPT-style Transformer decoder block has two sublayers (Pre-LayerNorm):

  1. Masked (causal) self-attention: each position attends only to itself and earlier positions
  2. Feed-forward network (FFN): two linear layers with a GeLU activation between them

Both sublayers use Pre-LN: LayerNorm is applied to the sublayer's input (the residual stream) before the sublayer runs, and the sublayer's output is added back to the un-normalized input (residual connection).
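
A minimal LayerNorm helper in numpy, for reference; the epsilon value (1e-5, PyTorch's default) is an assumption, since the statement does not specify one. The comments at the end sketch how the two Pre-LN residual branches compose.

    import numpy as np

    def layer_norm(x, gamma, beta, eps=1e-5):
        # Normalize each token (row) over the feature dimension, then scale and shift.
        # eps=1e-5 is an assumption (PyTorch's default); the statement does not fix it.
        mu = x.mean(axis=-1, keepdims=True)
        var = x.var(axis=-1, keepdims=True)
        return gamma * (x - mu) / np.sqrt(var + eps) + beta

    # Pre-LN data flow for the whole block:
    #   h   = x + SelfAttention(layer_norm(x, gamma1, beta1))
    #   out = h + FFN(layer_norm(h, gamma2, beta2))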

Signature: def transformer_decoder_block(x, Wq, Wk, Wv, Wo, W1, b1, W2, b2, gamma1, beta1, gamma2, beta2)

  • x: (seq, d) — input sequence
  • Wq, Wk, Wv, Wo: (d, d) — query/key/value/output projection weights
  • W1: (d_ff, d), b1: (d_ff,) — FFN first layer
  • W2: (d, d_ff), b2: (d,) — FFN second layer
  • gamma1, beta1: (d,) — LN params for self-attention
  • gamma2, beta2: (d,) — LN params for FFN
  • Returns: (seq, d)
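
Note the FFN weight shapes: W1 is (d_ff, d) and W2 is (d, d_ff), i.e. (out_features, in_features). That matches the PyTorch nn.Linear storage convention, which suggests (but does not guarantee) that each layer is applied through the transpose. A quick shape check under that assumption:

    import numpy as np

    seq, d, d_ff = 3, 4, 8
    rng = np.random.default_rng(0)          # illustrative values, not the test's
    h = rng.standard_normal((seq, d))
    W1 = rng.standard_normal((d_ff, d))     # stored (out_features, in_features)
    b1 = rng.standard_normal(d_ff)

    hidden = h @ W1.T + b1                  # (seq, d) @ (d, d_ff) -> (seq, d_ff)
    assert hidden.shape == (seq, d_ff)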

Causal mask: position i can attend to positions 0..i only (the strictly upper-triangular entries of the score matrix, i.e. all future positions, are set to -1e9 before the softmax).
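
A sketch of one way to build that mask: np.triu(..., k=1) selects the strictly upper triangle (each row's future positions), which is then overwritten with -1e9 so the softmax sends those attention weights to effectively zero.

    import numpy as np

    seq = 4
    scores = np.zeros((seq, seq))                         # placeholder attention scores
    future = np.triu(np.ones((seq, seq), dtype=bool), k=1)
    masked = np.where(future, -1e9, scores)
    # Row i now holds -1e9 in columns i+1..seq-1, so after the softmax
    # those (future) positions receive essentially zero weight.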

GeLU: 0.5 * x * (1 + erf(x / sqrt(2)))
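
Putting the pieces together, a possible end-to-end sketch in numpy follows. It bakes in assumptions the statement leaves open: projections are applied as h @ W (orientation is ambiguous for square (d, d) matrices), scores are scaled by sqrt(d) (single-head attention, so d_k = d), LayerNorm uses eps = 1e-5, and SciPy's erf supplies the exact GeLU.

    import numpy as np
    from scipy.special import erf

    def gelu(x):
        # Exact GeLU, matching the formula above.
        return 0.5 * x * (1.0 + erf(x / np.sqrt(2.0)))

    def layer_norm(x, gamma, beta, eps=1e-5):
        # eps is an assumption (PyTorch's default); the statement does not specify it.
        mu = x.mean(axis=-1, keepdims=True)
        var = x.var(axis=-1, keepdims=True)
        return gamma * (x - mu) / np.sqrt(var + eps) + beta

    def transformer_decoder_block(x, Wq, Wk, Wv, Wo, W1, b1, W2, b2,
                                  gamma1, beta1, gamma2, beta2):
        seq, d = x.shape

        # Sublayer 1: masked self-attention with Pre-LN.
        h = layer_norm(x, gamma1, beta1)
        Q, K, V = h @ Wq, h @ Wk, h @ Wv               # each (seq, d); orientation assumed
        scores = Q @ K.T / np.sqrt(d)                  # (seq, seq); sqrt(d) scaling assumed
        future = np.triu(np.ones((seq, seq), dtype=bool), k=1)
        scores = np.where(future, -1e9, scores)        # causal mask
        scores -= scores.max(axis=-1, keepdims=True)   # numerically stable softmax
        weights = np.exp(scores)
        weights /= weights.sum(axis=-1, keepdims=True)
        x = x + (weights @ V) @ Wo                     # residual connection

        # Sublayer 2: feed-forward network with Pre-LN.
        h = layer_norm(x, gamma2, beta2)
        hidden = gelu(h @ W1.T + b1)                   # (seq, d_ff)
        return x + hidden @ W2.T + b2                  # (seq, d)

Subtracting the row maximum before exponentiating changes nothing mathematically but keeps np.exp from overflowing on large positive scores.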

Language: Python (numpy)

Test Results

  • seq=3, d=4, d_ff=8, ones gamma/zeros beta (seed 42)
  • Causal mask: token 0 attends only to itself
  • seq=4, d=4, d_ff=8, non-trivial LN params (seed 7) [Premium]
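
For a rough sense of how the first visible test might drive the function, here is a hypothetical harness. Only the shapes, the seed, and the ones-gamma/zeros-beta LN params come from the test description; the weight construction and draw order below are pure assumptions, so the actual expected values will differ.

    import numpy as np

    # Assumes transformer_decoder_block from the sketch above is in scope.
    seq, d, d_ff = 3, 4, 8
    rng = np.random.default_rng(42)          # seed from the test description;
                                             # the exact draw order is an assumption
    Wq, Wk, Wv, Wo = (rng.standard_normal((d, d)) for _ in range(4))
    W1, b1 = rng.standard_normal((d_ff, d)), rng.standard_normal(d_ff)
    W2, b2 = rng.standard_normal((d, d_ff)), rng.standard_normal(d)
    gamma1 = gamma2 = np.ones(d)             # "ones gamma"
    beta1 = beta2 = np.zeros(d)              # "zeros beta"
    x = rng.standard_normal((seq, d))

    out = transformer_decoder_block(x, Wq, Wk, Wv, Wo, W1, b1, W2, b2,
                                    gamma1, beta1, gamma2, beta2)
    assert out.shape == (seq, d)             # output keeps the input shape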