Modern transformers (GPT-2 onward) use Pre-LN: normalize the input before the sublayer, then add the residual. This is more stable during training than the original Post-LN arrangement.
Pre-LN:  output = x + sublayer(LayerNorm(x))
Post-LN: output = LayerNorm(x + sublayer(x))

Implement the Pre-LN wrapper, where the sublayer is a linear projection W @ x + b (simulating attention or FFN output).
Signature: def pre_layernorm_block(x, W, b, gamma, beta)
x: (d,) — input
W: (d, d), b: (d,) — sublayer weights
gamma, beta: (d,) — LayerNorm parameters
Returns: (d,)
LayerNorm epsilon: 1e-5.
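A minimal NumPy sketch of one way to implement this, assuming all arguments are NumPy arrays of the stated shapes (the demo values at the bottom are illustrative, not part of the spec):

```python
import numpy as np

def pre_layernorm_block(x, W, b, gamma, beta):
    eps = 1e-5  # LayerNorm epsilon, as specified
    # 1. LayerNorm the input (Pre-LN: normalize *before* the sublayer).
    mean = x.mean()
    var = x.var()  # population variance (ddof=0), standard for LayerNorm
    x_norm = gamma * (x - mean) / np.sqrt(var + eps) + beta
    # 2. Sublayer: linear projection standing in for attention/FFN output.
    sub = W @ x_norm + b
    # 3. Residual connection around the normalized sublayer.
    return x + sub

# Hypothetical usage with random weights:
d = 4
rng = np.random.default_rng(0)
x = rng.standard_normal(d)
W, b = rng.standard_normal((d, d)), rng.standard_normal(d)
out = pre_layernorm_block(x, W, b, np.ones(d), np.zeros(d))
print(out.shape)  # (d,)
```

Note that only the sublayer input is normalized; the residual branch carries x through untouched, which is what makes Pre-LN gradients better behaved in deep stacks.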