Modern transformers (GPT-2 onward) use Pre-LN: normalize the input before the sublayer, then add the residual. This is more stable during training than the original Post-LN arrangement.
Pre-LN:  output = x + sublayer(LayerNorm(x))
Post-LN: output = LayerNorm(x + sublayer(x))

Implement the Pre-LN wrapper, where the sublayer is a linear projection W @ x + b (simulating attention or FFN output).
Signature: def pre_layernorm_block(x, W, b, gamma, beta)
x: (d,) — input
W: (d, d), b: (d,) — sublayer weights
gamma, beta: (d,) — LayerNorm parameters
Returns: (d,)
LayerNorm epsilon: 1e-5.
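A minimal NumPy sketch of one way to implement this, assuming all arguments are NumPy arrays of the stated shapes (the demo values at the bottom are illustrative, not part of the spec):

```python
import numpy as np

def pre_layernorm_block(x, W, b, gamma, beta):
    eps = 1e-5  # LayerNorm epsilon, as specified
    # 1. LayerNorm the input (Pre-LN: normalize *before* the sublayer).
    mean = x.mean()
    var = x.var()  # population variance (ddof=0), standard for LayerNorm
    x_norm = gamma * (x - mean) / np.sqrt(var + eps) + beta
    # 2. Sublayer: linear projection standing in for attention/FFN output.
    sub = W @ x_norm + b
    # 3. Residual connection around the normalized sublayer.
    return x + sub

# Hypothetical usage with random weights:
d = 4
rng = np.random.default_rng(0)
x = rng.standard_normal(d)
W, b = rng.standard_normal((d, d)), rng.standard_normal(d)
out = pre_layernorm_block(x, W, b, np.ones(d), np.zeros(d))
print(out.shape)  # (d,)
```

Note that only the sublayer input is normalized; the residual branch carries x through untouched, which is what makes Pre-LN gradients better behaved in deep stacks.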