
Pre-LayerNorm Residual Block (Easy)

Modern transformers (GPT-2 onwards) use Pre-LN: normalize the input before the sublayer, then add the residual. This ordering is more stable to train than the original Post-LN (both are sketched below).

  • Pre-LN: output = x + sublayer(LayerNorm(x))
  • Post-LN: output = LayerNorm(x + sublayer(x))
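A minimal numpy illustration of the two orderings, using a toy scaling sublayer and an unparameterized LayerNorm (the names ln and sublayer and the example values are mine, not part of the problem):

    import numpy as np

    x = np.array([1.0, 2.0, 3.0])
    ln = lambda v: (v - v.mean()) / np.sqrt(v.var() + 1e-5)  # LayerNorm without gamma/beta
    sublayer = lambda v: 2.0 * v                             # stand-in for attention/FFN

    pre_ln  = x + sublayer(ln(x))   # Pre-LN: normalize, transform, add residual
    post_ln = ln(x + sublayer(x))   # Post-LN: transform, add residual, then normalize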

Implement the Pre-LN wrapper where the sublayer is a linear projection W @ x + b (simulating attention or FFN output).

Signature: def pre_layernorm_block(x, W, b, gamma, beta)

  • x: (d,) — input
  • W: (d, d), b: (d,) — sublayer weights
  • gamma, beta: (d,) — LayerNorm parameters
  • Returns: (d,)

LayerNorm epsilon: 1e-5.
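A minimal reference sketch in numpy. The layer_norm helper name and the biased (population) variance are assumptions; the problem only fixes the pre_layernorm_block signature and the epsilon.

    import numpy as np

    def layer_norm(x, gamma, beta, eps=1e-5):
        # Normalize over the feature dimension using the biased (population) variance,
        # then apply the learned scale and shift.
        mean = x.mean()
        var = x.var()
        x_hat = (x - mean) / np.sqrt(var + eps)
        return gamma * x_hat + beta

    def pre_layernorm_block(x, W, b, gamma, beta):
        # Pre-LN: normalize first, run the sublayer, then add the residual connection.
        normed = layer_norm(x, gamma, beta, eps=1e-5)
        sublayer_out = W @ normed + b  # linear projection standing in for attention/FFN output
        return x + sublayer_out

Note that the residual path carries the raw x, so the skip connection stays an identity map.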

Math

Python (numpy)

Test Results

  • seed 42 — d=4
  • identity W, gamma=1, beta=0 — output = x + LayerNorm(x) (see the check below)
  • seed 7 — d=8 (Premium)
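
A quick sanity check of the identity-W case, assuming pre_layernorm_block from the sketch above. The choice b = 0 and the sample values are assumptions; this is not the site's actual test harness:

    import numpy as np

    d = 4
    rng = np.random.default_rng(0)  # arbitrary seed, not the graded "seed 42" case
    x = rng.standard_normal(d)

    # With W = I, b = 0, gamma = 1, beta = 0, the block should reduce to x + LayerNorm(x).
    out = pre_layernorm_block(x, np.eye(d), np.zeros(d), np.ones(d), np.zeros(d))
    expected = x + (x - x.mean()) / np.sqrt(x.var() + 1e-5)
    assert np.allclose(out, expected)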