Implement the Pre-LN residual block in PyTorch using primitive tensor ops only:
output = x + sublayer(LayerNorm(x))
The sublayer is a linear projection x_norm @ W.T + b (standing in for attention or FFN output).
Signature: def pre_layernorm_block(x, W, b, gamma, beta) -> torch.Tensor
x: (..., d)
W: (d, d), b: (d,)
gamma, beta: (d,)
eps = 1e-5
The rule: you may NOT call nn.LayerNorm or F.layer_norm. Build LN from .mean() / .var().
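As a minimal sketch of the rule above, layer normalization can be built from .mean() and .var() over the last dimension. The helper name manual_layer_norm and the test tensor shapes are assumptions made for illustration; they are not part of the problem statement.

```python
import torch

def manual_layer_norm(x, gamma, beta, eps=1e-5):
    # Hypothetical helper: statistics are taken over the last (feature) dim only.
    mean = x.mean(dim=-1, keepdim=True)
    # unbiased=False divides by d rather than d - 1 (population variance).
    var = x.var(dim=-1, keepdim=True, unbiased=False)
    x_hat = (x - mean) / torch.sqrt(var + eps)
    # gamma and beta have shape (d,) and broadcast over the leading dims.
    return x_hat * gamma + beta

torch.manual_seed(0)
x = torch.randn(2, 3, 8)          # illustrative shape (..., d) with d = 8
gamma = torch.ones(8)
beta = torch.zeros(8)
y = manual_layer_norm(x, gamma, beta)
print(y.shape)
```

With gamma set to ones and beta to zeros, each vector along the last dimension of y should have mean close to 0 and population variance close to 1.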
PyTorch idioms vs NumPy:
x.var(dim=-1, keepdim=True, unbiased=False): population variance (divide by d), which matches the LN convention. The default unbiased=True divides by d - 1 and is wrong here.
x_norm @ W.T + b: matmul broadcasts naturally over leading dims, so this works for 1D, 2D, and 3D inputs without reshaping.
x + sublayer(...) is not normalized again: that is the defining property of Pre-LN vs Post-LN, and it is a commonly cited reason why GPT-style models train more stably, often with less reliance on learning-rate warmup.
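The three idioms above can be combined into one possible sketch of the block. This is an illustration under the stated signature, not a reference solution; the module-level EPS constant, the random test tensors, and the shapes list are assumptions added for the example.

```python
import torch

EPS = 1e-5  # eps value given in the problem statement

def pre_layernorm_block(x, W, b, gamma, beta):
    # LayerNorm over the last dim, using population variance.
    mean = x.mean(dim=-1, keepdim=True)
    var = x.var(dim=-1, keepdim=True, unbiased=False)
    x_norm = (x - mean) / torch.sqrt(var + EPS) * gamma + beta
    # Linear sublayer; matmul broadcasts over any leading dims.
    sub = x_norm @ W.T + b
    # Residual adds the un-normalized input; no LN is applied after the sum.
    return x + sub

torch.manual_seed(0)
d = 4
W, b = torch.randn(d, d), torch.randn(d)
gamma, beta = torch.ones(d), torch.zeros(d)

# The same function handles 1D, 2D, and 3D inputs without reshaping.
shapes = [(d,), (3, d), (2, 3, d)]
outs = [pre_layernorm_block(torch.randn(*s), W, b, gamma, beta) for s in shapes]
for s, out in zip(shapes, outs):
    print(s, "->", tuple(out.shape))
```

In each case the output shape equals the input shape, since the sublayer maps d features back to d features and the residual add requires matching shapes.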
import torch
def pre_layernorm_block(x, W, b, gamma, beta):
    pass