Pre-LayerNorm Residual Block (PyTorch)

Implement the Pre-LN residual block in PyTorch using primitive tensor ops only:

output = x + sublayer(LayerNorm(x))

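The sublayer is a linear projection x_norm @ W.T + b, where x_norm = LayerNorm(x) (standing in for attention or FFN output). W must be square, shape (d, d), so the projection can be added back to x.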

Signature: def pre_layernorm_block(x, W, b, gamma, beta) -> torch.Tensor

The rule: you may NOT call nn.LayerNorm or F.layer_norm. Build LN from .mean() / .var().
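
A minimal sketch of LayerNorm built from .mean() / .var(), for reference. The helper name manual_layer_norm and the eps=1e-5 default are assumptions: the problem signature has no epsilon argument, and 1e-5 is simply the default used by nn.LayerNorm.

```python
import torch

def manual_layer_norm(x, gamma, beta, eps=1e-5):
    # eps is an assumed stabilizer; it is not part of the problem signature.
    mean = x.mean(dim=-1, keepdim=True)                 # per-row mean over features
    var = x.var(dim=-1, keepdim=True, unbiased=False)   # population variance (divide by N)
    x_hat = (x - mean) / torch.sqrt(var + eps)          # zero mean, ~unit variance per row
    return gamma * x_hat + beta                         # learned scale and shift

torch.manual_seed(0)
x = torch.randn(4, 8)     # (batch, features)
gamma = torch.ones(8)     # identity scale
beta = torch.zeros(8)     # zero shift
y = manual_layer_norm(x, gamma, beta)
```

With gamma = 1 and beta = 0, each row of y should have mean close to zero. Comparing against F.layer_norm is fine as a check, just not inside the solution itself.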

PyTorch idioms vs NumPy:

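x.var(dim=-1, keepdim=True, unbiased=False): population variance (divide by N), matching the LayerNorm convention. The default unbiased=True applies Bessel's correction (divide by N-1) and will not match. Newer PyTorch versions also accept correction=0 for the same thing.
x_norm @ W.T + b: matmul broadcasts naturally over leading dims, so the same expression works for 1D, 2D, and 3D inputs without reshaping.
The residual x + sublayer(...) is not normalized again. That is the defining difference between Pre-LN and Post-LN, which computes LayerNorm(x + sublayer(x)), and it is one reason Pre-LN models such as GPT-2 train more stably and depend less on learning-rate warmup.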
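
Putting the idioms together, one possible implementation looks like the sketch below. The eps keyword (default 1e-5) is an assumption added for numerical stability and is not in the stated signature; the test shapes and the feature size d = 8 are illustrative only.

```python
import torch

def pre_layernorm_block(x, W, b, gamma, beta, eps=1e-5):
    # eps is an assumed extra keyword; the stated signature omits it.
    mean = x.mean(dim=-1, keepdim=True)
    var = x.var(dim=-1, keepdim=True, unbiased=False)   # population variance
    x_norm = gamma * (x - mean) / torch.sqrt(var + eps) + beta
    sublayer_out = x_norm @ W.T + b                     # broadcasts over leading dims
    return x + sublayer_out                             # residual path is left un-normalized

torch.manual_seed(0)
d = 8
W = torch.randn(d, d)     # square so the output matches x's last dim
b = torch.randn(d)
gamma = torch.ones(d)
beta = torch.zeros(d)

# The same call handles 1D, 2D, and 3D inputs.
out_shapes = {}
for shape in [(d,), (4, d), (2, 4, d)]:
    x = torch.randn(*shape)
    out = pre_layernorm_block(x, W, b, gamma, beta)
    out_shapes[shape] = tuple(out.shape)
```

A quick sanity check: with W and b set to zeros the sublayer contributes nothing, so the block should return x unchanged.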

Math

output = x + sublayer(LayerNorm(x))
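
Expanded for the linear sublayer used here, the formula can be written out as follows. This is a sketch of the standard LayerNorm definition over a feature vector of size d; the stabilizer \epsilon is an assumption, since the problem statement does not specify one.

```latex
\mu = \frac{1}{d}\sum_{i=1}^{d} x_i,
\qquad
\sigma^2 = \frac{1}{d}\sum_{i=1}^{d} (x_i - \mu)^2

\mathrm{LayerNorm}(x) = \gamma \odot \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta

\mathrm{output} = x + W\,\mathrm{LayerNorm}(x) + b
```

The 1/d in \sigma^2 is the population variance, which is what unbiased=False computes. For a single feature vector, W LayerNorm(x) is the same product as x_norm @ W.T in the code.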
