LayerNorm with Pre-allocated Output Buffer

Implement LayerNorm so that the output is written into a caller-supplied buffer, out, instead of a freshly allocated tensor. This mirrors how production fused kernels typically work: an inference engine pre-allocates its activation buffers once, and every layer writes into that existing memory.

Signature: def layernorm_inplace(x, gamma, beta, out, eps=1e-5) -> out

x: input, shape (..., D) (normalize over last axis)
gamma: per-feature scale, shape (D,)
beta: per-feature shift, shape (D,)
out: pre-allocated output buffer, same shape as x — write your result here
eps: stability constant for the variance

The function must:

Compute LayerNorm: (x - mean) / sqrt(var + eps) * gamma + beta, where mean and var are taken over the last axis (var is the population variance, i.e. the sum of squared deviations divided by D).
Write the result into out (e.g. out[...] = ...).
Return out.

Constraints:

Do NOT allocate a new output array (np.empty_like(x), np.zeros_like(x), etc.). Just normalize and assign into the buffer the caller passed you.
Intermediate per-row statistics (mean, var) are fine: they take O(B) memory, not O(B*D), where B is the number of rows (the product of the leading dimensions of x).
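
One way to satisfy these constraints, shown as a minimal sketch rather than a reference solution: compute the per-row statistics first, then build the result step by step inside out using NumPy's out= argument and in-place operators. It assumes out has a floating-point dtype and the same shape as x; that assumption is not stated in the problem.

```python
import numpy as np

def layernorm_inplace(x, gamma, beta, out, eps=1e-5):
    # Per-row statistics, shape (..., 1). x.var uses ddof=0 (population
    # variance) and creates a transient temporary internally; the value it
    # returns has one entry per row.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)

    # Build the result directly in the caller's buffer.
    # Assumption: out is a float array with the same shape as x.
    np.subtract(x, mean, out=out)   # out = x - mean
    out /= np.sqrt(var + eps)       # normalize each row
    out *= gamma                    # per-feature scale, broadcast over rows
    out += beta                     # per-feature shift
    return out
```

Because mean and var are computed before anything is written, this sketch also tolerates the caller passing the same array as both x and out.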

The harness verifies the returned array equals the LayerNorm of x. One test passes in a pre-zeroed buffer to confirm you actually wrote into it (rather than returning a fresh allocation).
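
The actual harness is not shown, so the following is only a guess at what the pre-zeroed-buffer test might look like. The names reference_layernorm, passes_buffer_check, writes_into_out and ignores_out are hypothetical and exist only for this illustration.

```python
import numpy as np

def reference_layernorm(x, gamma, beta, eps=1e-5):
    # Plain allocating LayerNorm, used only as ground truth for the check.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps) * gamma + beta

def passes_buffer_check(fn, seed=0):
    # Hypothetical check: call fn with a pre-zeroed buffer and require that
    # both the return value and the buffer itself hold the LayerNorm result.
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((4, 8))
    gamma = rng.standard_normal(8)
    beta = rng.standard_normal(8)
    out = np.zeros_like(x)
    result = fn(x, gamma, beta, out)
    expected = reference_layernorm(x, gamma, beta)
    return bool(np.allclose(result, expected) and np.allclose(out, expected))

def writes_into_out(x, gamma, beta, out, eps=1e-5):
    # Assigns into the caller's buffer, as the task requires.
    out[...] = reference_layernorm(x, gamma, beta, eps)
    return out

def ignores_out(x, gamma, beta, out, eps=1e-5):
    # Returns a fresh array and leaves the buffer zeroed.
    return reference_layernorm(x, gamma, beta, eps)
```

Under this check, writes_into_out passes, while ignores_out fails even though its return value is numerically correct, because the zeroed buffer was never touched.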

Math

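\mathrm{out}_i = \frac{x_i - \mu_i}{\sqrt{\sigma_i^2 + \epsilon}} \, \gamma + \beta

Here i indexes the rows of x, and \mu_i and \sigma_i^2 are the mean and variance of row i over the last axis.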
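
As a quick sanity check, here is a hand-worked example for a single row x = (1, 2, 3) with \gamma = 1 and \beta = 0, ignoring \epsilon for readability:

\mu = \frac{1 + 2 + 3}{3} = 2, \qquad \sigma^2 = \frac{(1-2)^2 + (2-2)^2 + (3-2)^2}{3} = \frac{2}{3}

\mathrm{out} = \frac{(1-2,\ 2-2,\ 3-2)}{\sqrt{2/3}} \approx (-1.2247,\ 0,\ 1.2247)

Including \epsilon = 10^{-5} shifts the nonzero entries by roughly 10^{-5}, so they still round to \pm 1.2247.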
