Backprop: BatchNorm (train mode)

Hand-derive the gradient of L = sum(BatchNorm(x)) w.r.t. the input x in training mode (using batch statistics, no running mean/var).

Forward (per feature column, along the batch axis):

mu = mean(x, axis=0), var = mean((x - mu)^2, axis=0), std = sqrt(var + eps)
x_hat = (x - mu) / std
y = gamma * x_hat + beta

Implement:

batchnorm_forward(x, gamma, beta, eps=1e-5) -> y of shape (N, F)
batchnorm_backward(x, gamma, beta, eps=1e-5) -> dL/dx of shape (N, F)
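A minimal NumPy sketch of the forward pass, following the three formulas above line by line. It assumes x is a float array of shape (N, F) and that gamma and beta have shape (F,); those shapes for gamma and beta are an assumption, since the problem only states the output shape.

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    # Batch statistics per feature column (reduce over axis 0, the batch axis).
    mu = x.mean(axis=0)
    var = ((x - mu) ** 2).mean(axis=0)   # biased variance, as in the problem statement
    std = np.sqrt(var + eps)
    x_hat = (x - mu) / std               # normalized input, shape (N, F)
    return gamma * x_hat + beta          # gamma, beta broadcast across the batch

# Example call on a small batch (N=3, F=2).
x = np.array([[1.0, 2.0], [3.0, 6.0], [5.0, 10.0]])
gamma = np.array([2.0, 0.5])
beta = np.array([0.1, -0.3])
y = batchnorm_forward(x, gamma, beta)
```

Each column of x_hat has zero mean over the batch, so each column of y should average to the corresponding entry of beta.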

The math is identical to LayerNorm but transposed: the reductions run along the batch dimension instead of the feature dimension. With L = sum(y), dL/dx_hat = gamma (broadcast across the batch), so

dL/dx = (1/std) * (dL/dx_hat - mean_batch(dL/dx_hat) - x_hat * mean_batch(dL/dx_hat * x_hat))

Note: gamma does not depend on the batch index, so mean_batch(dL/dx_hat) = mean_batch(gamma) is just gamma itself, and the first two terms cancel exactly. Verify what's left.
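One way to do that verification is to code the formula as written and compare it against a finite-difference estimate. The sketch below assumes float inputs with gamma and beta of shape (F,); finite_difference_grad is a hypothetical helper introduced here for the check, not part of the required interface.

```python
import numpy as np

def batchnorm_backward(x, gamma, beta, eps=1e-5):
    # beta is accepted to match the required signature; it does not affect dL/dx.
    mu = x.mean(axis=0)
    std = np.sqrt(((x - mu) ** 2).mean(axis=0) + eps)
    x_hat = (x - mu) / std
    g = np.broadcast_to(gamma, x.shape)  # dL/dx_hat for L = sum(y)
    return (g - g.mean(axis=0) - x_hat * (g * x_hat).mean(axis=0)) / std

def finite_difference_grad(x, gamma, beta, eps=1e-5, h=1e-5):
    # Hypothetical helper: central differences on L = sum(BatchNorm(x)).
    def loss(z):
        mu = z.mean(axis=0)
        std = np.sqrt(((z - mu) ** 2).mean(axis=0) + eps)
        return np.sum(gamma * (z - mu) / std + beta)

    grad = np.zeros_like(x, dtype=float)
    for idx in np.ndindex(*x.shape):
        xp, xm = x.copy(), x.copy()
        xp[idx] += h
        xm[idx] -= h
        grad[idx] = (loss(xp) - loss(xm)) / (2 * h)
    return grad

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))
gamma = rng.normal(size=3)
beta = rng.normal(size=3)
analytic = batchnorm_backward(x, gamma, beta)
numeric = finite_difference_grad(x, gamma, beta)
print(np.max(np.abs(analytic - numeric)))
```

The surviving term is proportional to mean_batch(x_hat), which is zero by construction, so both gradients should sit at the level of floating-point round-off rather than at the scale of gamma / std.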

Math

\frac{\partial L}{\partial x_{nj}} = \frac{1}{\sigma_{j}} \left( \gamma_{j} - \overline{\gamma_{j}}_{(n)} - \hat{x}_{nj} \cdot \overline{\gamma_{j} \hat{x}_{nj}}_{(n)} \right)
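
As a worked check of the expression above, the two batch means can be evaluated directly. This is a sketch that assumes the bar with subscript (n) denotes the mean over the batch index n, and that \sigma_{j} is the std defined in the forward pass (including eps):

```latex
\overline{\gamma_{j}}_{(n)} = \gamma_{j},
\qquad
\overline{\gamma_{j} \hat{x}_{nj}}_{(n)}
  = \gamma_{j} \cdot \frac{1}{N} \sum_{n=1}^{N} \hat{x}_{nj}
  = \gamma_{j} \cdot \frac{1}{N \sigma_{j}} \sum_{n=1}^{N} (x_{nj} - \mu_{j})
  = 0
\quad\Longrightarrow\quad
\frac{\partial L}{\partial x_{nj}} = \frac{1}{\sigma_{j}} \left( \gamma_{j} - \gamma_{j} - \hat{x}_{nj} \cdot 0 \right) = 0
```

This agrees with a direct look at the loss: \sum_{n} y_{nj} = \gamma_{j} \sum_{n} \hat{x}_{nj} + N \beta_{j} = N \beta_{j}, which does not depend on x.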
