Backprop: Linear (matmul + bias)

Hand-derive the gradient of a linear layer's output (summed to a scalar) with respect to its input x.

Forward: y = x @ W.T + b, then L = sum(y)

Implement two functions:

linear_forward(x, W, b) -> y — produces y of shape (batch, out_features)
linear_backward(x, W, b) -> dL_dx of shape (batch, in_features) — gradient of L = sum(y) w.r.t. x

The harness verifies your analytic linear_backward by central-differencing your linear_forward at every position of x. If they match within tolerance, you've gotten the chain rule right.

Hint: When L = sum(y) and y = x @ W.T + b, dL/dy = ones, so dL/dx = ones @ W = W.sum(axis=0) broadcast to every batch row.

Math

y = x W^{⊤} + b, L = i \sum y_{i}, \frac{\partial L}{\partial x} = 1 \cdot W

200. Backprop: Linear (matmul + bias)

200. Backprop: Linear (matmul + bias)