Backprop: Softmax — TorchedUp

Hand-derive the gradient of L = sum(softmax(x)) w.r.t. x.

Forward: y_i = exp(x_i) / sum_j exp(x_j) along the last axis. L = sum(y).

Implement:

Softmax has a non-diagonal Jacobian — every output element depends on every input element. The full Jacobian is J_ij = y_i * (delta_ij - y_j).
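To make the Jacobian formula concrete, here is a minimal sketch that builds J explicitly for a single 1-D input. The helper names `softmax` and `softmax_jacobian` are illustrative assumptions, not part of the problem's required interface, and plain Python lists stand in for tensors.

```python
import math

def softmax(x):
    """Numerically stable softmax of a 1-D list of floats."""
    m = max(x)  # subtracting the max leaves the result unchanged
    exps = [math.exp(v - m) for v in x]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_jacobian(x):
    """Return J as a list of rows, with J[i][j] = y_i * (delta_ij - y_j)."""
    y = softmax(x)
    n = len(y)
    return [
        [y[i] * ((1.0 if i == j else 0.0) - y[j]) for j in range(n)]
        for i in range(n)
    ]

J = softmax_jacobian([1.0, 2.0, 3.0])
# Off-diagonal entries are nonzero: each output depends on every input.
for row in J:
    print(["%.4f" % v for v in row])
```

For a batched input, the same construction would apply independently to each row along the last axis.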

For L = sum(y) (i.e. dL/dy = ones), the chain rule simplifies dramatically. Work it out — the result is shockingly clean.
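One way to see the simplification is to write the backward pass for an arbitrary upstream gradient g = dL/dy first. Contracting g with the Jacobian gives dL/dx_j = y_j * (g_j - sum_i g_i * y_i), so the full matrix never has to be formed. The sketch below assumes a 1-D input, and `softmax_backward` is a hypothetical helper name.

```python
import math

def softmax(x):
    """Numerically stable softmax of a 1-D list of floats."""
    m = max(x)
    exps = [math.exp(v - m) for v in x]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_backward(y, grad_y):
    """Vector-Jacobian product: dx_j = y_j * (grad_y_j - sum_i grad_y_i * y_i)."""
    dot = sum(g * yi for g, yi in zip(grad_y, y))
    return [yi * (g - dot) for g, yi in zip(grad_y, y)]

x = [0.5, -1.0, 2.0]
y = softmax(x)

# dL/dy = ones, as in this problem: the inner sum equals sum(y).
grad_ones = softmax_backward(y, [1.0] * len(y))

# A non-uniform upstream gradient, for contrast.
grad_other = softmax_backward(y, [1.0, 0.0, 0.0])

print(grad_ones)
print(grad_other)
```

With g = ones, the inner sum is just sum(y), which is why this particular loss collapses so cleanly.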

Math

\frac{\partial y_{i}}{\partial x_{j}} = y_{i} (\delta_{ij} - y_{j}), \qquad \frac{\partial L}{\partial x_{j}} = \sum_{i} \frac{\partial y_{i}}{\partial x_{j}} \cdot 1 = y_{j} \left(1 - \sum_{i} y_{i} \cdot 1\right) = 0 \quad \text{when } \sum_{i} y_{i} = 1
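A quick numerical sanity check of this result is a central finite difference on L = sum(softmax(x)). This is a sketch under assumptions: the step size `eps` and the helper names `loss` and `numerical_grad` are illustrative choices, and the input vector is arbitrary.

```python
import math

def softmax(x):
    """Numerically stable softmax of a 1-D list of floats."""
    m = max(x)
    exps = [math.exp(v - m) for v in x]
    total = sum(exps)
    return [e / total for e in exps]

def loss(x):
    """L = sum(softmax(x))."""
    return sum(softmax(x))

def numerical_grad(f, x, eps=1e-6):
    """Central-difference estimate of df/dx_j for each j."""
    grad = []
    for j in range(len(x)):
        x_plus = list(x)
        x_minus = list(x)
        x_plus[j] += eps
        x_minus[j] -= eps
        grad.append((f(x_plus) - f(x_minus)) / (2.0 * eps))
    return grad

g = numerical_grad(loss, [0.3, -1.2, 2.5, 0.0])
print(g)  # every entry should be zero up to floating-point noise
```

Because softmax outputs always sum to 1, L is constant in x, so the estimate should vanish for any choice of input.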

204. Backprop: Softmax