Backprop: Softmax + Cross-Entropy (fused)

Hand-derive the gradient of cross-entropy loss applied to softmax outputs, with respect to the logits x.

Forward: compute y = softmax(logits) along the last axis, then the per-sample losses l_b = -sum_c target_{b,c} * log(y_{b,c}). Return the vector of per-sample losses, shape (batch,); the harness then sums it to the scalar L.
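The forward pass described above can be sketched in NumPy as follows. This is a minimal sketch, not a reference solution. It assumes the inputs are array-likes of shape (batch, num_classes) and casts them to float64. Subtracting the row-wise maximum before exponentiating is an added numerical-stability step that the problem statement does not require; it leaves the softmax output unchanged.

```python
import numpy as np

def softmax_ce_forward(logits, target):
    """Per-sample cross-entropy of softmax(logits) against a one-hot target."""
    # Assumption: cast to float64 so the sketch also accepts lists or int arrays.
    logits = np.asarray(logits, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    # Shift each row by its max; softmax is invariant to this shift,
    # and it keeps exp() from overflowing on large logits.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    # log softmax = shifted - log(sum_c exp(shifted))
    log_y = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    # l_b = -sum_c target_{b,c} * log(y_{b,c}), one value per sample
    return -(target * log_y).sum(axis=-1)

# Illustrative inputs (made up for this example): batch of 2, 3 classes.
logits = np.array([[1.0, 2.0, 3.0], [0.0, 0.0, 0.0]])
target = np.array([[0.0, 0.0, 1.0], [1.0, 0.0, 0.0]])
losses = softmax_ce_forward(logits, target)
print(losses.shape)  # (2,)
```

For the second row, all logits are equal, so every class has probability 1/3 and the loss is log(3).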

Implement:

softmax_ce_forward(logits, target) -> losses of shape (batch,)
softmax_ce_backward(logits, target) -> dL/dlogits of shape (batch, num_classes) where L = sum(losses)

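target is a one-hot matrix of the same shape as logits. The well-known result is dL/dlogits = y - target. There is no division by the batch size because L is the sum of the per-sample losses, not the mean. The result relies on each row of target summing to 1, which holds for one-hot rows.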
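A backward pass consistent with the stated result might look like the sketch below, again in NumPy and under the same assumptions as the problem statement (one-hot target, L = sum of per-sample losses). The helpers total_loss and numeric_grad are hypothetical names introduced only for this example; they compare the analytic gradient against a central finite difference and are not part of the required interface.

```python
import numpy as np

def softmax_ce_backward(logits, target):
    """dL/dlogits for L = sum of per-sample softmax cross-entropy losses."""
    logits = np.asarray(logits, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    # Stable softmax: shift by the row max before exponentiating.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    y = exp / exp.sum(axis=-1, keepdims=True)
    # Fused gradient; no batch division because L is a sum, not a mean.
    return y - target

def total_loss(logits, target):
    # Hypothetical helper: scalar L, used only for the numerical check.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_y = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return -(target * log_y).sum()

def numeric_grad(logits, target, eps=1e-6):
    # Hypothetical helper: central finite difference, one logit at a time.
    grad = np.zeros_like(logits)
    for idx in np.ndindex(*logits.shape):
        plus = logits.copy()
        minus = logits.copy()
        plus[idx] += eps
        minus[idx] -= eps
        grad[idx] = (total_loss(plus, target) - total_loss(minus, target)) / (2 * eps)
    return grad

# Illustrative inputs (made up for this example): batch of 2, 3 classes.
logits = np.array([[1.0, 2.0, 3.0], [0.0, 0.0, 0.0]])
target = np.array([[0.0, 0.0, 1.0], [1.0, 0.0, 0.0]])
analytic = softmax_ce_backward(logits, target)
numeric = numeric_grad(logits, target)
max_abs_diff = np.abs(analytic - numeric).max()
print(analytic.shape, max_abs_diff)
```

Each row of the gradient sums to zero, because both y and the one-hot target sum to 1 along the class axis. This gives a cheap sanity check in addition to the finite-difference comparison.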

Math

y = \mathrm{softmax}(x), \quad L = -\sum_b \sum_c t_{b,c} \log y_{b,c}, \quad \frac{\partial L}{\partial x_{b,c}} = y_{b,c} - t_{b,c}
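For readers who want the intermediate steps, one way to reach the gradient is sketched below. It uses the same symbols as the formulas above (x for logits, y for softmax outputs, t for the one-hot target); the summation index k and the Kronecker delta \delta_{kc} are introduced only for this derivation. The last step assumes each row of t sums to 1.

```latex
\begin{align*}
\log y_{b,k} &= x_{b,k} - \log \sum_j e^{x_{b,j}} \\
\frac{\partial \log y_{b,k}}{\partial x_{b,c}} &= \delta_{kc} - y_{b,c} \\
\frac{\partial L}{\partial x_{b,c}}
  &= -\sum_k t_{b,k} \left( \delta_{kc} - y_{b,c} \right) \\
  &= -t_{b,c} + y_{b,c} \sum_k t_{b,k} \\
  &= y_{b,c} - t_{b,c}
  \qquad \text{since } \sum_k t_{b,k} = 1 .
\end{align*}
```

Only the terms of sample b contribute to the derivative with respect to x_{b,c}, which is why the sum over the batch drops out.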
