Backprop: Softmax (PyTorch)

Implement softmax (dim=-1) as a torch.autograd.Function. Forward returns the probability distribution; backward applies the Jacobian diag(y) - y y^T per row to grad_output.

The rule: you may NOT call F.softmax, torch.softmax, nn.Softmax, or F.log_softmax. Use .exp(), .sum(), .max().

Backward formula (per row, last dim): dL/dx_i = y_i * (grad_output_i - sum_j(grad_output_j * y_j)). Equivalent to y * (g - (g*y).sum(dim=-1, keepdim=True)).
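
The forward and backward rules above can be sketched as a custom autograd Function. This is a minimal illustration rather than the reference solution: the class name SoftmaxFn is hypothetical, and subtracting the row maximum is a numerical-stability choice assumed here (it does not change the result, since softmax is invariant to a per-row shift).

```python
import torch


class SoftmaxFn(torch.autograd.Function):
    """Softmax over the last dim with a hand-written backward (hypothetical name)."""

    @staticmethod
    def forward(ctx, x):
        # Shift by the row max so exp() cannot overflow; the output is unchanged.
        shifted = x - x.max(dim=-1, keepdim=True).values
        e = shifted.exp()
        y = e / e.sum(dim=-1, keepdim=True)
        # The backward pass only needs the output y, not the input x.
        ctx.save_for_backward(y)
        return y

    @staticmethod
    def backward(ctx, grad_output):
        (y,) = ctx.saved_tensors
        # Per-row inner product sum_j(grad_output_j * y_j), kept broadcastable.
        dot = (grad_output * y).sum(dim=-1, keepdim=True)
        # Applies (diag(y) - y y^T) to grad_output without building the Jacobian.
        return y * (grad_output - dot)
```

Calling SoftmaxFn.apply(x) runs the forward pass. For a numerical check, torch.autograd.gradcheck(SoftmaxFn.apply, (x,)) is normally run with a double-precision x that has requires_grad=True.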

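The driver softmax_run(mode, x) dispatches on mode: 'forward', 'grad_x', or 'gradcheck'. Note: for grad_x we use a non-uniform upstream gradient (a weighted sum of the outputs) so the result is non-zero. A uniform upstream gradient would give zero up to rounding, because each row of y sums to 1. See the starter code.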
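
To see why the upstream gradient must be non-uniform, the sketch below compares the two cases. It is not the starter code: the helper softmax_last_dim, the sample input, and the weights w are assumptions made for illustration, and the gradients come from ordinary autograd rather than a custom Function.

```python
import torch


def softmax_last_dim(x):
    # Built from plain differentiable ops (no F.softmax), so autograd supplies the gradient.
    e = (x - x.max(dim=-1, keepdim=True).values).exp()
    return e / e.sum(dim=-1, keepdim=True)


x = torch.tensor(
    [[1.0, 2.0, 3.0], [0.5, -1.0, 2.0]], dtype=torch.double, requires_grad=True
)

# Uniform upstream gradient: L = sum(y), so dL/dy is all ones.
# Each row of y sums to 1, so L is constant in x and the gradient vanishes.
(grad_uniform,) = torch.autograd.grad(softmax_last_dim(x).sum(), x)

# Non-uniform upstream gradient: L = sum(w * y), with hypothetical weights w.
w = torch.arange(1, 4, dtype=torch.double)
(grad_weighted,) = torch.autograd.grad((softmax_last_dim(x) * w).sum(), x)

print(grad_uniform.abs().max())   # zero up to floating-point rounding
print(grad_weighted.abs().max())  # clearly non-zero
```

Each row of grad_weighted still sums to zero up to rounding, because shifting every entry of a row of x by the same amount leaves the softmax output unchanged.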

Math

y_{i} = \frac{e^{x_{i}}}{\sum_{j} e^{x_{j}}}, \qquad \frac{\partial L}{\partial x_{i}} = y_{i} \left( \frac{\partial L}{\partial y_{i}} - \sum_{j} y_{j} \frac{\partial L}{\partial y_{j}} \right)
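
For readers who want the intermediate step, the gradient formula follows from the softmax Jacobian and the chain rule. The Kronecker delta \delta_{ij} (1 if i = j, else 0) is notation introduced here and does not appear in the original statement.

```latex
\frac{\partial y_{j}}{\partial x_{i}} = y_{j} \left( \delta_{ij} - y_{i} \right)

\frac{\partial L}{\partial x_{i}}
  = \sum_{j} \frac{\partial L}{\partial y_{j}} \frac{\partial y_{j}}{\partial x_{i}}
  = \sum_{j} \frac{\partial L}{\partial y_{j}} \, y_{j} \left( \delta_{ij} - y_{i} \right)
  = y_{i} \frac{\partial L}{\partial y_{i}} - y_{i} \sum_{j} y_{j} \frac{\partial L}{\partial y_{j}}
  = y_{i} \left( \frac{\partial L}{\partial y_{i}} - \sum_{j} y_{j} \frac{\partial L}{\partial y_{j}} \right)
```

In matrix form the first line is the Jacobian diag(y) - y y^T, which is symmetric, so applying it to grad_output gives the same per-row result.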
