Implement softmax (dim=-1) as a torch.autograd.Function. Forward returns the probability distribution; backward applies the Jacobian diag(y) - y y^T per row to grad_output.
The rule: you may NOT call F.softmax, torch.softmax, nn.Softmax, or F.log_softmax. Use .exp(), .sum(), .max().
Backward formula (per row, last dim): dL/dx_i = y_i * (grad_output_i - sum_j(grad_output_j * y_j)). Equivalent to y * (g - (g*y).sum(dim=-1, keepdim=True)).
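As a sanity check on the formula above, the following sketch compares the closed form against an explicit Jacobian product for a single row, using NumPy. The helper names `softmax_row`, `vjp_closed_form`, and `vjp_explicit` are illustrative only and are not part of the problem's API.

```python
import numpy as np

def softmax_row(x):
    # Subtract the max before exponentiating for numerical stability.
    e = np.exp(x - x.max())
    return e / e.sum()

def vjp_closed_form(y, g):
    # Closed form from the problem statement: y_i * (g_i - sum_j g_j * y_j).
    return y * (g - np.dot(g, y))

def vjp_explicit(y, g):
    # Build J = diag(y) - y y^T explicitly. J is symmetric, so J @ g == J.T @ g.
    J = np.diag(y) - np.outer(y, y)
    return J @ g

x = np.array([1.0, 2.0, 3.0, 0.5])
g = np.array([0.1, -0.4, 2.0, 1.5])  # arbitrary non-uniform upstream gradient
y = softmax_row(x)

# The two ways of computing the input gradient should agree.
assert np.allclose(vjp_closed_form(y, g), vjp_explicit(y, g))
```

The closed form avoids materializing the n×n Jacobian, which is why it is the version to use in backward.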
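The driver softmax_run(mode, x) dispatches on mode: 'forward' | 'grad_x' | 'gradcheck'. Note: for grad_x we use a non-uniform upstream gradient (a weighted sum of the outputs) so the result is non-zero; a plain .sum() would give an all-zero gradient because each softmax row sums to 1. See the starter code.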
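One possible shape for the Function and the driver is sketched below. It is not the reference solution: the class name `SoftmaxFn`, the weights `1, 2, ..., n` used for the `grad_x` weighted sum, the gradcheck tolerances, and the assumption that `x` arrives as a `torch.Tensor` are all guesses, since the starter code is not shown here.

```python
import torch

class SoftmaxFn(torch.autograd.Function):
    """Softmax over the last dim, without calling any built-in softmax."""

    @staticmethod
    def forward(ctx, x):
        # Shift by the row max so exp() cannot overflow; the result is unchanged.
        shifted = x - x.max(dim=-1, keepdim=True).values
        e = shifted.exp()
        y = e / e.sum(dim=-1, keepdim=True)
        ctx.save_for_backward(y)
        return y

    @staticmethod
    def backward(ctx, grad_output):
        (y,) = ctx.saved_tensors
        # y * (g - (g*y).sum(dim=-1, keepdim=True)), as in the problem statement.
        return y * (grad_output - (grad_output * y).sum(dim=-1, keepdim=True))

def softmax_run(mode, x):
    if mode == "forward":
        return SoftmaxFn.apply(x)
    if mode == "grad_x":
        x = x.detach().clone().requires_grad_(True)
        y = SoftmaxFn.apply(x)
        # ASSUMPTION: weights 1..n along the last dim; the real starter code
        # may use a different non-uniform weighting.
        w = torch.arange(1, x.shape[-1] + 1, dtype=x.dtype, device=x.device)
        (y * w).sum().backward()
        return x.grad
    if mode == "gradcheck":
        # gradcheck compares against finite differences and needs float64.
        xd = x.detach().clone().double().requires_grad_(True)
        return torch.autograd.gradcheck(SoftmaxFn.apply, (xd,), eps=1e-6, atol=1e-4)
    raise ValueError(f"unknown mode: {mode!r}")
```

Saving the output `y` rather than the input `x` keeps backward to a few elementwise operations and one reduction, with no need to recompute the exponentials.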
import numpy as np

def softmax_run(mode, x):
    # mode is one of 'forward', 'grad_x', 'gradcheck'
    pass