Hand-derive the gradient of L = sum(softmax(x)) w.r.t. x.
Forward: y_i = exp(x_i) / sum_j exp(x_j) along the last axis. L = sum(y).
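A minimal sketch of the forward pass, assuming NumPy arrays as input. The max-subtraction step is an addition not mentioned in the problem statement: it leaves the result unchanged mathematically and only guards against overflow in exp.

```python
import numpy as np

def softmax_forward(x):
    """Softmax along the last axis; output has the same shape as x."""
    x = np.asarray(x, dtype=np.float64)
    # Assumption: shift by the per-row max for numerical stability.
    # exp(x_i - m) / sum_j exp(x_j - m) equals exp(x_i) / sum_j exp(x_j).
    shifted = x - np.max(x, axis=-1, keepdims=True)
    e = np.exp(shifted)
    return e / np.sum(e, axis=-1, keepdims=True)

# Example inputs (illustrative values, not from the problem's test cases).
x = np.array([[1.0, 2.0, 3.0], [1000.0, 1000.0, 1000.0]])
y = softmax_forward(x)
```

Each row of y sums to 1, and the second row stays finite even though exp(1000) would overflow without the shift.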
Implement:
- softmax_forward(x) -> y of the same shape (softmax along last axis)
- softmax_backward(x) -> dL/dx of the same shape

Softmax has a non-diagonal Jacobian: every output element depends on every input element. The full Jacobian is J_ij = y_i * (delta_ij - y_j).
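To make the non-diagonal structure concrete, here is a small sketch that builds the full Jacobian for a single 1-D input. The helper name softmax_jacobian is hypothetical and not part of the required interface.

```python
import numpy as np

def softmax_jacobian(x):
    """Full Jacobian J[i, j] = y_i * (delta_ij - y_j) for a 1-D input x.

    Hypothetical helper for illustration only.
    """
    x = np.asarray(x, dtype=np.float64)
    e = np.exp(x - np.max(x))
    y = e / e.sum()
    # diag(y) supplies the y_i * delta_ij term; outer(y, y) supplies y_i * y_j.
    return np.diag(y) - np.outer(y, y)

J = softmax_jacobian(np.array([0.5, -1.0, 2.0]))
off_diagonal = J[~np.eye(3, dtype=bool)]
```

Every off-diagonal entry is -y_i * y_j, which is strictly negative, so no input affects only its own output.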
For L = sum(y) (i.e. dL/dy = ones), the chain rule simplifies dramatically. Work it out — the result is shockingly clean.
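One way to work it out, using the Jacobian J_ij = y_i * (delta_ij - y_j) and dL/dy_i = 1:

dL/dx_j = sum_i J_ij = sum_i y_i * (delta_ij - y_j) = y_j - y_j * sum_i y_i = y_j - y_j = 0,

because sum_i y_i = 1. This matches the observation that each softmax row always sums to 1, so L is constant in x. The sketch below implements that result and checks it against central finite differences. The helper numerical_grad and the test input are assumptions for illustration, not part of the problem's interface.

```python
import numpy as np

def softmax_forward(x):
    """Softmax along the last axis (max-shifted for stability)."""
    x = np.asarray(x, dtype=np.float64)
    e = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return e / np.sum(e, axis=-1, keepdims=True)

def softmax_backward(x):
    """dL/dx for L = sum(softmax(x)); identically zero by the derivation above."""
    x = np.asarray(x, dtype=np.float64)
    return np.zeros_like(x)

def numerical_grad(x, eps=1e-6):
    """Central-difference gradient of L = sum(softmax(x)). Hypothetical checker."""
    x = np.asarray(x, dtype=np.float64)
    grad = np.zeros_like(x)
    for idx in np.ndindex(*x.shape):
        xp, xm = x.copy(), x.copy()
        xp[idx] += eps
        xm[idx] -= eps
        grad[idx] = (softmax_forward(xp).sum() - softmax_forward(xm).sum()) / (2 * eps)
    return grad

x = np.random.default_rng(0).normal(size=(2, 4))
analytic = softmax_backward(x)
numeric = numerical_grad(x)
```

The finite-difference gradient agrees with the all-zeros result up to floating-point noise.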
import numpy as np

def softmax_forward(...):
    pass