TorchedUp

204. Backprop: Softmax

Medium

Hand-derive the gradient of L = sum(softmax(x)) w.r.t. x.

Forward: y_i = exp(x_i) / sum_j exp(x_j) along the last axis. L = sum(y).

Implement:

  • softmax_forward(x) -> y of the same shape (softmax along last axis)
  • softmax_backward(x) -> dL/dx of the same shape
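As a starting point for the forward pass, here is a minimal sketch of a softmax along the last axis. The max-subtraction step is an assumption added for numerical stability; it is not required by the problem statement and does not change the result, because softmax is invariant to adding a constant to every element of a row.

```python
import numpy as np

def softmax_forward(x):
    # Sketch only: softmax along the last axis, output has the same shape as x.
    x = np.asarray(x, dtype=np.float64)
    # Assumption: shift by the row maximum so exp() cannot overflow.
    shifted = x - np.max(x, axis=-1, keepdims=True)
    e = np.exp(shifted)
    return e / np.sum(e, axis=-1, keepdims=True)

# Usage: every slice along the last axis sums to 1.
x = np.array([[1.0, 2.0, 3.0], [1000.0, 1000.0, 1000.0]])
y = softmax_forward(x)
row_sums = y.sum(axis=-1)
```

Using keepdims=True keeps the reduced axis so that the subtraction and division broadcast correctly for inputs of any rank.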

Softmax has a non-diagonal Jacobian: along the last axis, every output element depends on every input element. The full Jacobian is J_ij = dy_i/dx_j = y_i * (delta_ij - y_j).

For L = sum(y) (i.e. dL/dy = ones), the chain rule simplifies dramatically. Work it out — the result is shockingly clean.
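To make the chain rule concrete, the sketch below builds the full Jacobian for a single 1-D input and contracts it with an upstream gradient of ones. The helper names softmax_1d and softmax_jacobian are hypothetical and chosen only for this illustration; the explicit Jacobian is meant for checking a hand derivation, not as an efficient implementation.

```python
import numpy as np

def softmax_1d(x):
    # Hypothetical helper: stable softmax for a 1-D vector.
    z = np.exp(x - np.max(x))
    return z / np.sum(z)

def softmax_jacobian(x):
    # J[i, j] = y_i * (delta_ij - y_j), written as diag(y) - outer(y, y).
    y = softmax_1d(x)
    return np.diag(y) - np.outer(y, y)

x = np.array([0.5, -1.0, 2.0])
J = softmax_jacobian(x)

# For L = sum(y), the upstream gradient dL/dy is a vector of ones.
upstream = np.ones_like(x)

# Chain rule: dL/dx_j = sum_i (dL/dy_i) * J[i, j], i.e. the column sums of J.
grad = J.T @ upstream
print(grad)
```

Each column of J sums to y_j * (1 - sum(y)), so the printed gradient should be zero up to floating-point rounding.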

Math

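$$
\frac{\partial y_i}{\partial x_j} = y_i\,(\delta_{ij} - y_j), \qquad
\frac{\partial L}{\partial x_j} = \sum_i \frac{\partial y_i}{\partial x_j} = y_j\Bigl(1 - \sum_i y_i \cdot 1\Bigr) = 0 \quad \text{when } \sum_i y_i = 1
$$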
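The formula above can be checked numerically. The sketch below is one possible softmax_backward that applies the vector-Jacobian product with an upstream gradient of ones, plus a central finite-difference check. The helper numerical_grad and its eps default are assumptions made for this illustration, and whether the site's hidden tests expect exact zeros or values within a tolerance is not stated in the problem.

```python
import numpy as np

def softmax_forward(x):
    # Softmax along the last axis (shifted by the row max for stability).
    x = np.asarray(x, dtype=np.float64)
    e = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return e / np.sum(e, axis=-1, keepdims=True)

def softmax_backward(x):
    # Sketch: dL/dx for L = sum(softmax(x)), so dL/dy is all ones.
    y = softmax_forward(x)
    upstream = np.ones_like(y)
    # dL/dx_j = y_j * (g_j - sum_i g_i * y_i), taken along the last axis.
    inner = np.sum(upstream * y, axis=-1, keepdims=True)
    return y * (upstream - inner)

def numerical_grad(x, eps=1e-6):
    # Hypothetical helper: central differences on L = sum(softmax(x)).
    x = np.asarray(x, dtype=np.float64)
    grad = np.zeros_like(x)
    for idx in np.ndindex(x.shape):
        xp = x.copy()
        xm = x.copy()
        xp[idx] += eps
        xm[idx] -= eps
        loss_plus = np.sum(softmax_forward(xp))
        loss_minus = np.sum(softmax_forward(xm))
        grad[idx] = (loss_plus - loss_minus) / (2.0 * eps)
    return grad

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 3))
analytic = softmax_backward(x)
numeric = numerical_grad(x)
```

Both arrays should be zero up to rounding error, since each row of softmax(x) always sums to 1 and L is therefore constant in x.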

Related problems

  • Backprop: Softmax (PyTorch) · Medium · PyTorch

NumPy

import numpy as np

def softmax_forward(x):
    # TODO: return softmax(x) along the last axis, same shape as x.
    pass
