Backprop: Attention head

Hand-derive the gradient of L = sum(Attention(Q, K, V)) w.r.t. Q (queries), with K and V held fixed.

Forward (single head, no mask):

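S = Q @ K.T / sqrt(d_k); P = softmax(S), applied row by row; y = P @ V.

Q and K have shapes (N_q, d_k) and (N_k, d_k); V has shape (N_k, d_v); S and P have shape (N_q, N_k); y has shape (N_q, d_v).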

Implement:

Chain (with L = sum(y) so dL/dy = ones(N_q, d_v)):

dL/dP = ones @ V.T, with shape (N_q, N_k). Every row holds the row sums of V: dL/dP[i, j] = sum_c V[j, c].
Per-row softmax backward: dL/dS[i, :] = P[i, :] * (dL/dP[i, :] - sum_j P[i, j] * dL/dP[i, j]).
dL/dQ = (dL/dS) @ K / sqrt(d_k).
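
The chain above can be sketched in NumPy and checked against central finite differences. This is a minimal sketch, not a reference solution: it assumes the forward pass S = Q @ K.T / sqrt(d_k), P = softmax(S) row-wise, y = P @ V, and the function names (softmax_rows, attention_forward, grad_Q_sum_attention, numerical_grad_Q) are hypothetical names chosen for illustration.

```python
import numpy as np

# All function names below are illustrative; they do not come from the problem statement.

def softmax_rows(S):
    # Subtract each row's max before exponentiating for numerical stability.
    Z = np.exp(S - S.max(axis=1, keepdims=True))
    return Z / Z.sum(axis=1, keepdims=True)

def attention_forward(Q, K, V):
    # Assumed forward pass: scaled dot-product attention, single head, no mask.
    d_k = Q.shape[1]
    S = Q @ K.T / np.sqrt(d_k)      # (N_q, N_k)
    P = softmax_rows(S)             # (N_q, N_k)
    return P @ V, P                 # y has shape (N_q, d_v)

def grad_Q_sum_attention(Q, K, V):
    # Hand-derived gradient of L = sum(y) with respect to Q.
    N_q, d_k = Q.shape
    d_v = V.shape[1]
    _, P = attention_forward(Q, K, V)
    dY = np.ones((N_q, d_v))                              # dL/dy
    dP = dY @ V.T                                         # dL/dP, (N_q, N_k)
    dS = P * (dP - (P * dP).sum(axis=1, keepdims=True))   # per-row softmax backward
    return dS @ K / np.sqrt(d_k)                          # dL/dQ, (N_q, d_k)

def numerical_grad_Q(Q, K, V, eps=1e-6):
    # Central finite differences, one entry of Q at a time.
    G = np.zeros_like(Q)
    for idx in np.ndindex(*Q.shape):
        Qp, Qm = Q.copy(), Q.copy()
        Qp[idx] += eps
        Qm[idx] -= eps
        L_plus = attention_forward(Qp, K, V)[0].sum()
        L_minus = attention_forward(Qm, K, V)[0].sum()
        G[idx] = (L_plus - L_minus) / (2 * eps)
    return G

rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 4))   # N_q = 3, d_k = 4
K = rng.normal(size=(5, 4))   # N_k = 5
V = rng.normal(size=(5, 2))   # d_v = 2

analytic = grad_Q_sum_attention(Q, K, V)
numeric = numerical_grad_Q(Q, K, V)
print("max abs difference:", np.max(np.abs(analytic - numeric)))
```

The finite-difference loop is slow and only meant as a sanity check on small shapes. If the two gradients disagree, the usual suspects are a missing 1/sqrt(d_k) factor or a softmax taken along the wrong axis.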

Math

y = \mathrm{softmax}\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V, \qquad \frac{\partial L}{\partial Q} = \frac{1}{\sqrt{d_k}} \frac{\partial L}{\partial S} K
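
The per-row softmax backward step follows from the softmax Jacobian. A short derivation sketch, writing G = \partial L / \partial P and assuming S denotes the scaled scores Q K^{\top} / \sqrt{d_k} (the symbol G and the indices i, j, l, a are introduced here for illustration):

```latex
\begin{aligned}
\frac{\partial P_{ij}}{\partial S_{il}} &= P_{ij}\,(\delta_{jl} - P_{il}), \\
\frac{\partial L}{\partial S_{il}} &= \sum_{j} G_{ij}\, P_{ij}\,(\delta_{jl} - P_{il})
  = P_{il}\Bigl(G_{il} - \sum_{j} P_{ij}\, G_{ij}\Bigr), \\
\frac{\partial L}{\partial Q_{ia}} &= \frac{1}{\sqrt{d_k}} \sum_{l} \frac{\partial L}{\partial S_{il}}\, K_{la}.
\end{aligned}
```

Rows of S do not interact, because row i of P depends only on row i of S. That is why the softmax backward step can be applied one row at a time.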
