Hand-derive the gradient of L = sum(Attention(Q, K, V)) w.r.t. Q (queries), with K and V held fixed.
Forward (single head, no mask):
S = Q @ K.T / sqrt(d_k)P = softmax(S) along the last axisy = P @ VQ, K have shape (N_q, d_k) and (N_k, d_k); V has shape (N_k, d_v); y has shape (N_q, d_v).
Implement:
attention_forward(Q, K, V) -> yattention_backward(Q, K, V) -> dL/dQ of shape (N_q, d_k)Chain (with L = sum(y) so dL/dy = ones(N_q, d_v)):
dL/dP = ones @ V.T — each row is the column sums of V.dL/dS[i, :] = P[i, :] * (dL/dP[i, :] - sum_j P[i, j] * dL/dP[i, j]).dL/dQ = (dL/dS) @ K / sqrt(d_k).Math
Asked at
import numpy as np
def attention_forward(...):
pass
Premium problem
Free accounts include problems #1–20. Upgrade to unlock the editor, hidden test cases, and reference solutions for every problem.
Already premium?