TorchedUp
LearnBetaProblemsSystem DesignSoonPremium
TorchedUp
LearnBetaProblemsSystem DesignSoonPremium
←

210. Backprop: Attention head

Hard

Hand-derive the gradient of L = sum(Attention(Q, K, V)) w.r.t. Q (queries), with K and V held fixed.

Forward (single head, no mask):

  • Scores: S = Q @ K.T / sqrt(d_k)
  • Probs: P = softmax(S) along the last axis
  • Output: y = P @ V

Q, K have shape (N_q, d_k) and (N_k, d_k); V has shape (N_k, d_v); y has shape (N_q, d_v).

Implement:

  • attention_forward(Q, K, V) -> y
  • attention_backward(Q, K, V) -> dL/dQ of shape (N_q, d_k)

Chain (with L = sum(y) so dL/dy = ones(N_q, d_v)):

  1. dL/dP = ones @ V.T — each row is the column sums of V.
  2. Per-row softmax backward: dL/dS[i, :] = P[i, :] * (dL/dP[i, :] - sum_j P[i, j] * dL/dP[i, j]).
  3. dL/dQ = (dL/dS) @ K / sqrt(d_k).

Math

y=softmax(dk​​QK⊤​)V,∂Q∂L​=dk​​1​∂S∂L​K

Asked at

NumPy

import numpy as np

 

def attention_forward(...):

    pass

🔒

Premium problem

Free accounts include problems #1–20. Upgrade to unlock the editor, hidden test cases, and reference solutions for every problem.

Upgrade to PremiumBack to problems

Already premium?