The function below is supposed to compute scaled dot-product attention, but it has a bug. Find and fix it.
Signature: def buggy_attention(Q, K, V)
Q: (N, d_k) — query matrix
K: (N, d_k) — key matrix
V: (N, d_v) — value matrix

Hint: The scaling factor is critical for numerical stability and proper gradient flow. Look carefully at how the attention scores are divided before the softmax.
Math

Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V, where Q K^T has shape (N, N) and d_k is the key dimension.
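Since the buggy implementation itself is not reproduced on this page, below is a minimal NumPy sketch of what a correct version looks like for comparison. The function name `attention_reference` and the max-subtraction softmax trick are my additions, not part of the original problem; the hint suggests the bug lives in the divisor applied to the scores (a common mistake is dividing by d_k instead of sqrt(d_k)), but treat that as an assumption until you see the actual code.

```python
import numpy as np

def attention_reference(Q, K, V):
    """Reference scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V.

    A sketch for comparison only; the buggy version is not shown here.
    """
    d_k = Q.shape[-1]
    # Correct scaling: divide the raw scores by sqrt(d_k).
    # Dividing by d_k (or not scaling at all) is the kind of bug
    # the hint points at: it distorts the softmax inputs and hurts
    # gradient flow.
    scores = Q @ K.T / np.sqrt(d_k)               # (N, N)
    # Numerically stable softmax over the key axis.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V                            # (N, d_v)
```

A quick check: with random inputs, each row of `weights` should sum to 1, and the output shape should be (N, d_v).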