The function below is supposed to compute scaled dot-product attention, but it has a bug. Find and fix it.
Signature: def buggy_attention(Q, K, V)
Q: (N, d_k) — query matrix
K: (N, d_k) — key matrix
V: (N, d_v) — value matrix

Hint: The scaling factor is critical for numerical stability and proper gradient flow. Look carefully at how the attention scores are divided before the softmax.
Math

Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V, where Q K^T has shape (N, N) and d_k is the key dimension.
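Since the buggy implementation itself is not reproduced on this page, below is a minimal NumPy sketch of what a correct version looks like for comparison. The function name `attention_reference` and the max-subtraction softmax trick are my additions, not part of the original problem; the hint suggests the bug lives in the divisor applied to the scores (a common mistake is dividing by d_k instead of sqrt(d_k)), but treat that as an assumption until you see the actual code.

```python
import numpy as np

def attention_reference(Q, K, V):
    """Reference scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V.

    A sketch for comparison only; the buggy version is not shown here.
    """
    d_k = Q.shape[-1]
    # Correct scaling: divide the raw scores by sqrt(d_k).
    # Dividing by d_k (or not scaling at all) is the kind of bug
    # the hint points at: it distorts the softmax inputs and hurts
    # gradient flow.
    scores = Q @ K.T / np.sqrt(d_k)               # (N, N)
    # Numerically stable softmax over the key axis.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V                            # (N, d_v)
```

A quick check: with random inputs, each row of `weights` should sum to 1, and the output shape should be (N, d_v).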