8. Scaled Dot-Product Attention

Medium

Implement scaled dot-product attention.

Signature: def scaled_dot_product_attention(Q: np.ndarray, K: np.ndarray, V: np.ndarray) -> np.ndarray

Each query attends to all keys: the query-key dot products are divided by sqrt(d_k), softmax-normalized across keys, and then used as weights in a weighted sum over the values. Here d_k is the key dimension. See the math reference below.

Math

\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V
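
One possible way to realize the description above in NumPy is sketched below. It is not a reference solution. It assumes 2-D inputs with shapes Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v), which the problem statement does not specify, and it subtracts the per-row maximum before exponentiating, a common numerical-stability step that leaves the softmax result unchanged.

```python
import numpy as np


def scaled_dot_product_attention(Q: np.ndarray, K: np.ndarray, V: np.ndarray) -> np.ndarray:
    # Assumed shapes (not given in the problem statement):
    #   Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v)
    d_k = K.shape[-1]

    # Query-key similarities, divided by sqrt(d_k): shape (n_q, n_k).
    scores = (Q @ K.T) / np.sqrt(d_k)

    # Softmax across keys (last axis). Subtracting the row maximum
    # avoids overflow in exp and does not change the result.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)

    # Weighted sum over the values: shape (n_q, d_v).
    return weights @ V
```

A quick sanity check under these assumptions: if Q is all zeros, every score is zero, the softmax weights are uniform, and each output row equals the mean of the rows of V.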