Implement scaled dot-product attention.
Signature: def scaled_dot_product_attention(Q: np.ndarray, K: np.ndarray, V: np.ndarray) -> np.ndarray
Compute softmax(QK^T / sqrt(d_k)) @ V, where d_k is the dimensionality of the key vectors (the last axis of K).