Autoregressive models (GPT family) must not attend to future tokens. A causal mask sets the strict upper triangle of the attention score matrix (entries with j > i) to -inf before the softmax, ensuring token i can only attend to positions 0..i.
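For instance, the mask can be built as a strict upper-triangular matrix; a minimal NumPy sketch (the names and sizes here are illustrative, not part of the problem statement):

```python
import numpy as np

N = 4                                  # example sequence length
mask = np.triu(np.ones((N, N)), k=1)   # 1s strictly above the diagonal (j > i)

scores = np.random.randn(N, N)         # stand-in attention scores
scores[mask == 1] = -np.inf            # masked positions contribute 0 after softmax
```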
Implement causal masked attention.
Signature: def causal_attention(Q, K, V)
Q, K: (N, d_k) — query and key matrices (same sequence)
V: (N, d_v) — value matrix
Returns: (N, d_v) — attention output where position i only attends to positions 0..i

Steps:
1. Compute scaled scores: Q @ K.T / sqrt(d_k)
2. Mask future positions (j > i) with -1e9
3. Apply a row-wise softmax
4. Multiply the resulting weights by V

Math
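The standard scaled dot-product attention with a causal mask, matching the steps above:

\[
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\mathrm{mask}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)\right)V,
\qquad
\mathrm{mask}(S)_{ij} =
\begin{cases}
S_{ij} & j \le i\\
-\infty & j > i
\end{cases}
\]

One way the steps might come together, assuming NumPy array inputs (the row-max subtraction before the softmax is an optional numerical-stability detail, not required by the problem):

```python
import numpy as np

def causal_attention(Q, K, V):
    """Causal scaled dot-product attention for a single sequence.

    Q, K: (N, d_k); V: (N, d_v). Returns (N, d_v).
    """
    N, d_k = Q.shape

    # 1. Scaled attention scores: (N, N)
    scores = Q @ K.T / np.sqrt(d_k)

    # 2. Mask future positions (j > i) with a large negative value
    future = np.triu(np.ones((N, N), dtype=bool), k=1)
    scores = np.where(future, -1e9, scores)

    # 3. Row-wise softmax (subtract the row max for numerical stability)
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)

    # 4. Weighted sum of values: (N, d_v)
    return weights @ V
```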