Autoregressive models (GPT family) must not attend to future tokens. A causal mask zeroes out the contribution from positions j > i so that token i can only attend to positions 0..i inclusive. The standard recipe is to add a large negative number to those entries before the softmax (use -1e9 here for numerical equivalence with the reference solution).
Implement causal masked attention.
Signature: def causal_attention(Q, K, V)
Q, K, V: (N, d_k) — query, key, value matrices (same sequence)(N, d_v) — attention output where position i only attends to positions 0..iMath
Related problems
Asked at
import numpy as np
def causal_attention(...):
pass
Premium problem
Free accounts include problems #1–20. Upgrade to unlock the editor, hidden test cases, and reference solutions for every problem.
Already premium?