Autoregressive models (GPT family) must not attend to future tokens. A causal mask sets the strict upper triangle of the attention score matrix (entries with j > i) to -inf before the softmax, ensuring token i can only attend to positions 0..i.
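For instance, the mask can be built as a strict upper-triangular matrix; a minimal NumPy sketch (the names and sizes here are illustrative, not part of the problem statement):

```python
import numpy as np

N = 4                                  # example sequence length
mask = np.triu(np.ones((N, N)), k=1)   # 1s strictly above the diagonal (j > i)

scores = np.random.randn(N, N)         # stand-in attention scores
scores[mask == 1] = -np.inf            # masked positions contribute 0 after softmax
```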
Implement causal masked attention.
Signature: def causal_attention(Q, K, V)
Q, K: (N, d_k) — query and key matrices (same sequence)
V: (N, d_v) — value matrix
Returns: (N, d_v) — attention output where position i only attends to positions 0..i

Steps:
1. Compute scaled scores: Q @ K.T / sqrt(d_k)
2. Mask future positions (j > i) with -1e9
3. Apply a row-wise softmax
4. Multiply the resulting weights by V

Math
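The standard scaled dot-product attention with a causal mask, matching the steps above:

\[
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\mathrm{mask}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)\right)V,
\qquad
\mathrm{mask}(S)_{ij} =
\begin{cases}
S_{ij} & j \le i\\
-\infty & j > i
\end{cases}
\]

One way the steps might come together, assuming NumPy array inputs (the row-max subtraction before the softmax is an optional numerical-stability detail, not required by the problem):

```python
import numpy as np

def causal_attention(Q, K, V):
    """Causal scaled dot-product attention for a single sequence.

    Q, K: (N, d_k); V: (N, d_v). Returns (N, d_v).
    """
    N, d_k = Q.shape

    # 1. Scaled attention scores: (N, N)
    scores = Q @ K.T / np.sqrt(d_k)

    # 2. Mask future positions (j > i) with a large negative value
    future = np.triu(np.ones((N, N), dtype=bool), k=1)
    scores = np.where(future, -1e9, scores)

    # 3. Row-wise softmax (subtract the row max for numerical stability)
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)

    # 4. Weighted sum of values: (N, d_v)
    return weights @ V
```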