Causal Attention Mask (Easy)

Autoregressive models (the GPT family) must not attend to future tokens. A causal mask sets the strict upper triangle of the attention score matrix (entries with j > i) to -inf before the softmax, so that token i can only attend to positions 0..i.
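For N = 3, the mask blocks the three entries above the diagonal. A minimal NumPy sketch of building such a mask (variable names here are illustrative, not part of the required solution):

    import numpy as np

    N = 3
    # True strictly above the diagonal, i.e. at the future positions j > i
    future = np.triu(np.ones((N, N), dtype=bool), k=1)
    scores = np.zeros((N, N))
    scores[future] = -1e9  # a large negative value stands in for -inf before softmax
    # scores is now:
    #    0   -1e9  -1e9
    #    0     0   -1e9
    #    0     0     0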

Implement causal masked attention.

Signature: def causal_attention(Q, K, V)

  • Q, K: (N, d_k); V: (N, d_v) — query, key, and value matrices for the same sequence
  • Returns: (N, d_v) — attention output where position i only attends to positions 0..i

Steps:

  1. Compute scaled dot-product scores: Q @ K.T / sqrt(d_k)
  2. Mask the upper triangle (positions j > i) with -1e9
  3. Apply softmax row-wise
  4. Return weighted sum of V
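
Putting the four steps together, a minimal NumPy sketch (not the official solution; subtracting the row max before exponentiating is an implementation choice for numerical stability):

    import numpy as np

    def causal_attention(Q, K, V):
        """Scaled dot-product attention with a causal mask.
        Q, K: (N, d_k); V: (N, d_v). Returns (N, d_v)."""
        N, d_k = Q.shape
        # 1. Scaled dot-product scores
        scores = Q @ K.T / np.sqrt(d_k)
        # 2. Mask the strict upper triangle (j > i) with a large negative value
        scores = np.where(np.triu(np.ones((N, N), dtype=bool), k=1), -1e9, scores)
        # 3. Row-wise softmax, shifted by the row max for numerical stability
        scores -= scores.max(axis=-1, keepdims=True)
        weights = np.exp(scores)
        weights /= weights.sum(axis=-1, keepdims=True)
        # 4. Weighted sum of the values
        return weights @ V

Subtracting the row maximum leaves the softmax output unchanged but avoids overflow when scores are large.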

Category: Math
Language: Python (NumPy)

Test Results

  • N=3, seed 42 — first token sees only itself
  • N=4, Q=K=V=I — diagonal weight structure
  • N=5, seed 7 — hidden (Premium)
  • Non-negative output when V is non-negative
  • Causal: position 0 output equals V[0] regardless of future V
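
A quick sanity check for the last property, using the causal_attention sketch above (the seed and shapes are illustrative, not the grader's):

    import numpy as np

    rng = np.random.default_rng(0)  # illustrative seed, not the grader's
    Q = K = rng.standard_normal((4, 8))
    V1 = rng.standard_normal((4, 8))
    V2 = V1.copy()
    V2[1:] = rng.standard_normal((3, 8))  # change only the future rows

    # Position 0 attends only to itself, so its output is V[0] either way
    assert np.allclose(causal_attention(Q, K, V1)[0], V1[0])
    assert np.allclose(causal_attention(Q, K, V2)[0], V2[0])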