TorchedUp

75. Causal Attention Mask

Easy

Autoregressive models (GPT family) must not attend to future tokens. A causal mask removes the contribution from positions j > i so that token i can only attend to positions 0..i inclusive. The standard recipe is to add a large negative number to those entries of the score matrix before the softmax, which drives their attention weights to effectively zero (use -1e9 here for numerical equivalence with the reference solution).
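As an illustration of the additive mask described above, here is a minimal NumPy sketch. The helper name `causal_mask` and its `neg` parameter are hypothetical and not part of the problem statement. It only builds the mask and does not compute attention.

```python
import numpy as np

def causal_mask(n, neg=-1e9):
    """Hypothetical helper: additive mask with 0 where j <= i and `neg` where j > i."""
    # np.triu with k=1 keeps only the entries strictly above the diagonal,
    # i.e. the "future" positions j > i for each row i.
    future = np.triu(np.ones((n, n), dtype=bool), k=1)
    return np.where(future, neg, 0.0)

M = causal_mask(4)
print(M)
```

Row i of `M` is zero up to and including column i, so adding it to the scores leaves the allowed positions unchanged and pushes the future positions far below them.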

Implement causal masked attention.

Signature: def causal_attention(Q, K, V)

  • Q, K, V: (N, d_k), the query, key, and value matrices for the same sequence (so d_v = d_k here)
  • Returns: (N, d_v), the attention output where position i only attends to positions 0..i

Math

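\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^\top}{\sqrt{d_k}} + M\right)V, \qquad M_{ij} = \begin{cases} 0 & j \le i \\ -10^{9} & j > i \end{cases}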
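The formula above can be sketched in NumPy as follows. This is one possible approach under the stated shapes, not the reference solution. The row-wise max subtraction is an added assumption for numerical stability and does not change the softmax result.

```python
import numpy as np

def causal_attention(Q, K, V):
    """Sketch of causal masked attention for Q, K, V of shape (N, d)."""
    n, d_k = Q.shape
    # Scaled dot-product scores, shape (N, N).
    scores = Q @ K.T / np.sqrt(d_k)
    # Additive causal mask: -1e9 strictly above the diagonal (j > i), 0 elsewhere.
    scores = scores + np.triu(np.full((n, n), -1e9), k=1)
    # Row-wise softmax; subtracting the row max avoids overflow in exp.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((5, 3)) for _ in range(3))
out = causal_attention(Q, K, V)
print(out.shape)
```

A quick sanity check is that the first output row equals `V[0]`, since position 0 can only attend to itself, and that changing the last rows of `K` or `V` leaves all earlier output rows unchanged.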

Related problems

  • Causal Attention Mask (PyTorch) (Easy, PyTorch)

NumPy

import numpy as np

def causal_attention(Q, K, V):
    # Q, K, V: (N, d_k) arrays; return the (N, d_v) causal attention output.
    pass
