Sliding Window Attention (Mistral)

Difficulty: Easy

Mistral uses sliding window attention: each token attends only to the W most recent tokens (including itself). This is O(N·W) instead of O(N²), enabling efficient inference over long contexts. Combined with a rolling KV cache, it supports effectively infinite context at fixed memory cost.
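
The rolling KV cache mentioned above is what pins memory at O(W): instead of appending keys and values forever, each new entry overwrites the oldest slot modulo W. A minimal ring-buffer sketch (the class name and layout here are illustrative, not Mistral's actual implementation):

    import numpy as np

    class RollingKVCache:
        # Ring buffer holding at most `window` key/value rows. Illustrative only.
        def __init__(self, window, d_k, d_v):
            self.window = window
            self.k = np.zeros((window, d_k))
            self.v = np.zeros((window, d_v))
            self.count = 0  # total tokens seen so far

        def append(self, k_row, v_row):
            slot = self.count % self.window  # overwrites the oldest entry
            self.k[slot] = k_row
            self.v[slot] = v_row
            self.count += 1

        def window_kv(self):
            # Cached K/V in chronological order (oldest first).
            n = min(self.count, self.window)
            start = self.count % self.window if self.count > self.window else 0
            order = [(start + j) % self.window for j in range(n)]
            return self.k[order], self.v[order]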

Implement causal sliding window attention: token i can attend to positions max(0, i-W+1) through i inclusive.

For each query i:
    start = max(0, i - W + 1)
    K_local = K[start:i+1]
    V_local = V[start:i+1]
    scores  = Q[i] @ K_local.T / sqrt(d_k)
    weights = softmax(scores)
    out[i]  = weights @ V_local

Signature: def sliding_window_attention(Q, K, V, window_size)

  • Q, K: (N, d_k); V: (N, d_v)
  • window_size: W (a positive integer)
  • Returns: (N, d_v)

Hint: when W ≥ N this reduces to standard causal attention.
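
A reference implementation in NumPy, written as a direct translation of the per-query loop above (a sketch with the softmax inlined in numerically stable form; not the grader's official solution):

    import numpy as np

    def sliding_window_attention(Q, K, V, window_size):
        N, d_k = Q.shape
        out = np.zeros((N, V.shape[1]))
        for i in range(N):
            start = max(0, i - window_size + 1)   # left edge of the window
            scores = Q[i] @ K[start:i + 1].T / np.sqrt(d_k)
            scores -= scores.max()                # stabilize the softmax
            weights = np.exp(scores)
            weights /= weights.sum()
            out[i] = weights @ V[start:i + 1]
        return out

The loop is O(N·W) in time; a vectorized alternative would score the full (N, N) matrix under a banded causal mask, trading O(N²) memory back for speed.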

Tags: Math
Language: Python (numpy)

Test Results

  • seed 42, W=2, N=5, d=4
  • seed 42, W=N=5, reduces to full causal attention
  • seed 7, W=3, N=6, d=8 (Premium)
  • non-negative output when V is non-negative
  • window locality: output[0] = V[0], independent of V[1:]
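
Reusing the reference above, the unlocked properties are easy to sanity-check. This sketch assumes the harness draws Q, K, V from NumPy's default generator with the named seed, which is a guess:

    import numpy as np

    rng = np.random.default_rng(42)               # "seed 42" cases (RNG assumed)
    Q, K, V = (rng.standard_normal((5, 4)) for _ in range(3))

    out = sliding_window_attention(Q, K, V, window_size=2)
    assert np.allclose(out[0], V[0])              # token 0 attends only to itself

    # W >= N collapses to full causal attention, so any two such W agree
    assert np.allclose(sliding_window_attention(Q, K, V, 5),
                       sliding_window_attention(Q, K, V, 99))

    # outputs are convex combinations of rows of V, so non-negative V stays non-negative
    assert (sliding_window_attention(Q, K, np.abs(V), 2) >= 0).all()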