TorchedUp

35. Sliding Window Attention (Mistral)

Easy

Mistral uses sliding window attention: each token attends only to the W most recent tokens (including itself). Attention then costs O(N·W) time instead of O(N²), which makes inference over long contexts cheaper. Combined with a rolling KV cache that keeps only the last W keys and values, generation can continue over arbitrarily long sequences while the cache memory stays fixed.
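The rolling KV cache mentioned above can be pictured as a ring buffer. The sketch below is only an illustration under simplifying assumptions: a single head, no positional encoding, and a hypothetical class name `RollingKVCache` that does not come from Mistral's code.

```python
import numpy as np

class RollingKVCache:
    """Hypothetical ring buffer holding keys/values of the last W tokens."""

    def __init__(self, window_size, d_k, d_v):
        self.W = window_size
        self.K = np.zeros((window_size, d_k))
        self.V = np.zeros((window_size, d_v))
        self.count = 0  # total number of tokens appended so far

    def append(self, k, v):
        # The token at position i overwrites slot i % W, evicting token i - W.
        slot = self.count % self.W
        self.K[slot] = k
        self.V[slot] = v
        self.count += 1

    def attend(self, q):
        # Only filled slots take part. Without positional encoding the
        # softmax-weighted sum does not depend on the order of the slots.
        n = min(self.count, self.W)
        K, V = self.K[:n], self.V[:n]
        scores = K @ q / np.sqrt(K.shape[1])
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()
        return weights @ V

rng = np.random.default_rng(0)
cache = RollingKVCache(window_size=4, d_k=8, d_v=8)
for _ in range(10):
    k, v, q = rng.normal(size=(3, 8))
    cache.append(k, v)
    out = cache.attend(q)  # attends to at most the 4 most recent tokens
```

The buffers never grow beyond W rows, which is where the fixed memory cost comes from.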

Implement causal sliding window attention: token i can attend to positions max(0, i-W+1) through i inclusive (no future tokens, and no tokens older than W-1 steps back). Within that window, run the usual scaled dot-product softmax attention.
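To make the window rule concrete, here is a small sketch that builds the allowed-positions mask. The helper name `sliding_window_mask` is an assumption for illustration and is not part of the problem statement.

```python
import numpy as np

def sliding_window_mask(n, window_size):
    """Boolean (n, n) mask: entry [i, j] is True when token i may attend to j."""
    i = np.arange(n)[:, None]  # query positions as a column
    j = np.arange(n)[None, :]  # key positions as a row
    # j <= i blocks future tokens; j > i - W is the same as j >= i - W + 1.
    return (j <= i) & (j > i - window_size)

print(sliding_window_mask(5, 3).astype(int))
# [[1 0 0 0 0]
#  [1 1 0 0 0]
#  [1 1 1 0 0]
#  [0 1 1 1 0]
#  [0 0 1 1 1]]
```

Each row has at most W ones, ending on the diagonal.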

Signature: def sliding_window_attention(Q, K, V, window_size)

  • Q, K: (N, d_k); V: (N, d_v) (d_v may equal d_k)
  • window_size: W
  • Returns: (N, d_v)

Hint: when W ≥ N this reduces to standard causal attention.
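The hint can be checked numerically. The sketch below uses a mask-based formulation, which is one possible approach rather than the reference solution; `masked_attention` is a hypothetical helper name.

```python
import numpy as np

def masked_attention(Q, K, V, mask):
    """Scaled dot-product attention where mask[i, j] == False blocks i -> j."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    scores = np.where(mask, scores, -np.inf)
    # Every row keeps its diagonal entry, so the row maximum is finite.
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

N, d = 6, 4
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, N, d))
idx = np.arange(N)
causal = idx[None, :] <= idx[:, None]
for W in (N, N + 3):
    window = causal & (idx[None, :] > idx[:, None] - W)
    # With W >= N the lower bound i - W + 1 is never positive, so nothing is cut.
    assert np.array_equal(window, causal)
    assert np.allclose(masked_attention(Q, K, V, window),
                       masked_attention(Q, K, V, causal))
```

Note that this version still computes the full N×N score matrix, so it illustrates the equivalence but not the O(N·W) cost.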

Math

$$
\mathrm{out}_i = \mathrm{softmax}\!\left(\frac{Q_i \, K_{[s_i:i]}^{\top}}{\sqrt{d_k}}\right) V_{[s_i:i]}, \qquad s_i = \max(0,\; i - W + 1)
$$
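A direct, row-by-row translation of this formula might look like the sketch below. It is an unoptimized illustration, not the site's reference solution; the name `sliding_window_attention_loop` is hypothetical, and `window_size` is assumed to be at least 1.

```python
import numpy as np

def sliding_window_attention_loop(Q, K, V, window_size):
    """Row-by-row sliding window attention; assumes window_size >= 1."""
    N, d_k = Q.shape
    out = np.zeros((N, V.shape[1]))
    for i in range(N):
        s = max(0, i - window_size + 1)
        # Python slices exclude the end, so s:i + 1 covers positions s..i.
        scores = K[s:i + 1] @ Q[i] / np.sqrt(d_k)
        weights = np.exp(scores - scores.max())  # subtract max for stability
        weights /= weights.sum()
        out[i] = weights @ V[s:i + 1]
    return out

rng = np.random.default_rng(0)
Q = rng.normal(size=(5, 4))
K = rng.normal(size=(5, 4))
V = rng.normal(size=(5, 3))
out = sliding_window_attention_loop(Q, K, V, window_size=2)
print(out.shape)  # (5, 3)
```

Each iteration touches at most W keys and values, so the total work is O(N·W·d) rather than O(N²·d).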

NumPy

import numpy as np

def sliding_window_attention(Q, K, V, window_size):
    # Q, K: (N, d_k); V: (N, d_v); returns an array of shape (N, d_v)
    pass
