
Attention Sinks (StreamingLLM)

Implement attention with sinks as described in StreamingLLM (Xiao et al. 2023). This enables LLMs to run on sequences longer than their training context window by keeping only:

  1. The first n_sink tokens (called attention sinks)
  2. A sliding window of the most recent window_size tokens

Signature: def attention_with_sinks(Q, K, V, n_sink, window_size)

  • Q, K, V: (S, d) — query/key/value matrices
  • n_sink: number of initial sink tokens to always keep
  • window_size: number of recent tokens in the sliding window
  • Returns: (S, d) — attention output

Attention Mask

For query position i, the set of visible key positions is:

visible = {j : j < n_sink} ∪ {j : max(0, i - window_size + 1) ≤ j ≤ i}

Build a mask: mask[i, j] = 0 if visible, -1e9 otherwise.
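
As a concrete sketch, the mask can be built with numpy broadcasting (the helper name build_sink_mask is illustrative, not part of the required interface):

import numpy as np

def build_sink_mask(S, n_sink, window_size):
    # i indexes query positions (column vector), j indexes key positions (row vector)
    i = np.arange(S)[:, None]
    j = np.arange(S)[None, :]
    # visible = sink tokens, or the sliding window ending at position i
    visible = (j < n_sink) | ((j >= np.maximum(0, i - window_size + 1)) & (j <= i))
    return np.where(visible, 0.0, -1e9)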

Then: softmax((Q @ K.T / sqrt(d)) + mask) @ V
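
Putting it together, one possible implementation of the required signature (a sketch reusing build_sink_mask from above, not the reference solution):

def attention_with_sinks(Q, K, V, n_sink, window_size):
    S, d = Q.shape
    mask = build_sink_mask(S, n_sink, window_size)
    scores = Q @ K.T / np.sqrt(d) + mask
    # numerically stable softmax over the key axis
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V

The -1e9 constant follows the problem statement; masked positions end up with vanishing weight after the softmax.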

Why Sinks?

The first few tokens receive disproportionately high attention even when they are semantically irrelevant: softmax weights must sum to 1, and because the initial tokens are visible to nearly every later token during training, the model learns to dump excess attention mass on them. Keeping these sink tokens during streaming inference stabilizes the attention distribution and prevents the sharp performance degradation that otherwise occurs once they slide out of a plain windowed cache.

Language

Python (numpy)

Test Results

○ seed=42, S=5, d=4, n_sink=1, window=2
○ n_sink=0, window=1 → each token sees only itself
○ n_sink=S, window=0 → all tokens see all sink tokens (full attention)
○ non-negative output when V is non-negative (the attention output is a convex combination of the rows of V)
○ window locality: n_sink=0, window=1 → output = V (verified in the snippet below)
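
A quick self-check of the window-locality case, continuing from the sketch above (illustrative only; the actual graded tests may differ):

rng = np.random.default_rng(42)
Q = rng.standard_normal((5, 4))
K = rng.standard_normal((5, 4))
V = rng.standard_normal((5, 4))
# With no sinks and a window of 1, each token attends only to itself,
# so the attention output reproduces V.
out = attention_with_sinks(Q, K, V, n_sink=0, window_size=1)
assert np.allclose(out, V)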