Implement attention with sinks as described in StreamingLLM (Xiao et al. 2023). This enables LLMs to run on sequences longer than their training context window by keeping only:
- the first n_sink tokens (called attention sinks)
- the most recent window_size tokens

Signature: def attention_with_sinks(Q, K, V, n_sink, window_size)

Arguments:
- Q, K, V: (S, d) — query/key/value matrices
- n_sink: number of initial sink tokens to always keep
- window_size: number of recent tokens in the sliding window

Returns:
- (S, d) — attention output

For query position i, the set of visible key positions is:
visible = {j : j < n_sink} ∪ {j : max(0, i - window_size + 1) ≤ j ≤ i}
Build an additive mask: mask[i, j] = 0 if j is visible from i, -1e9 otherwise.
Then compute: softmax((Q @ K.T / sqrt(d)) + mask) @ V, with the softmax taken over the key dimension.
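A minimal NumPy sketch of this pipeline, following the visible-set definition and mask convention above; the vectorized mask construction and the max-shifted softmax are implementation choices, not requirements of the statement:

```python
import numpy as np

def attention_with_sinks(Q, K, V, n_sink, window_size):
    """Sink + sliding-window attention per the spec above; returns (S, d)."""
    S, d = Q.shape

    # Visible keys: j < n_sink, or max(0, i - window_size + 1) <= j <= i.
    i = np.arange(S)[:, None]   # query positions, shape (S, 1)
    j = np.arange(S)[None, :]   # key positions, shape (1, S)
    visible = (j < n_sink) | ((j >= np.maximum(0, i - window_size + 1)) & (j <= i))

    # Additive mask: 0 where visible, -1e9 elsewhere.
    mask = np.where(visible, 0.0, -1e9)

    # Scaled dot-product logits with the mask added before the softmax.
    scores = Q @ K.T / np.sqrt(d) + mask

    # Softmax over the key dimension, shifted by the row max for stability.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V
```

Called on random (S, d) inputs, e.g. attention_with_sinks(Q, K, V, n_sink=2, window_size=3), this returns an (S, d) array in which each query attends to at most n_sink + window_size keys.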
The first few tokens receive disproportionately high attention (even if semantically irrelevant) because they are always in the context window during training. Keeping them during streaming inference stabilizes the attention distribution and prevents performance degradation on long sequences.
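To see the effect concretely, it can help to print the visibility pattern itself. A small sketch; the sizes S=8, n_sink=2, window_size=3 are arbitrary values chosen for illustration:

```python
import numpy as np

S, n_sink, window_size = 8, 2, 3  # hypothetical example sizes
i = np.arange(S)[:, None]   # query positions
j = np.arange(S)[None, :]   # key positions
visible = (j < n_sink) | ((j >= np.maximum(0, i - window_size + 1)) & (j <= i))
print(visible.astype(int))
# The last row is [1 1 0 0 0 1 1 1]: the two sink columns stay visible
# to every query, while older keys fall out of the sliding window.
```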