Implement attention with sinks as described in StreamingLLM (Xiao et al. 2023). This enables LLMs to run on sequences longer than their training context window by keeping only:
- the first n_sink tokens (called attention sinks)
- the most recent window_size tokens

Signature: def attention_with_sinks(Q, K, V, n_sink, window_size)

Arguments:
- Q, K, V: (S, d) — query/key/value matrices
- n_sink: number of initial sink tokens to always keep
- window_size: number of recent tokens in the sliding window

Returns:
- (S, d) — attention output

For query position i, the set of visible key positions is:
visible = {j : j < n_sink} ∪ {j : max(0, i - window_size + 1) ≤ j ≤ i}
Build an additive mask: mask[i, j] = 0 if j is visible from i, -1e9 otherwise.
Then compute: softmax((Q @ K.T / sqrt(d)) + mask) @ V, with the softmax taken over the key dimension.
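A minimal NumPy sketch of this pipeline, following the visible-set definition and mask convention above; the vectorized mask construction and the max-shifted softmax are implementation choices, not requirements of the statement:

```python
import numpy as np

def attention_with_sinks(Q, K, V, n_sink, window_size):
    """Sink + sliding-window attention per the spec above; returns (S, d)."""
    S, d = Q.shape

    # Visible keys: j < n_sink, or max(0, i - window_size + 1) <= j <= i.
    i = np.arange(S)[:, None]   # query positions, shape (S, 1)
    j = np.arange(S)[None, :]   # key positions, shape (1, S)
    visible = (j < n_sink) | ((j >= np.maximum(0, i - window_size + 1)) & (j <= i))

    # Additive mask: 0 where visible, -1e9 elsewhere.
    mask = np.where(visible, 0.0, -1e9)

    # Scaled dot-product logits with the mask added before the softmax.
    scores = Q @ K.T / np.sqrt(d) + mask

    # Softmax over the key dimension, shifted by the row max for stability.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V
```

Called on random (S, d) inputs, e.g. attention_with_sinks(Q, K, V, n_sink=2, window_size=3), this returns an (S, d) array in which each query attends to at most n_sink + window_size keys.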
The first few tokens receive disproportionately high attention (even if semantically irrelevant) because they are always in the context window during training. Keeping them during streaming inference stabilizes the attention distribution and prevents performance degradation on long sequences.
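To see the effect concretely, it can help to print the visibility pattern itself. A small sketch; the sizes S=8, n_sink=2, window_size=3 are arbitrary values chosen for illustration:

```python
import numpy as np

S, n_sink, window_size = 8, 2, 3  # hypothetical example sizes
i = np.arange(S)[:, None]   # query positions
j = np.arange(S)[None, :]   # key positions
visible = (j < n_sink) | ((j >= np.maximum(0, i - window_size + 1)) & (j <= i))
print(visible.astype(int))
# The last row is [1 1 0 0 0 1 1 1]: the two sink columns stay visible
# to every query, while older keys fall out of the sliding window.
```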