TorchedUp

87. Attention Sinks (StreamingLLM)

Medium

Implement attention with sinks as described in StreamingLLM (Xiao et al. 2023). This enables LLMs to run on sequences longer than their training context window by keeping only:

  1. The first n_sink tokens (called attention sinks)
  2. A sliding window of the most recent window_size tokens
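To make the retention rule above concrete, here is a minimal sketch of which positions would remain after seq_len tokens have been processed. The helper name kept_positions and its seq_len argument are hypothetical and not part of the problem's required signature.

```python
import numpy as np

def kept_positions(seq_len, n_sink, window_size):
    """Positions retained after seq_len tokens (hypothetical helper for illustration)."""
    # The first n_sink tokens are always kept (fewer if the sequence is still short).
    sinks = np.arange(min(n_sink, seq_len))
    # The most recent window_size tokens form the sliding window.
    recent = np.arange(max(0, seq_len - window_size), seq_len)
    # union1d sorts and de-duplicates, which matters while the two ranges overlap.
    return np.union1d(sinks, recent)

print(kept_positions(10, n_sink=2, window_size=3))  # [0 1 7 8 9]
print(kept_positions(3, n_sink=2, window_size=3))   # [0 1 2]
```

Everything between the sinks and the window (positions 2 through 6 in the first call) is dropped, so the retained set never grows beyond n_sink + window_size entries.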

Signature: def attention_with_sinks(Q, K, V, n_sink, window_size)

  • Q, K, V: (S, d) — query/key/value matrices
  • n_sink: number of initial sink tokens to always keep
  • window_size: number of recent tokens in the sliding window
  • Returns: (S, d) — attention output

Attention Mask

For query position i, the set of visible key positions is:

visible = {j : j < n_sink} ∪ {j : max(0, i - window_size + 1) ≤ j ≤ i}

Build a mask: mask[i, j] = 0 if visible, -1e9 otherwise.
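One way to build this mask without Python loops is NumPy broadcasting over a column of query indices and a row of key indices. This is a sketch, not the reference solution; the helper name build_sink_mask is hypothetical. It follows the visibility rule literally, so every sink position is visible to every query, including queries at i < n_sink. If strict causality is wanted for those early rows, the sink term could be combined with j <= i, which is an assumption beyond the statement.

```python
import numpy as np

def build_sink_mask(S, n_sink, window_size):
    """Additive mask: 0 where key j is visible to query i, -1e9 elsewhere (hypothetical helper)."""
    i = np.arange(S)[:, None]  # query positions as a column, shape (S, 1)
    j = np.arange(S)[None, :]  # key positions as a row, shape (1, S)
    is_sink = j < n_sink
    in_window = (j <= i) & (j >= np.maximum(0, i - window_size + 1))
    visible = is_sink | in_window  # broadcasts to shape (S, S)
    return np.where(visible, 0.0, -1e9)

mask = build_sink_mask(S=6, n_sink=1, window_size=2)
print((mask == 0).astype(int))
# [[1 0 0 0 0 0]
#  [1 1 0 0 0 0]
#  [1 1 1 0 0 0]
#  [1 0 1 1 0 0]
#  [1 0 0 1 1 0]
#  [1 0 0 0 1 1]]
```

Column 0 stays visible in every row (the sink), while the window of width 2 slides along the diagonal.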

Then: softmax((Q @ K.T / sqrt(d)) + mask) @ V, where the softmax is applied along each row (over key positions j).
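Putting the pieces together, a minimal sketch of the full function under the stated signature might look like the following. It is one possible implementation rather than the site's reference solution. The max-subtraction before exponentiating is an added numerical-stability step that does not change the softmax result.

```python
import numpy as np

def attention_with_sinks(Q, K, V, n_sink, window_size):
    S, d = Q.shape
    i = np.arange(S)[:, None]  # query positions, shape (S, 1)
    j = np.arange(S)[None, :]  # key positions, shape (1, S)
    # Visible keys: the sinks, plus the causal sliding window ending at i.
    visible = (j < n_sink) | ((j <= i) & (j >= np.maximum(0, i - window_size + 1)))
    mask = np.where(visible, 0.0, -1e9)

    scores = Q @ K.T / np.sqrt(d) + mask
    # Row-wise softmax; subtracting the row max avoids overflow in exp.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
Q, K, V = rng.standard_normal((3, 8, 4))  # three (S=8, d=4) matrices
out = attention_with_sinks(Q, K, V, n_sink=2, window_size=3)
print(out.shape)  # (8, 4)
```

With n_sink=1 and window_size=1, query 0 can see only key 0, so the first output row equals V[0]. That makes a convenient sanity check.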

Why Sinks?

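The first few tokens receive disproportionately high attention, even when they are semantically irrelevant. Under causal attention they are visible to every later token during training, and because softmax weights must sum to 1, the model learns to place surplus attention on them. Keeping them during streaming inference stabilizes the attention distribution and prevents the performance degradation that occurs on long sequences once they are evicted from the window.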


NumPy

import numpy as np

def attention_with_sinks(Q, K, V, n_sink, window_size):
    pass
