Paged Attention (vLLM) · Hard

vLLM's Paged Attention manages the KV cache in fixed-size pages (blocks), analogous to virtual memory in an operating system. Instead of pre-allocating contiguous memory for each sequence's full context, a block table maps logical block indices to physical block indices. This eliminates external fragmentation, confines internal fragmentation to the last, partially filled block of each sequence, and enables efficient memory sharing across sequences.
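
For instance (numbers are illustrative, not taken from the tests): with block_size = 2 and seq_len = 5, a sequence spans three logical blocks, and a block table like the one below scatters them across non-contiguous physical blocks:

    block_size  = 2
    seq_len     = 5            # 5 tokens -> 3 logical blocks (last one half full)
    block_table = [7, 2, 5]    # logical block i -> physical block block_table[i]
    # tokens 0-1 live in physical block 7, tokens 2-3 in block 2,
    # token 4 in the first row of block 5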

Given:

  • A block table mapping logical → physical block indices
  • Physical K/V blocks (all blocks the GPU holds)
  • A query, block size, and actual sequence length

Your task: look up the correct physical blocks, reconstruct the full K/V for the sequence, then compute softmax attention.

For each logical block:
    phys   = block_table[logical]
    tokens = min(block_size, seq_len - logical*block_size)
    append physical_k_blocks[phys, :tokens] to K_seq
    append physical_v_blocks[phys, :tokens] to V_seq
scores  = query @ K_seq.T / sqrt(d_k)
weights = softmax(scores)
output  = weights @ V_seq
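
Here softmax is the standard row-wise softmax. A numerically stable variant (the max-subtraction trick is a common convention, assumed here rather than required by the statement) could be:

    import numpy as np

    def softmax(x):
        # Subtracting the row max changes nothing mathematically but
        # prevents overflow in exp for large score magnitudes.
        e = np.exp(x - np.max(x, axis=-1, keepdims=True))
        return e / e.sum(axis=-1, keepdims=True)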

Signature: def paged_attention(query, block_table, physical_k_blocks, physical_v_blocks, block_size, seq_len)

  • query: (1, d_k)
  • block_table: (num_logical_blocks,) int
  • physical_k_blocks: (num_physical_blocks, block_size, d_k)
  • physical_v_blocks: (num_physical_blocks, block_size, d_v)
  • block_size: int
  • seq_len: int
  • Returns: (1, d_v)
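
A minimal numpy sketch of one possible solution, assuming the shapes above (this is not the official reference solution):

    import numpy as np

    def paged_attention(query, block_table, physical_k_blocks,
                        physical_v_blocks, block_size, seq_len):
        # Gather only the rows that belong to this sequence, block by block.
        k_parts, v_parts = [], []
        for logical, phys in enumerate(block_table):
            tokens = min(block_size, seq_len - logical * block_size)
            if tokens <= 0:
                break  # any further logical blocks hold no tokens
            k_parts.append(physical_k_blocks[phys, :tokens])
            v_parts.append(physical_v_blocks[phys, :tokens])

        k_seq = np.concatenate(k_parts, axis=0)  # (seq_len, d_k)
        v_seq = np.concatenate(v_parts, axis=0)  # (seq_len, d_v)

        # Scaled dot-product attention with a numerically stable softmax.
        d_k = query.shape[-1]
        scores = query @ k_seq.T / np.sqrt(d_k)          # (1, seq_len)
        scores = scores - scores.max(axis=-1, keepdims=True)
        weights = np.exp(scores)
        weights /= weights.sum(axis=-1, keepdims=True)   # rows sum to 1
        return weights @ v_seq                           # (1, d_v)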

Language: Python (numpy)

Test Results

  • seed 42, block_size=2, seq_len=5, scattered blocks
  • seed 7, block_size=4, seq_len=4, single full block
  • seed 13, block_size=3, seq_len=7, partial last block
  • non-negative output when all V entries are non-negative
  • V_seq = I (identity) → the output equals the attention-weight row, which sums to 1
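
As a quick self-check mirroring the last property (seed, shapes, and block layout are my own choices, not the hidden test data): laying identity rows into the physical V blocks makes the reconstructed V_seq equal I, so the output of the paged_attention sketch above is the attention-weight row and must sum to 1.

    import numpy as np

    rng = np.random.default_rng(0)
    d_k, block_size, seq_len = 4, 2, 5
    num_logical  = -(-seq_len // block_size)   # ceil division -> 3 logical blocks
    num_physical = 8

    block_table = rng.choice(num_physical, size=num_logical, replace=False)
    physical_k  = rng.standard_normal((num_physical, block_size, d_k))

    # Place identity rows so the reconstructed V_seq equals I_{seq_len}.
    physical_v = np.zeros((num_physical, block_size, seq_len))
    eye = np.eye(seq_len)
    for logical, phys in enumerate(block_table):
        tokens = min(block_size, seq_len - logical * block_size)
        start = logical * block_size
        physical_v[phys, :tokens] = eye[start:start + tokens]

    out = paged_attention(query=rng.standard_normal((1, d_k)),
                          block_table=block_table,
                          physical_k_blocks=physical_k,
                          physical_v_blocks=physical_v,
                          block_size=block_size, seq_len=seq_len)
    assert out.shape == (1, seq_len)
    assert np.isclose(out.sum(), 1.0)  # output is the weight row; sums to 1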