TorchedUp
LearnBetaProblemsSystem DesignSoonPremium
TorchedUp
LearnBetaProblemsSystem DesignSoonPremium
←

66. Paged Attention (vLLM)

Hard

vLLM's Paged Attention manages the KV cache using fixed-size pages (blocks), analogous to OS virtual memory. Instead of pre-allocating contiguous memory for each sequence's full context, a block table maps logical block indices to physical block indices, eliminating internal fragmentation and enabling efficient memory sharing.

Given:

  • A block table mapping logical → physical block indices
  • Physical K/V blocks (all blocks the GPU holds)
  • A query, block size, and actual sequence length

Your task: walk the block table, gather the physical K/V slices for each logical block (taking only as many tokens as the actual sequence length covers in that block), concatenate them in logical order to reconstruct the full K and V for the sequence, then run standard scaled dot-product softmax attention against the query. See the math reference below.

Signature: def paged_attention(query, block_table, physical_k_blocks, physical_v_blocks, block_size, seq_len)

  • query: (1, d_k)
  • block_table: (num_logical_blocks,) int
  • physical_k_blocks: (num_physical_blocks, block_size, d_k)
  • physical_v_blocks: (num_physical_blocks, block_size, d_v)
  • block_size: int
  • seq_len: int
  • Returns: (1, d_v)

Math

KVout​=gather(physical_k, block_table)[:s]=gather(physical_v, block_table)[:s]=softmax(dk​​qK⊤​)V​

Asked at

NumPy

import numpy as np

 

def paged_attention(...):

    pass

🔒

Premium problem

Free accounts include problems #1–20. Upgrade to unlock the editor, hidden test cases, and reference solutions for every problem.

Upgrade to PremiumBack to problems

Already premium?