vLLM's Paged Attention manages the KV cache using fixed-size pages (blocks), analogous to OS virtual memory. Instead of pre-allocating contiguous memory for each sequence's full context, a block table maps logical block indices to physical block indices, eliminating internal fragmentation and enabling efficient memory sharing.
Given:
Your task: walk the block table, gather the physical K/V slices for each logical block (taking only as many tokens as the actual sequence length covers in that block), concatenate them in logical order to reconstruct the full K and V for the sequence, then run standard scaled dot-product softmax attention against the query. See the math reference below.
Signature: def paged_attention(query, block_table, physical_k_blocks, physical_v_blocks, block_size, seq_len)
query: (1, d_k)block_table: (num_logical_blocks,) intphysical_k_blocks: (num_physical_blocks, block_size, d_k)physical_v_blocks: (num_physical_blocks, block_size, d_v)block_size: intseq_len: intMath
Asked at
import numpy as np
def paged_attention(...):
pass
Premium problem
Free accounts include problems #1–20. Upgrade to unlock the editor, hidden test cases, and reference solutions for every problem.
Already premium?