vLLM's PagedAttention manages the KV cache in fixed-size blocks (pages), analogous to OS virtual memory. Instead of pre-allocating contiguous memory for each sequence's full context, a block table maps the sequence's logical block indices to physical block indices. This eliminates external fragmentation, caps internal fragmentation at one partially filled block per sequence, and enables efficient memory sharing across sequences.
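As a concrete illustration (the numbers here are hypothetical, not part of the problem inputs): with block_size = 4, a 10-token sequence spans logical blocks 0-2, which the block table may scatter anywhere in the physical pool:

    block_size = 4
    seq_len = 10
    block_table = [7, 2, 5]   # logical block i is stored in physical block block_table[i]
    # Token t's key row lives at physical_k_blocks[block_table[t // block_size], t % block_size].
    # Logical block 2 holds only seq_len - 2 * block_size = 2 valid tokens; the remaining
    # slots of physical block 5 are unused, so waste is capped at one block per sequence.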
Given the inputs listed under the signature below, your task is to look up the correct physical blocks, reconstruct the full K and V for the sequence, then compute softmax attention.
For each logical block index `logical`:
    phys = block_table[logical]
    tokens = min(block_size, seq_len - logical * block_size)   # the last block may be partially filled
    append physical_k_blocks[phys, :tokens] to K_seq
    append physical_v_blocks[phys, :tokens] to V_seq
Then, with K_seq and V_seq concatenated along the token axis:
    scores = query @ K_seq.T / sqrt(d_k)
    weights = softmax(scores)
    output = weights @ V_seq
Signature: def paged_attention(query, block_table, physical_k_blocks, physical_v_blocks, block_size, seq_len)
query: (1, d_k)
block_table: (num_logical_blocks,) int
physical_k_blocks: (num_physical_blocks, block_size, d_k)
physical_v_blocks: (num_physical_blocks, block_size, d_v)
block_size: int
seq_len: int
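A minimal reference sketch of one way to implement this signature, assuming NumPy arrays and a numerically stable softmax (neither is mandated by the problem statement):

    import numpy as np

    def paged_attention(query, block_table, physical_k_blocks, physical_v_blocks,
                        block_size, seq_len):
        # Gather only the valid rows of each logical block from the physical pools.
        k_parts, v_parts = [], []
        for logical, phys in enumerate(block_table):
            tokens = min(block_size, seq_len - logical * block_size)
            k_parts.append(physical_k_blocks[phys, :tokens])
            v_parts.append(physical_v_blocks[phys, :tokens])
        K_seq = np.concatenate(k_parts, axis=0)  # (seq_len, d_k)
        V_seq = np.concatenate(v_parts, axis=0)  # (seq_len, d_v)
        # Scaled dot-product attention over the reconstructed sequence.
        scores = query @ K_seq.T / np.sqrt(query.shape[-1])  # (1, seq_len)
        scores -= scores.max(axis=-1, keepdims=True)         # guard against overflow
        weights = np.exp(scores)
        weights /= weights.sum(axis=-1, keepdims=True)
        return weights @ V_seq                               # (1, d_v)

A toy call (shapes chosen arbitrarily) to sanity-check the output shape:

    rng = np.random.default_rng(0)
    pk = rng.normal(size=(8, 4, 16))   # 8 physical blocks, block_size 4, d_k 16
    pv = rng.normal(size=(8, 4, 16))
    out = paged_attention(rng.normal(size=(1, 16)), np.array([7, 2, 5]), pk, pv, 4, 10)
    assert out.shape == (1, 16)

Note that a production kernel (e.g., vLLM's CUDA kernel) reads blocks in place rather than materializing K_seq and V_seq; the concatenation here trades memory for clarity.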