Paged Attention (vLLM)

vLLM's Paged Attention manages the KV cache using fixed-size pages (blocks), analogous to OS virtual memory. Instead of pre-allocating contiguous memory for each sequence's full context, a block table maps logical block indices to physical block indices, eliminating internal fragmentation and enabling efficient memory sharing.

Given:

A block table mapping logical → physical block indices
Physical K/V blocks (all blocks the GPU holds)
A query, block size, and actual sequence length

Your task: walk the block table, gather the physical K/V slices for each logical block (taking only as many tokens as the actual sequence length covers in that block), concatenate them in logical order to reconstruct the full K and V for the sequence, then run standard scaled dot-product softmax attention against the query. See the math reference below.

Signature: def paged_attention(query, block_table, physical_k_blocks, physical_v_blocks, block_size, seq_len)

query: (1, d_k)
block_table: (num_logical_blocks,) int
physical_k_blocks: (num_physical_blocks, block_size, d_k)
physical_v_blocks: (num_physical_blocks, block_size, d_v)
block_size: int
seq_len: int
Returns: (1, d_v)

Math

K V out = gather (physical_k, block_table) [: s] = gather (physical_v, block_table) [: s] = softmax (\frac{q K ^{⊤}}{d _{k}}) V

Asked at

Given:

A block table mapping logical → physical block indices
Physical K/V blocks (all blocks the GPU holds)
A query, block size, and actual sequence length

Signature: def paged_attention(query, block_table, physical_k_blocks, physical_v_blocks, block_size, seq_len)

query: (1, d_k)
block_table: (num_logical_blocks,) int
physical_k_blocks: (num_physical_blocks, block_size, d_k)
physical_v_blocks: (num_physical_blocks, block_size, d_v)
block_size: int
seq_len: int
Returns: (1, d_v)

Math

K V out = gather (physical_k, block_table) [: s] = gather (physical_v, block_table) [: s] = softmax (\frac{q K ^{⊤}}{d _{k}}) V

Asked at

66. Paged Attention (vLLM)

66. Paged Attention (vLLM)