Paged Attention (vLLM) · Hard

vLLM's Paged Attention manages the KV cache in fixed-size pages (blocks), analogous to virtual memory in an operating system. Instead of pre-allocating contiguous memory for each sequence's full context, a block table maps logical block indices to physical block indices. This eliminates external fragmentation, confines internal fragmentation to the last, partially filled block of each sequence, and enables efficient memory sharing across sequences.
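
For instance (numbers are illustrative, not taken from the tests): with block_size = 2 and seq_len = 5, a sequence spans three logical blocks, and a block table like the one below scatters them across non-contiguous physical blocks:

    block_size  = 2
    seq_len     = 5            # 5 tokens -> 3 logical blocks (last one half full)
    block_table = [7, 2, 5]    # logical block i -> physical block block_table[i]
    # tokens 0-1 live in physical block 7, tokens 2-3 in block 2,
    # token 4 in the first row of block 5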

Given:

  • A block table mapping logical → physical block indices
  • Physical K/V blocks (all blocks the GPU holds)
  • A query, block size, and actual sequence length

Your task: look up the correct physical blocks, reconstruct the full K/V for the sequence, then compute softmax attention.

For each logical block:
    phys   = block_table[logical]
    tokens = min(block_size, seq_len - logical*block_size)
    append physical_k_blocks[phys, :tokens] to K_seq
    append physical_v_blocks[phys, :tokens] to V_seq
scores  = query @ K_seq.T / sqrt(d_k)
weights = softmax(scores)
output  = weights @ V_seq
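
Here softmax is the standard row-wise softmax. A numerically stable variant (the max-subtraction trick is a common convention, assumed here rather than required by the statement) could be:

    import numpy as np

    def softmax(x):
        # Subtracting the row max changes nothing mathematically but
        # prevents overflow in exp for large score magnitudes.
        e = np.exp(x - np.max(x, axis=-1, keepdims=True))
        return e / e.sum(axis=-1, keepdims=True)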

Signature: def paged_attention(query, block_table, physical_k_blocks, physical_v_blocks, block_size, seq_len)

  • query: (1, d_k)
  • block_table: (num_logical_blocks,) int
  • physical_k_blocks: (num_physical_blocks, block_size, d_k)
  • physical_v_blocks: (num_physical_blocks, block_size, d_v)
  • block_size: int
  • seq_len: int
  • Returns: (1, d_v)
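
A minimal numpy sketch of one possible solution, assuming the shapes above (this is not the official reference solution):

    import numpy as np

    def paged_attention(query, block_table, physical_k_blocks,
                        physical_v_blocks, block_size, seq_len):
        # Gather only the rows that belong to this sequence, block by block.
        k_parts, v_parts = [], []
        for logical, phys in enumerate(block_table):
            tokens = min(block_size, seq_len - logical * block_size)
            if tokens <= 0:
                break  # any further logical blocks hold no tokens
            k_parts.append(physical_k_blocks[phys, :tokens])
            v_parts.append(physical_v_blocks[phys, :tokens])

        k_seq = np.concatenate(k_parts, axis=0)  # (seq_len, d_k)
        v_seq = np.concatenate(v_parts, axis=0)  # (seq_len, d_v)

        # Scaled dot-product attention with a numerically stable softmax.
        d_k = query.shape[-1]
        scores = query @ k_seq.T / np.sqrt(d_k)          # (1, seq_len)
        scores = scores - scores.max(axis=-1, keepdims=True)
        weights = np.exp(scores)
        weights /= weights.sum(axis=-1, keepdims=True)   # rows sum to 1
        return weights @ v_seq                           # (1, d_v)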

Language: Python (numpy)

Test Results

  • seed 42, block_size=2, seq_len=5, scattered blocks
  • seed 7, block_size=4, seq_len=4, single full block
  • seed 13, block_size=3, seq_len=7, partial last block
  • non-negative output when all V entries are non-negative
  • V_seq = I (identity) → the output equals the attention-weight row, which sums to 1
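
As a quick self-check mirroring the last property (seed, shapes, and block layout are my own choices, not the hidden test data): laying identity rows into the physical V blocks makes the reconstructed V_seq equal I, so the output of the paged_attention sketch above is the attention-weight row and must sum to 1.

    import numpy as np

    rng = np.random.default_rng(0)
    d_k, block_size, seq_len = 4, 2, 5
    num_logical  = -(-seq_len // block_size)   # ceil division -> 3 logical blocks
    num_physical = 8

    block_table = rng.choice(num_physical, size=num_logical, replace=False)
    physical_k  = rng.standard_normal((num_physical, block_size, d_k))

    # Place identity rows so the reconstructed V_seq equals I_{seq_len}.
    physical_v = np.zeros((num_physical, block_size, seq_len))
    eye = np.eye(seq_len)
    for logical, phys in enumerate(block_table):
        tokens = min(block_size, seq_len - logical * block_size)
        start = logical * block_size
        physical_v[phys, :tokens] = eye[start:start + tokens]

    out = paged_attention(query=rng.standard_normal((1, d_k)),
                          block_table=block_table,
                          physical_k_blocks=physical_k,
                          physical_v_blocks=physical_v,
                          block_size=block_size, seq_len=seq_len)
    assert out.shape == (1, seq_len)
    assert np.isclose(out.sum(), 1.0)  # output is the weight row; sums to 1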