Prefix Cache Hit Rate

A prefix cache stores KV blocks keyed by token-prefix. When a new prompt arrives, contiguous block-aligned prefix matches are reused. Compute the block-level cache hit rate across a list of prompts processed in order.

Signature: def prefix_cache_hit_rate(prompts: list, block_size: int = 16) -> float

prompts is a list of token-id lists. Tokenize each prompt into blocks of size block_size (drop the trailing partial block). Walk blocks left-to-right; a block is a hit if its prefix tuple (all blocks up to and including this one, as a tuple of tuples) was already seen during a previous prompt.

Return: total_hits / total_blocks (0.0 if total_blocks is 0).

Example:

prompts = [[1]*32, [1]*32 + [2]*16], block_size=16. First prompt: 2 blocks, no hits. Second: first 2 blocks match → 2 hits, third doesn't. 2 hits / 5 blocks = 0.4.

Math

hit rate = \frac{\sum _{p} cached blocks ( p )}{\sum _{p} total blocks ( p )}

Asked at

Signature: def prefix_cache_hit_rate(prompts: list, block_size: int = 16) -> float

Return: total_hits / total_blocks (0.0 if total_blocks is 0).

Example:

prompts = [[1]*32, [1]*32 + [2]*16], block_size=16. First prompt: 2 blocks, no hits. Second: first 2 blocks match → 2 hits, third doesn't. 2 hits / 5 blocks = 0.4.

Math

hit rate = \frac{\sum _{p} cached blocks ( p )}{\sum _{p} total blocks ( p )}

Asked at

145. Prefix Cache Hit Rate

145. Prefix Cache Hit Rate