Implement a prefix cache lookup that finds the longest cached prefix for a new request, enabling KV cache reuse across requests with shared system prompts.
Signature: def prefix_cache_lookup(new_tokens, cache)
Inputs:
new_tokens: list of token IDs for the new request
cache: list of cached entries — each entry is a dict {"tokens": [int, ...]} (variable-length token sequences)

Returns: (cache_hit_length, tokens_to_compute)
cache_hit_length: int — how many tokens can be reused from cache
tokens_to_compute: list — the remaining tokens that need KV computation
def prefix_cache_lookup(new_tokens, cache):
    best_match = 0
    for entry in cache:
        sequence = entry["tokens"]
        # Count matching tokens from the start of both sequences
        match_len = 0
        for i in range(min(len(sequence), len(new_tokens))):
            if sequence[i] == new_tokens[i]:
                match_len += 1
            else:
                break
        best_match = max(best_match, match_len)
    cache_hit_length = best_match
    tokens_to_compute = new_tokens[best_match:]
    return cache_hit_length, tokens_to_compute
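A quick usage check with a toy cache; the token IDs below are invented purely for illustration:

system_prompt = [101, 2023, 2003, 1037]           # hypothetical shared system-prompt tokens
cache = [{"tokens": system_prompt + [9999]},       # earlier request with the same prefix
         {"tokens": [7, 8, 9]}]                    # unrelated entry
request = system_prompt + [4242, 4243]             # new request reusing the system prompt

hit, remaining = prefix_cache_lookup(request, cache)
print(hit, remaining)  # 4 [4242, 4243]: only the new tokens need prefill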
With a 500-token system prompt shared across 1000 requests per minute, prefix caching computes the KV cache for those 500 tokens once and reuses it on every subsequent request, effectively eliminating prefill compute for the shared portion. SGLang implements this idea with RadixAttention (a radix tree over token sequences); vLLM offers similar reuse through its automatic prefix caching.
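For intuition only, here is a minimal sketch of tree-based prefix matching in the spirit of RadixAttention. It is a simple trie with one token per node rather than a compressed radix tree, and real systems store token blocks and attach KV-cache handles to nodes; all names here are illustrative.

class TrieNode:
    """One node per token; children keyed by the next token ID (simplified sketch)."""
    def __init__(self):
        self.children = {}  # token_id -> TrieNode

def insert(root, tokens):
    # Register a cached token sequence so later requests can match against it.
    node = root
    for tok in tokens:
        node = node.children.setdefault(tok, TrieNode())

def longest_prefix(root, tokens):
    # Walk the tree as long as the request's tokens follow an existing path.
    node, depth = root, 0
    for tok in tokens:
        if tok not in node.children:
            break
        node = node.children[tok]
        depth += 1
    return depth  # number of leading tokens whose KV cache could be reused

Lookup cost is then proportional to the matched prefix length rather than to the number of cached sequences, which is what makes a tree attractive when many requests share prompts.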