Implement a prefix cache lookup that finds the longest cached prefix for a new request, enabling KV cache reuse across requests with shared system prompts.
Signature: def prefix_cache_lookup(new_tokens, cache)
Inputs:
new_tokens: list of token IDs for the new request
cache: list of cached entries — each entry is a dict {"tokens": [int, ...]} (variable-length token sequences)

Returns: (cache_hit_length, tokens_to_compute)
cache_hit_length: int — how many tokens can be reused from cache
tokens_to_compute: list — the remaining tokens that need KV computation
def prefix_cache_lookup(new_tokens, cache):
    best_match = 0
    for entry in cache:
        sequence = entry["tokens"]
        # Count matching tokens from the start of both sequences
        match_len = 0
        for i in range(min(len(sequence), len(new_tokens))):
            if sequence[i] == new_tokens[i]:
                match_len += 1
            else:
                break
        best_match = max(best_match, match_len)
    cache_hit_length = best_match
    tokens_to_compute = new_tokens[best_match:]
    return cache_hit_length, tokens_to_compute
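A quick usage check with a toy cache; the token IDs below are invented purely for illustration:

system_prompt = [101, 2023, 2003, 1037]           # hypothetical shared system-prompt tokens
cache = [{"tokens": system_prompt + [9999]},       # earlier request with the same prefix
         {"tokens": [7, 8, 9]}]                    # unrelated entry
request = system_prompt + [4242, 4243]             # new request reusing the system prompt

hit, remaining = prefix_cache_lookup(request, cache)
print(hit, remaining)  # 4 [4242, 4243]: only the new tokens need prefill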
With a 500-token system prompt shared across 1000 requests per minute, prefix caching computes the KV cache for those 500 tokens once and reuses it on every subsequent request, effectively eliminating prefill compute for the shared portion. SGLang implements this idea with RadixAttention (a radix tree over token sequences); vLLM offers similar reuse through its automatic prefix caching.
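For intuition only, here is a minimal sketch of tree-based prefix matching in the spirit of RadixAttention. It is a simple trie with one token per node rather than a compressed radix tree, and real systems store token blocks and attach KV-cache handles to nodes; all names here are illustrative.

class TrieNode:
    """One node per token; children keyed by the next token ID (simplified sketch)."""
    def __init__(self):
        self.children = {}  # token_id -> TrieNode

def insert(root, tokens):
    # Register a cached token sequence so later requests can match against it.
    node = root
    for tok in tokens:
        node = node.children.setdefault(tok, TrieNode())

def longest_prefix(root, tokens):
    # Walk the tree as long as the request's tokens follow an existing path.
    node, depth = root, 0
    for tok in tokens:
        if tok not in node.children:
            break
        node = node.children[tok]
        depth += 1
    return depth  # number of leading tokens whose KV cache could be reused

Lookup cost is then proportional to the matched prefix length rather than to the number of cached sequences, which is what makes a tree attractive when many requests share prompts.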