
Prefix Caching (Prompt KV Reuse)

Implement a prefix cache lookup that finds the longest cached prefix for a new request, enabling KV cache reuse across requests with shared system prompts.

Signature: def prefix_cache_lookup(new_tokens, cache)

  • new_tokens: list of token IDs for the new request
  • cache: list of cached entries — each entry is a dict {"tokens": [int, ...]} (variable-length token sequences)
  • Returns: (cache_hit_length, tokens_to_compute)
    • cache_hit_length: int — how many tokens can be reused from cache
    • tokens_to_compute: list — the remaining tokens that need KV computation
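
For concreteness, here is one input/output pair with made-up token IDs (the values are illustrative, not part of the problem statement):

new_tokens = [1, 2, 3, 99]
cache = [{"tokens": [1, 2, 3, 4, 5]}]
# Expected result: (3, [99]) -- the first 3 tokens are reused; only token 99 needs KV computation.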

Algorithm: Longest Prefix Match

def prefix_cache_lookup(new_tokens, cache):
    best_match = 0
    for entry in cache:
        cached = entry["tokens"]  # each cache entry is a dict {"tokens": [int, ...]}
        # Count matching tokens from the start of the cached sequence
        match_len = 0
        for i in range(min(len(cached), len(new_tokens))):
            if cached[i] == new_tokens[i]:
                match_len += 1
            else:
                break
        best_match = max(best_match, match_len)

    cache_hit_length = best_match
    tokens_to_compute = new_tokens[best_match:]
    return cache_hit_length, tokens_to_compute
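
A quick usage check of the function above (token values are illustrative):

hit, to_compute = prefix_cache_lookup([1, 2, 3, 99],
                                      [{"tokens": [1, 2, 3, 4, 5]}, {"tokens": [1, 7]}])
print(hit, to_compute)  # 3 [99]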

Real-World Impact

With a 500-token system prompt shared across 1000 requests/minute, prefix caching computes the KV cache for those 500 tokens once and reuses it for every subsequent request, eliminating essentially all prefill compute for the shared portion (roughly a 1000x reduction on that prefix at this request rate). SGLang implements this idea with RadixAttention, a radix tree over token sequences; vLLM offers automatic prefix caching that reuses KV-cache blocks keyed by hashes of their token prefixes.
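
A plain trie (an uncompressed radix tree) is enough to sketch the idea; the names below are illustrative and are not the actual SGLang or vLLM data structures. Walking the trie finds the longest cached prefix in time proportional to the prefix length rather than to the number of cached sequences:

class TrieNode:
    def __init__(self):
        self.children = {}  # token ID -> child TrieNode

def insert(root, tokens):
    # Add a token sequence to the prefix tree
    node = root
    for t in tokens:
        node = node.children.setdefault(t, TrieNode())

def longest_cached_prefix(root, tokens):
    # Walk the tree token-by-token until the path diverges
    node, depth = root, 0
    for t in tokens:
        if t not in node.children:
            break
        node = node.children[t]
        depth += 1
    return depth  # number of leading tokens whose KV cache can be reused

root = TrieNode()
insert(root, [1, 2, 3, 4, 5])                       # cache a sequence, e.g. a shared system prompt
print(longest_cached_prefix(root, [1, 2, 3, 99]))   # 3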

Test Results

  • exact prefix match: 3 cached tokens reused
  • no match: all tokens must be computed
  • full match: entire sequence is cached
  • pick longest among multiple partial matches
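
The listed cases can be exercised with a small self-check; the token values below are made up, and the last case is reconstructed only from its name:

def test_prefix_cache_lookup():
    # exact prefix match: 3 cached tokens reused
    assert prefix_cache_lookup([1, 2, 3, 9], [{"tokens": [1, 2, 3]}]) == (3, [9])
    # no match: all tokens must be computed
    assert prefix_cache_lookup([7, 8], [{"tokens": [1, 2, 3]}]) == (0, [7, 8])
    # full match: entire sequence is cached
    assert prefix_cache_lookup([1, 2, 3], [{"tokens": [1, 2, 3, 4]}]) == (3, [])
    # pick longest among multiple partial matches
    assert prefix_cache_lookup([1, 2, 3, 4],
                               [{"tokens": [1, 9]}, {"tokens": [1, 2, 3, 7]}]) == (3, [4])

test_prefix_cache_lookup()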