Implement a prefix cache lookup that finds the longest cached prefix for a new request, enabling KV cache reuse across requests with shared system prompts.
Signature: def prefix_cache_lookup(new_tokens, cache)
new_tokens: list of token IDs for the new requestcache: list of cached entries — each entry is a dict {"tokens": [int, ...]} (variable-length token sequences)(cache_hit_length, tokens_to_compute)
cache_hit_length: int — how many tokens can be reused from cachetokens_to_compute: list — the remaining tokens that need KV computationFor each cached sequence, count how many tokens match new_tokens starting from index 0 and stopping at the first mismatch (a maximal common prefix, not a substring or LCS). Take the maximum match length over all cache entries — that's the cache hit length. The remaining tokens that still need KV computation are the suffix of new_tokens beyond that hit length. If the cache is empty or nothing matches, the hit length is 0 and tokens_to_compute equals the full input.
With a 500-token system prompt shared across 1000 requests/minute, prefix caching reuses those 500 tokens' KV cache for every request — reducing prefill compute by 500× for the shared portion. vLLM implements this via RadixAttention (a radix tree over token sequences).
Asked at
import numpy as np
def prefix_cache_lookup(...):
pass
Premium problem
Free accounts include problems #1–20. Upgrade to unlock the editor, hidden test cases, and reference solutions for every problem.
Already premium?