TorchedUp
LearnBetaProblemsSystem DesignSoonPremium
TorchedUp
LearnBetaProblemsSystem DesignSoonPremium
←

197. Prefix Caching (Prompt KV Reuse)

Hard

Implement a prefix cache lookup that finds the longest cached prefix for a new request, enabling KV cache reuse across requests with shared system prompts.

Signature: def prefix_cache_lookup(new_tokens, cache)

  • new_tokens: list of token IDs for the new request
  • cache: list of cached entries — each entry is a dict {"tokens": [int, ...]} (variable-length token sequences)
  • Returns: (cache_hit_length, tokens_to_compute)
    • cache_hit_length: int — how many tokens can be reused from cache
    • tokens_to_compute: list — the remaining tokens that need KV computation

Algorithm: Longest Prefix Match

For each cached sequence, count how many tokens match new_tokens starting from index 0 and stopping at the first mismatch (a maximal common prefix, not a substring or LCS). Take the maximum match length over all cache entries — that's the cache hit length. The remaining tokens that still need KV computation are the suffix of new_tokens beyond that hit length. If the cache is empty or nothing matches, the hit length is 0 and tokens_to_compute equals the full input.

Real-World Impact

With a 500-token system prompt shared across 1000 requests/minute, prefix caching reuses those 500 tokens' KV cache for every request — reducing prefill compute by 500× for the shared portion. vLLM implements this via RadixAttention (a radix tree over token sequences).

Asked at

NumPy

import numpy as np

 

def prefix_cache_lookup(...):

    pass

🔒

Premium problem

Free accounts include problems #1–20. Upgrade to unlock the editor, hidden test cases, and reference solutions for every problem.

Upgrade to PremiumBack to problems

Already premium?