TorchedUp
LearnBetaProblemsSystem DesignSoonPremium
TorchedUp
LearnBetaProblemsSystem DesignSoonPremium
←

29. KV Cache

Medium

During autoregressive generation, a language model generates one token at a time. Without caching, we'd recompute the keys and values for every past token at each step — O(N²) total compute.

The KV Cache stores the keys and values from all past tokens. For each new token, we:

  1. Compute the new key and value
  2. Append them to the cache
  3. Compute attention between the new query and the full cached key/value sequence

Implement kv_cache_attention: given the existing key-value cache (past_keys, past_values) and a new token's (new_key, new_value, query), append the new key/value to the cache and return the attention output for this token.

Signature: def kv_cache_attention(past_keys, past_values, new_key, new_value, query)

  • past_keys: (n_past, d_k)
  • past_values: (n_past, d_v)
  • new_key: (1, d_k) — current token's key
  • new_value: (1, d_v) — current token's value
  • query: (1, d_k) — current token's query
  • Returns: output of shape (1, d_v)

Scale by sqrt(d_k). Use standard softmax attention.

Math

Keys=[Kpast​knew​​],Values=[Vpast​vnew​​]Attention(q,K,V)=softmax(dk​​qKT​)V

Related problems

  • KV Cache in a Decoder LoopmediumNumPy
  • Paged Attention (vLLM)hardNumPy
  • KV Cache Quantization (INT8)mediumNumPy
  • Prefix Caching (Prompt KV Reuse)hardNumPy

Asked at

NumPy

import numpy as np

 

def kv_cache_attention(...):

    pass

🔒

Premium problem

Free accounts include problems #1–20. Upgrade to unlock the editor, hidden test cases, and reference solutions for every problem.

Upgrade to PremiumBack to problems

Already premium?