
KV Cache (Medium)

During autoregressive generation, a language model produces one token at a time. Without caching, we'd recompute the keys and values for every past token at each step, which adds up to O(N²) redundant work over an N-token sequence.

The KV cache stores the keys and values from all past tokens. For each new token, we:

  1. Compute the new key and value
  2. Append them to the cache
  3. Compute attention between the new query and the full cached key/value sequence (see the sketch after this list)
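
As an illustration only, here is a minimal numpy sketch of these three steps inside a toy generation loop. The random q/k/v draws are hypothetical stand-ins for a real model's per-token projections; they are not part of the problem itself.

    import numpy as np

    rng = np.random.default_rng(0)
    d_k, d_v = 4, 4
    keys = np.empty((0, d_k))             # the cache starts empty
    values = np.empty((0, d_v))
    for step in range(5):                 # one iteration per generated token
        # Step 1: the new token's query, key, and value (random stand-ins).
        q = rng.normal(size=(1, d_k))
        k = rng.normal(size=(1, d_k))
        v = rng.normal(size=(1, d_v))
        # Step 2: append the new key and value to the cache.
        keys = np.concatenate([keys, k], axis=0)
        values = np.concatenate([values, v], axis=0)
        # Step 3: attend from the new query over all cached keys/values.
        scores = q @ keys.T / np.sqrt(d_k)      # (1, step + 1)
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()
        out = weights @ values                  # (1, d_v)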

Implement kv_cache_attention: given the existing key-value cache (past_keys, past_values) and the current token's new_key, new_value, and query, append the new key and value to the cache and return the attention output for this token.

Signature: def kv_cache_attention(past_keys, past_values, new_key, new_value, query)

  • past_keys: (n_past, d_k)
  • past_values: (n_past, d_v)
  • new_key: (1, d_k) — current token's key
  • new_value: (1, d_v) — current token's value
  • query: (1, d_k) — current token's query
  • Returns: output of shape (1, d_v)

Scale the attention scores by dividing by sqrt(d_k), and use standard softmax attention.
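
Below is one possible reference sketch in numpy, not the official solution: it appends the new key/value, computes scaled dot-product scores, and applies a numerically stable softmax. Since the stated signature returns only the attention output, the updated cache is not returned; a caller would append to its own copy of the cache, as the earlier loop sketch does.

    import numpy as np

    def kv_cache_attention(past_keys, past_values, new_key, new_value, query):
        # Append the new token's key and value to the cache.
        keys = np.concatenate([past_keys, new_key], axis=0)        # (n_past + 1, d_k)
        values = np.concatenate([past_values, new_value], axis=0)  # (n_past + 1, d_v)

        # Scaled dot-product scores between the new query and all cached keys.
        d_k = query.shape[-1]
        scores = query @ keys.T / np.sqrt(d_k)                     # (1, n_past + 1)

        # Numerically stable softmax over the cached sequence.
        scores = scores - scores.max(axis=-1, keepdims=True)
        weights = np.exp(scores)
        weights /= weights.sum(axis=-1, keepdims=True)

        return weights @ values                                    # (1, d_v)

A quick check in the spirit of the "uniform attention" test below: when every key is identical, all scores tie, so the output is the mean of the cached values.

    pk = np.zeros((3, 2))
    pv = np.array([[0., 0.], [2., 2.], [4., 4.]])
    nk = np.zeros((1, 2))
    nv = np.array([[6., 6.]])
    q = np.ones((1, 2))
    print(kv_cache_attention(pk, pv, nk, nv, q))  # [[3. 3.]], the mean of all values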

Test Results

  • uniform attention
  • attends to new token
  • attends to past token