In a real LLM serving runtime (vLLM, TGI, llama.cpp), the KV cache is pre-allocated to max_seq_len rows once at request start. As each new token is decoded, its key and value are written into the next free slot. The buffer is never grown.
A naive implementation that does cache_K = np.vstack([cache_K, new_K]) on every step is an anti-pattern: every vstack allocates a fresh array and copies all previously cached tokens. Across N decode steps that's O(N^2) memory traffic — unacceptable when N is in the thousands.
Signature: def kv_cache_step(cache_K, cache_V, cursor, new_K, new_V) -> (cache_K, cache_V, new_cursor)
cache_K: pre-allocated buffer of shape (max_seq_len, d_k)cache_V: pre-allocated buffer of shape (max_seq_len, d_v)cursor: integer index of the next free slot (0 at the start of generation)new_K: shape (d_k,) — the new token's keynew_V: shape (d_v,) — the new token's valueWrite new_K into row cache_K[cursor] and new_V into row cache_V[cursor], then return (cache_K, cache_V, cursor + 1).
Constraints:
np.vstack, np.concatenate, or np.append.np.empty_like / np.zeros.cursor — the unused tail of the buffer must remain whatever it was (typically zeros).The harness verifies the returned cache_K against the expected buffer state.
Math
Asked at
import numpy as np
def kv_cache_step(...):
pass
Premium problem
Free accounts include problems #1–20. Upgrade to unlock the editor, hidden test cases, and reference solutions for every problem.
Already premium?