KV Cache: Pre-allocated Buffer

In a real LLM serving runtime (vLLM, TGI, llama.cpp), the KV cache is pre-allocated to max_seq_len rows once at request start. As each new token is decoded, its key and value are written into the next free slot. The buffer is never grown.

A naive implementation that does cache_K = np.vstack([cache_K, new_K]) on every step is an anti-pattern: every vstack allocates a fresh array and copies all previously cached tokens. Across N decode steps that's O(N^2) memory traffic — unacceptable when N is in the thousands.

Signature: def kv_cache_step(cache_K, cache_V, cursor, new_K, new_V) -> (cache_K, cache_V, new_cursor)

cache_K: pre-allocated buffer of shape (max_seq_len, d_k)
cache_V: pre-allocated buffer of shape (max_seq_len, d_v)
cursor: integer index of the next free slot (0 at the start of generation)
new_K: shape (d_k,) — the new token's key
new_V: shape (d_v,) — the new token's value

Write new_K into row cache_K[cursor] and new_V into row cache_V[cursor], then return (cache_K, cache_V, cursor + 1).

Constraints:

Do NOT call np.vstack, np.concatenate, or np.append.
Do NOT allocate a new buffer with np.empty_like / np.zeros.
Do NOT touch slots past cursor — the unused tail of the buffer must remain whatever it was (typically zeros).

The harness verifies the returned cache_K against the expected buffer state.

Math

cache_K [cursor] \leftarrow k_{new}, cache_V [cursor] \leftarrow v_{new}, cursor + = 1

Asked at