TorchedUp
LearnBetaProblemsSystem DesignSoonPremium
TorchedUp
LearnBetaProblemsSystem DesignSoonPremium
←

213. KV Cache: Pre-allocated Buffer

Medium

In a real LLM serving runtime (vLLM, TGI, llama.cpp), the KV cache is pre-allocated to max_seq_len rows once at request start. As each new token is decoded, its key and value are written into the next free slot. The buffer is never grown.

A naive implementation that does cache_K = np.vstack([cache_K, new_K]) on every step is an anti-pattern: every vstack allocates a fresh array and copies all previously cached tokens. Across N decode steps that's O(N^2) memory traffic — unacceptable when N is in the thousands.

Signature: def kv_cache_step(cache_K, cache_V, cursor, new_K, new_V) -> (cache_K, cache_V, new_cursor)

  • cache_K: pre-allocated buffer of shape (max_seq_len, d_k)
  • cache_V: pre-allocated buffer of shape (max_seq_len, d_v)
  • cursor: integer index of the next free slot (0 at the start of generation)
  • new_K: shape (d_k,) — the new token's key
  • new_V: shape (d_v,) — the new token's value

Write new_K into row cache_K[cursor] and new_V into row cache_V[cursor], then return (cache_K, cache_V, cursor + 1).

Constraints:

  • Do NOT call np.vstack, np.concatenate, or np.append.
  • Do NOT allocate a new buffer with np.empty_like / np.zeros.
  • Do NOT touch slots past cursor — the unused tail of the buffer must remain whatever it was (typically zeros).

The harness verifies the returned cache_K against the expected buffer state.

Math

cache_K[cursor]←knew​,cache_V[cursor]←vnew​,cursor+=1

Asked at

NumPy

import numpy as np

 

def kv_cache_step(...):

    pass

🔒

Premium problem

Free accounts include problems #1–20. Upgrade to unlock the editor, hidden test cases, and reference solutions for every problem.

Upgrade to PremiumBack to problems

Already premium?