KV Cache in a Decoder Loop

Implement the per-layer KV cache that a Transformer decoder uses to support autoregressive generation. At each decoding step, every layer calls cache.append(layer_idx, k, v) to extend its running K/V tensors and get the full tensors back for attention.

This is a Build-in-Context problem. Your append is exercised by a hidden test runner that simulates a decoder loop (multiple layers, many steps, varied per-layer K/V values) and compares the resulting cache state against the reference. The signature contract has to hold for the loop to work: a wrong shape, dtype, or axis, or any mixing of state between layers, surfaces as a diverged cache value.

Contract

KVCache.append(layer_idx, k, v) appends k and v (each of shape (d_model,)) to the running cache for layer_idx and returns the full K and V tensors, each of shape (seq_len + 1, d_model), where seq_len is the number of entries cached for that layer before the call. Cross-layer state must stay independent: appending to layer 0 must not touch layer 1's cache.
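One way to satisfy this contract is sketched below. It is only an illustration under stated assumptions: the problem does not say which array library the runner uses, so NumPy is assumed here, and the private attributes _k and _v are hypothetical names, not part of the reference solution.

```python
import numpy as np


class KVCache:
    """Per-layer key/value cache (illustrative sketch; NumPy is an assumption)."""

    def __init__(self):
        # layer_idx -> list of (d_model,) rows. One list per layer keeps
        # each layer's state independent of the others.
        self._k = {}
        self._v = {}

    def append(self, layer_idx, k, v):
        k = np.asarray(k)
        v = np.asarray(v)
        if k.ndim != 1 or k.shape != v.shape:
            raise ValueError("k and v must both have shape (d_model,)")
        # Copy so that later in-place changes by the caller cannot alter the cache.
        self._k.setdefault(layer_idx, []).append(k.copy())
        self._v.setdefault(layer_idx, []).append(v.copy())
        # Stack along axis 0 so the result has shape (seq_len + 1, d_model).
        K = np.stack(self._k[layer_idx], axis=0)
        V = np.stack(self._v[layer_idx], axis=0)
        return K, V


# Usage: two steps on layer 0, one step on layer 1.
cache = KVCache()
K0, V0 = cache.append(0, np.ones(4, dtype=np.float32), np.zeros(4, dtype=np.float32))
K0, V0 = cache.append(0, 2 * np.ones(4, dtype=np.float32), np.zeros(4, dtype=np.float32))
K1, V1 = cache.append(1, 3 * np.ones(4, dtype=np.float32), np.zeros(4, dtype=np.float32))
print(K0.shape, K1.shape)
```

Restacking the whole list on every call costs time proportional to the current sequence length. That keeps the sketch short; a preallocated buffer would avoid the repeated copy but needs a known maximum length, which the problem statement does not give.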

Math

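K_{t}^{(ℓ)} = [K_{t-1}^{(ℓ)}; k_{t}^{(ℓ)}], V_{t}^{(ℓ)} = [V_{t-1}^{(ℓ)}; v_{t}^{(ℓ)}]

Here [A; b] stacks b as a new row beneath A, that is, concatenation along the sequence axis, and ℓ is the layer index. Before the first append, K_{0}^{(ℓ)} and V_{0}^{(ℓ)} are empty.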
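The recurrence can be checked on a single layer with a few lines of NumPy. This is a sketch under assumptions: NumPy as the array library, and d_model = 3 with constant-valued keys are arbitrary choices made only to keep the rows easy to read.

```python
import numpy as np

d_model = 3  # arbitrary size for illustration
K = np.empty((0, d_model), dtype=np.float32)  # K_0: no cached rows yet

for t in range(1, 4):
    k_t = np.full(d_model, t, dtype=np.float32)  # stand-in for the step-t key
    # K_t = [K_{t-1}; k_t]: lift k_t to shape (1, d_model), then join on axis 0.
    K = np.concatenate([K, k_t[None, :]], axis=0)
    assert K.shape == (t, d_model)

print(K)
```

Skipping the k_t[None, :] reshape makes np.concatenate raise an error because the two inputs have different numbers of dimensions, while joining along axis 1 fails on the mismatched row counts. Both are versions of the wrong-axis mistake the contract guards against.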
