Implement symmetric INT8 quantization for KV cache tensors to reduce memory usage during inference.
Signature: def quantize_kv_cache(K, V)
Inputs:
- K: (S, d_head) float array — key cache
- V: (S, d_head) float array — value cache

Returns: (K_q, K_scale, V_q, V_scale)
- K_q: (S, d_head) — INT8 quantized K (values in [-127, 127])
- K_scale: float — scale factor for K
- V_q: (S, d_head) — INT8 quantized V
- V_scale: float — scale factor for V

Quantization, per tensor, with max_val = 127 (= 2^(bits-1) - 1):
scale = max(|tensor|) / max_val # per-tensor scale
q = round(tensor / scale) # round to nearest integer
q = clip(q, -max_val, max_val) # clamp to int8 range
Dequantize: tensor_approx = q * scale
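A minimal NumPy sketch of the steps above (the `dequantize` helper and the all-zero guard are additions not spelled out in the prompt):

```python
import numpy as np

def quantize_kv_cache(K, V):
    """Symmetric per-tensor INT8 quantization of the KV cache."""
    max_val = 127  # 2^(bits-1) - 1 for INT8

    def _quantize(t):
        scale = float(np.abs(t).max()) / max_val  # per-tensor scale
        if scale == 0.0:                          # all-zero tensor: avoid divide-by-zero
            return np.zeros_like(t, dtype=np.int8), 1.0
        q = np.round(t / scale)                   # round to nearest integer
        q = np.clip(q, -max_val, max_val)         # clamp to the symmetric INT8 range
        return q.astype(np.int8), scale

    K_q, K_scale = _quantize(K)
    V_q, V_scale = _quantize(V)
    return K_q, K_scale, V_q, V_scale

def dequantize(q, scale):
    """Recover an FP32 approximation: tensor_approx = q * scale."""
    return q.astype(np.float32) * scale
```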
Error: at most scale / 2 per element. Because the per-tensor scale is set from max(|tensor|), no value exceeds the clip range, so the only error is rounding.
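A quick empirical check of that bound using the sketch above (random test data; the small float-round-off tolerance is an assumption):

```python
rng = np.random.default_rng(0)
K = rng.standard_normal((16, 64)).astype(np.float32)
V = rng.standard_normal((16, 64)).astype(np.float32)

K_q, K_scale, V_q, V_scale = quantize_kv_cache(K, V)
max_err = np.abs(K - dequantize(K_q, K_scale)).max()
assert max_err <= K_scale / 2 + 1e-6  # per-element rounding error bound
print(f"max abs error {max_err:.6f} <= bound {K_scale / 2:.6f}")
```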
Symmetric quantization keeps zero exactly representable (q = 0 maps back to 0.0), which matters because key/value activations are often near zero. It uses a single scale per tensor with no zero-point, keeping the quantize and dequantize paths cheap on hardware.
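To see the exact-zero property concretely, a tiny demo with the sketch above (the sample values are arbitrary):

```python
x = np.array([[-0.5, 0.0, 0.25, 1.0]], dtype=np.float32)
x_q, x_scale, _, _ = quantize_kv_cache(x, x)
# 0.0 quantizes to q = 0, which dequantizes back to exactly 0.0
assert x_q[0, 1] == 0 and dequantize(x_q, x_scale)[0, 1] == 0.0
```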
INT8 vs FP16 for the KV cache: a 2× reduction in KV cache memory, enabling roughly 2× longer sequences at the same hardware cost.
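For a sense of scale, a back-of-the-envelope memory calculation (the model dimensions here are illustrative assumptions, not from the prompt):

```python
# Per layer the cache holds 2 tensors (K and V) of shape (seq_len, heads * d_head).
layers, heads, d_head, seq_len = 32, 32, 128, 4096
elems = 2 * layers * heads * d_head * seq_len
fp16_gib = elems * 2 / 2**30  # 2 bytes per element
int8_gib = elems * 1 / 2**30  # 1 byte per element; per-tensor scales are negligible
print(f"FP16: {fp16_gib:.1f} GiB  INT8: {int8_gib:.1f} GiB")  # 2.0 GiB vs 1.0 GiB
```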