Implement symmetric INT8 quantization for KV cache tensors to reduce memory usage during inference.
Signature: def quantize_kv_cache(K, V)
Inputs:
- K: (S, d_head) float array — key cache
- V: (S, d_head) float array — value cache

Returns: (K_q, K_scale, V_q, V_scale)
- K_q: (S, d_head) — INT8 quantized K (values in [-127, 127])
- K_scale: float — scale factor for K
- V_q: (S, d_head) — INT8 quantized V
- V_scale: float — scale factor for V

Quantization, per tensor, with max_val = 127 (= 2^(bits-1) - 1):
scale = max(|tensor|) / max_val # per-tensor scale
q = round(tensor / scale) # round to nearest integer
q = clip(q, -max_val, max_val) # clamp to int8 range
Dequantize: tensor_approx = q * scale
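A minimal NumPy sketch of the steps above (the `dequantize` helper and the all-zero guard are additions not spelled out in the prompt):

```python
import numpy as np

def quantize_kv_cache(K, V):
    """Symmetric per-tensor INT8 quantization of the KV cache."""
    max_val = 127  # 2^(bits-1) - 1 for INT8

    def _quantize(t):
        scale = float(np.abs(t).max()) / max_val  # per-tensor scale
        if scale == 0.0:                          # all-zero tensor: avoid divide-by-zero
            return np.zeros_like(t, dtype=np.int8), 1.0
        q = np.round(t / scale)                   # round to nearest integer
        q = np.clip(q, -max_val, max_val)         # clamp to the symmetric INT8 range
        return q.astype(np.int8), scale

    K_q, K_scale = _quantize(K)
    V_q, V_scale = _quantize(V)
    return K_q, K_scale, V_q, V_scale

def dequantize(q, scale):
    """Recover an FP32 approximation: tensor_approx = q * scale."""
    return q.astype(np.float32) * scale
```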
Error: at most scale / 2 per element. Because the per-tensor scale is set from max(|tensor|), no value exceeds the clip range, so the only error is rounding.
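A quick empirical check of that bound using the sketch above (random test data; the small float-round-off tolerance is an assumption):

```python
rng = np.random.default_rng(0)
K = rng.standard_normal((16, 64)).astype(np.float32)
V = rng.standard_normal((16, 64)).astype(np.float32)

K_q, K_scale, V_q, V_scale = quantize_kv_cache(K, V)
max_err = np.abs(K - dequantize(K_q, K_scale)).max()
assert max_err <= K_scale / 2 + 1e-6  # per-element rounding error bound
print(f"max abs error {max_err:.6f} <= bound {K_scale / 2:.6f}")
```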
Symmetric quantization keeps zero exactly representable (q = 0 maps back to 0.0), which matters because key/value activations are often near zero. It uses a single scale per tensor with no zero-point, keeping the quantize and dequantize paths cheap on hardware.
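To see the exact-zero property concretely, a tiny demo with the sketch above (the sample values are arbitrary):

```python
x = np.array([[-0.5, 0.0, 0.25, 1.0]], dtype=np.float32)
x_q, x_scale, _, _ = quantize_kv_cache(x, x)
# 0.0 quantizes to q = 0, which dequantizes back to exactly 0.0
assert x_q[0, 1] == 0 and dequantize(x_q, x_scale)[0, 1] == 0.0
```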
INT8 vs FP16 for the KV cache: a 2× reduction in KV cache memory, enabling roughly 2× longer sequences at the same hardware cost.
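For a sense of scale, a back-of-the-envelope memory calculation (the model dimensions here are illustrative assumptions, not from the prompt):

```python
# Per layer the cache holds 2 tensors (K and V) of shape (seq_len, heads * d_head).
layers, heads, d_head, seq_len = 32, 32, 128, 4096
elems = 2 * layers * heads * d_head * seq_len
fp16_gib = elems * 2 / 2**30  # 2 bytes per element
int8_gib = elems * 1 / 2**30  # 1 byte per element; per-tensor scales are negligible
print(f"FP16: {fp16_gib:.1f} GiB  INT8: {int8_gib:.1f} GiB")  # 2.0 GiB vs 1.0 GiB
```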