
KV Cache Quantization (INT8 Symmetric)

Implement symmetric INT8 quantization for KV cache tensors to reduce memory usage during inference.

Signature: def quantize_kv_cache(K, V)

  • K: (S, d_head) float array — key cache
  • V: (S, d_head) float array — value cache
  • Returns: (K_q, K_scale, V_q, V_scale)
    • K_q: (S, d_head) — INT8 quantized K (values in [-127, 127])
    • K_scale: float — scale factor for K
    • V_q: (S, d_head) — INT8 quantized V
    • V_scale: float — scale factor for V
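
For concreteness, a call might look like the sketch below; the shapes S=128 and d_head=64 are illustrative only, and quantize_kv_cache refers to the reference sketch given further down.

import numpy as np

# Illustrative shapes (not part of the problem statement).
S, d_head = 128, 64
K = np.random.randn(S, d_head).astype(np.float32)
V = np.random.randn(S, d_head).astype(np.float32)

K_q, K_scale, V_q, V_scale = quantize_kv_cache(K, V)

assert K_q.shape == (S, d_head) and K_q.dtype == np.int8
assert V_q.shape == (S, d_head) and V_q.dtype == np.int8
assert isinstance(K_scale, float) and isinstance(V_scale, float)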

Symmetric INT8 Quantization

max_val = 127  (= 2^(bits-1) - 1)

scale = max(|tensor|) / max_val      # per-tensor scale
q = round(tensor / scale)            # round to nearest integer
q = clip(q, -max_val, max_val)       # clamp to int8 range

Dequantize: tensor_approx = q * scale

Error: At most scale / 2 per element (rounding error).
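
A minimal NumPy sketch of quantize_kv_cache that follows these formulas; the guard for an all-zero tensor is an added assumption, not something the problem specifies.

import numpy as np

def quantize_kv_cache(K, V):
    max_val = 127  # 2^(8-1) - 1 for INT8

    def _quantize(x):
        x = np.asarray(x, dtype=np.float32)
        # Per-tensor scale from the largest absolute value.
        scale = float(np.max(np.abs(x)) / max_val)
        if scale == 0.0:
            scale = 1.0  # assumed fallback so an all-zero tensor does not divide by zero
        # Round to nearest, clamp to the INT8 range, store as int8.
        q = np.clip(np.round(x / scale), -max_val, max_val).astype(np.int8)
        return q, scale

    K_q, K_scale = _quantize(K)
    V_q, V_scale = _quantize(V)
    return K_q, K_scale, V_q, V_scale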

Why Symmetric?

Symmetric quantization keeps zero exactly representable (q=0 maps to 0.0), which matters because key and value activations are often near zero. It uses a single per-tensor scale with no zero-point, so dequantization is a single multiply and maps well to hardware.
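
As a small self-check of both properties, here is a hedged example with made-up values, assuming the quantize_kv_cache sketch above:

import numpy as np

K = np.array([[0.0, 0.5, -1.27], [0.31, -0.02, 0.9]], dtype=np.float32)
V = np.array([[1.0, -2.54, 0.0], [0.7, 0.1, -0.3]], dtype=np.float32)

K_q, K_scale, V_q, V_scale = quantize_kv_cache(K, V)

# Exact zeros quantize to q = 0 and dequantize back to exactly 0.0.
assert np.all(K_q[K == 0.0] == 0)

# Round-trip error stays within scale / 2 per element (plus float noise).
assert np.max(np.abs(K_q * K_scale - K)) <= K_scale / 2 + 1e-6
assert np.max(np.abs(V_q * V_scale - V)) <= V_scale / 2 + 1e-6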

Memory Savings

Storing the KV cache in INT8 instead of FP16 halves its memory footprint (1 byte vs 2 bytes per element), allowing roughly 2× longer sequences within the same memory budget.
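
A rough worked example with hypothetical model dimensions (not taken from the problem):

# Hypothetical dimensions: 32 layers, 32 heads, d_head = 128, S = 4096, batch 1.
n_layers, n_heads, d_head, S = 32, 32, 128, 4096
elements = 2 * n_layers * n_heads * d_head * S   # K and V together

fp16_bytes = 2 * elements   # 2 bytes per element -> 2.0 GiB
int8_bytes = 1 * elements   # 1 byte per element  -> 1.0 GiB

The per-tensor scales add only a couple of floats per layer, so the overall saving stays effectively 2×.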

Python (numpy)

Test Results

  • seed=42, 3×4 K/V — check quantized values and scales
  • seed=7, 2×2 K/V
  • single value 2.54 — scale=0.02, quantized=127
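
For the single-value case this follows directly from the formulas: scale = 2.54 / 127 = 0.02 and q = round(2.54 / 0.02) = 127.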