Implement symmetric INT8 quantization for KV cache tensors to reduce memory usage during inference.
Signature: def quantize_kv_cache(K, V)
Inputs:
K: (S, d_head) float array, the key cache
V: (S, d_head) float array, the value cache
Returns: (K_q, K_scale, V_q, V_scale)
K_q: (S, d_head), INT8 quantized K (values in [-127, 127])
K_scale: float, scale factor for K
V_q: (S, d_head), INT8 quantized V
V_scale: float, scale factor for V
We use the INT8 range [-127, 127], dropping -128 to keep the range symmetric. Compute a single per-tensor scale so that the largest-magnitude entry maps onto the edge of that range (scale = max(|x|) / 127), then divide the tensor by the scale, round to the nearest integer, and clamp to [-127, 127]. Quantize K and V independently: each has its own scale based on its own maximum absolute value.
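The steps above can be sketched roughly as follows. This is one possible implementation, not a reference solution: the helper name _quantize_symmetric is hypothetical, and the fallback to a scale of 1.0 for an all-zero tensor is an assumption, since the problem statement does not say what to do in that case.

```python
import numpy as np

INT8_MAX = 127  # symmetric range [-127, 127]; -128 is left unused


def _quantize_symmetric(x):
    """Per-tensor symmetric INT8 quantization of one float array (hypothetical helper)."""
    x = np.asarray(x, dtype=np.float64)
    max_abs = float(np.max(np.abs(x))) if x.size else 0.0
    # Assumption: use scale 1.0 for an all-zero or empty tensor to avoid
    # dividing by zero; the problem statement does not cover this case.
    scale = max_abs / INT8_MAX if max_abs > 0 else 1.0
    # np.round rounds exact halves to the nearest even integer.
    q = np.clip(np.round(x / scale), -INT8_MAX, INT8_MAX).astype(np.int8)
    return q, scale


def quantize_kv_cache(K, V):
    # K and V are quantized independently, each with its own scale.
    K_q, K_scale = _quantize_symmetric(K)
    V_q, V_scale = _quantize_symmetric(V)
    return K_q, K_scale, V_q, V_scale
```

With this sketch, the entry of largest magnitude in each tensor lands on 127 or -127, and every exact zero stays 0.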
Dequantize (for reference): multiply the integer values by the saved scale.
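A minimal sketch of that reference step, assuming the quantized values arrive as a NumPy INT8 array. The function name dequantize and the choice of float32 output are illustrative assumptions, not part of the required signature.

```python
import numpy as np


def dequantize(q, scale):
    """Map INT8 codes back to approximate float values (hypothetical helper)."""
    # Cast before multiplying so the arithmetic happens in float, not int8.
    return np.asarray(q).astype(np.float32) * np.float32(scale)
```

For example, dequantize(np.array([127, -127, 0], dtype=np.int8), 0.5) gives 63.5, -63.5 and 0.0.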
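Error: the absolute round-trip error is at most scale / 2 per element. The only loss is rounding, because the largest-magnitude entry maps to ±127 and so clamping never discards anything.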
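The bound can be written out in one line. Here s stands for the per-tensor scale, x for one original entry, and x̂ for its dequantized value; these symbols are introduced only for this derivation. It assumes s > 0 and exact arithmetic, and uses \operatorname from amsmath.

```latex
\[
q = \operatorname{round}\!\left(\frac{x}{s}\right), \qquad
\hat{x} = s\,q, \qquad
|x - \hat{x}| = s\,\left|\frac{x}{s} - \operatorname{round}\!\left(\frac{x}{s}\right)\right| \le \frac{s}{2}.
\]
```

Rounding to the nearest integer moves a value by at most 1/2, and multiplying back by s turns that into s/2. Since |x| ≤ 127 s for every entry, |x/s| ≤ 127 and the clamp leaves q unchanged.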
Symmetric quantization keeps zero exactly representable (q = 0 maps to 0.0), which matters for tensors with many entries at or near zero. It uses a single scale per tensor and no zero-point, so dequantization is a single multiply, which is cheap on hardware.
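INT8 vs FP16 KV cache: INT8 stores one byte per element instead of two, so KV cache memory drops by about 2× (the per-tensor scales add a negligible overhead). The same memory budget can therefore hold sequences roughly 2× longer.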
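To make the arithmetic concrete, here is a small sketch that counts KV cache bytes for one sequence. The model shape (layers, heads, sequence length) is hypothetical and chosen only for illustration, as is the assumption that one float32 scale is stored per K and per V tensor for each layer and head.

```python
def kv_cache_bytes(seq_len, n_layers, n_heads, d_head, bytes_per_elem):
    """Bytes needed to hold K and V for one sequence."""
    # The factor of 2 covers the two tensors (K and V) per layer and head.
    return 2 * n_layers * n_heads * seq_len * d_head * bytes_per_elem


# Hypothetical model shape, for illustration only.
shape = dict(seq_len=4096, n_layers=32, n_heads=32, d_head=128)

fp16_bytes = kv_cache_bytes(**shape, bytes_per_elem=2)
int8_bytes = kv_cache_bytes(**shape, bytes_per_elem=1)
# Assumed layout: one float32 scale per K and per V tensor, per layer and head.
scale_bytes = 2 * shape["n_layers"] * shape["n_heads"] * 4

print(fp16_bytes / 2**30)  # 2.0 (GiB)
print(int8_bytes / 2**30)  # 1.0 (GiB)
print(scale_bytes)         # 8192 (bytes)
```

Under these assumptions the scales cost 8 KiB against a 1 GiB INT8 cache, so the saving is 2× for practical purposes.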
import numpy as np

def quantize_kv_cache(K, V):
    # Return (K_q, K_scale, V_q, V_scale)
    pass