TorchedUp

86. KV Cache Quantization (INT8)

Medium

Implement symmetric INT8 quantization for KV cache tensors to reduce memory usage during inference.

Signature: def quantize_kv_cache(K, V)

  • K: (S, d_head) float array — key cache
  • V: (S, d_head) float array — value cache
  • Returns: (K_q, K_scale, V_q, V_scale)
    • K_q: (S, d_head) — INT8 quantized K (values in [-127, 127])
    • K_scale: float — scale factor for K
    • V_q: (S, d_head) — INT8 quantized V
    • V_scale: float — scale factor for V

Symmetric INT8 Quantization

The signed 8-bit range is [-128, 127]; we use only [-127, 127] so that the range is symmetric around zero. Compute a single per-tensor scale, scale = max(|x|) / 127, so that the largest-magnitude entry maps to ±127. Then divide the tensor by the scale, round to the nearest integer, and clamp to [-127, 127]. Quantize K and V independently: each has its own scale based on its own maximum absolute value.
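The steps above can be sketched in NumPy as follows. This is a minimal sketch under stated assumptions, not the site's reference solution: the helper name `_quantize_symmetric` is hypothetical, and falling back to a scale of 1.0 for an all-zero tensor is an assumption made here to avoid dividing by zero.

```python
import numpy as np

def _quantize_symmetric(x):
    # Hypothetical helper: per-tensor symmetric INT8 quantization.
    x = np.asarray(x, dtype=np.float64)
    max_abs = float(np.max(np.abs(x))) if x.size else 0.0
    # Assumption: use scale 1.0 when the tensor is all zeros (or empty).
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    # np.round rounds exact halves to the nearest even integer.
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def quantize_kv_cache(K, V):
    # K and V are quantized independently, each with its own scale.
    K_q, K_scale = _quantize_symmetric(K)
    V_q, V_scale = _quantize_symmetric(V)
    return K_q, K_scale, V_q, V_scale
```

For example, a K tensor whose largest-magnitude entry is -2.0 gets K_scale = 2 / 127, and that entry is stored as -127.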

Dequantize (for reference): multiply the integer values by the saved scale.

Error: rounding moves each scaled value by at most 0.5, so the dequantized value differs from the original by at most scale / 2 per element.
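A quick round trip makes the dequantization step and the error bound concrete. This is an illustrative check on random data; the tensor shape, the seed, and the small floating-point tolerance are assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
K = rng.standard_normal((16, 8))  # hypothetical (S, d_head) key cache

# Quantize: per-tensor scale, round, clamp to [-127, 127].
scale = np.max(np.abs(K)) / 127.0
K_q = np.clip(np.round(K / scale), -127, 127).astype(np.int8)

# Dequantize: multiply the integers by the saved scale.
K_hat = K_q.astype(np.float64) * scale

# The worst-case reconstruction error should not exceed scale / 2
# (a tiny tolerance absorbs floating-point rounding).
max_err = np.max(np.abs(K - K_hat))
within_bound = max_err <= scale / 2 + 1e-12
print(within_bound)
```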

Why Symmetric?

Symmetric quantization keeps zero exactly representable (q = 0 maps to 0.0), so zero entries in K and V survive quantization without error. It also needs no zero-point: dequantization is a single multiplication by the scale, which makes it fast on hardware.

Memory Savings

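INT8 stores each element in 1 byte instead of the 2 bytes FP16 needs, so the KV cache shrinks by 2× (the per-tensor scales add only a few bytes). Within the same memory budget, that allows roughly 2× longer sequences.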
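The arithmetic can be checked with a short calculation. The model dimensions below are hypothetical, picked only to give round numbers, and the sketch assumes one float32 scale per layer and head for each of K and V.

```python
def kv_cache_bytes(n_layers, n_heads, seq_len, d_head, bytes_per_elem):
    # The factor of 2 accounts for storing both K and V.
    return 2 * n_layers * n_heads * seq_len * d_head * bytes_per_elem

# Hypothetical model dimensions, chosen only for illustration.
cfg = dict(n_layers=32, n_heads=32, seq_len=4096, d_head=128)

fp16_bytes = kv_cache_bytes(**cfg, bytes_per_elem=2)
int8_bytes = kv_cache_bytes(**cfg, bytes_per_elem=1)

# Assumption: one float32 scale per (layer, head) for each of K and V.
scale_overhead = 2 * cfg["n_layers"] * cfg["n_heads"] * 4

print(f"FP16: {fp16_bytes / 2**30:.2f} GiB")
print(f"INT8: {(int8_bytes + scale_overhead) / 2**30:.2f} GiB")
```

With these dimensions the FP16 cache takes 2 GiB and the INT8 cache takes 1 GiB plus 8 KiB of scales.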

NumPy

import numpy as np

def quantize_kv_cache(K, V):
    # Return (K_q, K_scale, V_q, V_scale) as described in the signature above.
    pass
