TorchedUp · Problems

KV Cache Memory Budget (Medium)

Given a VRAM budget for the KV cache, compute the maximum number of concurrent sequences that can be served at full sequence length.

Signature: def max_concurrent_seqs(vram_for_kv_bytes: int, seq_len: int, n_layers: int, n_kv_heads: int, head_dim: int, dtype_bytes: int) -> int

Per-sequence KV bytes: 2 * seq_len * n_layers * n_kv_heads * head_dim * dtype_bytes (factor 2 for both K and V).

Return: vram_for_kv_bytes // per_seq_bytes.

Example:

  • 40 GB for KV, seq_len=4096, 32 layers, 8 KV heads (GQA), head_dim=128, fp16 (2 bytes) → per-seq = 2 × 4096 × 32 × 8 × 128 × 2 = 536,870,912 bytes = 512 MB → 40 GB / 512 MB = 80 concurrent seqs.
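A direct translation of the formula above into code (a sketch only; the site's reference solution is not shown):

```python
def max_concurrent_seqs(vram_for_kv_bytes: int, seq_len: int, n_layers: int,
                        n_kv_heads: int, head_dim: int, dtype_bytes: int) -> int:
    # K and V each hold seq_len * n_layers * n_kv_heads * head_dim elements,
    # hence the factor of 2.
    per_seq_bytes = 2 * seq_len * n_layers * n_kv_heads * head_dim * dtype_bytes
    # Floor division: a partially fitting sequence cannot be served.
    return vram_for_kv_bytes // per_seq_bytes

# The worked example: 40 GB for KV, seq_len=4096, 32 layers, 8 KV heads,
# head_dim=128, fp16 (2 bytes).
print(max_concurrent_seqs(40 * 1024**3, 4096, 32, 8, 128, 2))  # → 80
```

Note that the fp8-cache test presumably just passes `dtype_bytes=1`, doubling the count relative to fp16 for the same budget.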

Math

Python (numpy)

Test Results

  • llama-style GQA
  • short ctx, small model
  • fp8 cache (Premium)