TorchedUp
LearnBetaProblemsSystem DesignSoonPremium
TorchedUp
LearnBetaProblemsSystem DesignSoonPremium
←

144. KV Cache Memory Budget

Medium

Given a VRAM budget for KV cache, compute the max concurrent sequences we can serve at full sequence length.

Signature: def max_concurrent_seqs(vram_for_kv_bytes: int, seq_len: int, n_layers: int, n_kv_heads: int, head_dim: int, dtype_bytes: int) -> int

Per-sequence KV bytes: 2 * seq_len * n_layers * n_kv_heads * head_dim * dtype_bytes (factor 2 for both K and V).

Return: vram_for_kv_bytes // per_seq_bytes.

Example:

  • 40 GB for KV, seq_len=4096, 32 layers, 8 KV heads (GQA), head_dim=128, fp16 (2 bytes) → per-seq = 24096328128*2 = 536870912 bytes = 512 MB → 80 concurrent seqs.

Math

Nseq​=⌊2⋅L⋅H⋅K⋅D⋅bV​⌋

Asked at

NumPy

import numpy as np

 

def max_concurrent_seqs(...):

    pass

🔒

Premium problem

Free accounts include problems #1–20. Upgrade to unlock the editor, hidden test cases, and reference solutions for every problem.

Upgrade to PremiumBack to problems

Already premium?