Given a VRAM budget for KV cache, compute the max concurrent sequences we can serve at full sequence length.
Signature: def max_concurrent_seqs(vram_for_kv_bytes: int, seq_len: int, n_layers: int, n_kv_heads: int, head_dim: int, dtype_bytes: int) -> int
Per-sequence KV bytes: 2 * seq_len * n_layers * n_kv_heads * head_dim * dtype_bytes (factor 2 for both K and V).
Return: vram_for_kv_bytes // per_seq_bytes.
Example:
Math
Asked at
Test Results