Given a model and a target max batch/sequence, compute the minimum number of GPUs needed so the total memory (weights + KV + overhead) fits.
Signature: def min_gpus_to_serve(n_params: int, dtype_bytes: int, max_batch: int, max_seq_len: int, n_layers: int, n_kv_heads: int, head_dim: int, gpu_vram_gb: int, activation_overhead_gb: float = 4.0) -> int
Memory components:
- Weights: n_params * dtype_bytes
- KV cache: 2 * max_batch * max_seq_len * n_layers * n_kv_heads * head_dim * dtype_bytes (the factor of 2 covers both K and V)
- Activation overhead: activation_overhead_gb * 1e9

Total bytes = sum of the three. GPU bytes = gpu_vram_gb * 1e9.
Return math.ceil(total_bytes / gpu_bytes).
Example: 7B fp16 (weights ~14 GB), B=4, S=4096 with KV ~2 GB, overhead 4 GB, on 24 GB GPUs: total ~20 GB -> 1 GPU.
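A minimal sketch of the function described above, directly translating the three memory components into arithmetic:

```python
import math

def min_gpus_to_serve(
    n_params: int,
    dtype_bytes: int,
    max_batch: int,
    max_seq_len: int,
    n_layers: int,
    n_kv_heads: int,
    head_dim: int,
    gpu_vram_gb: int,
    activation_overhead_gb: float = 4.0,
) -> int:
    # Weights: one dtype-sized value per parameter.
    weight_bytes = n_params * dtype_bytes
    # KV cache: K and V (factor of 2) per token, per layer, per KV head.
    kv_bytes = (
        2 * max_batch * max_seq_len * n_layers * n_kv_heads * head_dim * dtype_bytes
    )
    # Fixed activation/workspace overhead, given in GB.
    overhead_bytes = activation_overhead_gb * 1e9
    total_bytes = weight_bytes + kv_bytes + overhead_bytes
    gpu_bytes = gpu_vram_gb * 1e9
    return math.ceil(total_bytes / gpu_bytes)
```

Plugging in illustrative values for the worked example (a 7B fp16 model; 32 layers, 8 KV heads, and head_dim 128 are assumed here to make the KV cache come out near 2 GB): `min_gpus_to_serve(7_000_000_000, 2, 4, 4096, 32, 8, 128, 24)` gives 14 GB + ~2.1 GB + 4 GB ≈ 20 GB, which fits on one 24 GB GPU.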