Given a model and a target max batch/sequence, compute the minimum number of GPUs needed so the total memory (weights + KV + overhead) fits.
Signature: def min_gpus_to_serve(n_params: int, dtype_bytes: int, max_batch: int, max_seq_len: int, n_layers: int, n_kv_heads: int, head_dim: int, gpu_vram_gb: int, activation_overhead_gb: float = 4.0) -> int
Memory components:
- Weights: n_params * dtype_bytes
- KV cache: 2 * max_batch * max_seq_len * n_layers * n_kv_heads * head_dim * dtype_bytes (the factor of 2 covers both K and V)
- Activation overhead: activation_overhead_gb * 1e9

Total bytes = sum of the three. GPU bytes = gpu_vram_gb * 1e9.
Return math.ceil(total_bytes / gpu_bytes).
Example: 7B fp16 (weights ~14 GB), B=4, S=4096 with KV ~2 GB, overhead 4 GB, on 24 GB GPUs: total ~20 GB -> 1 GPU.
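A minimal sketch of the function described above, directly translating the three memory components into arithmetic:

```python
import math

def min_gpus_to_serve(
    n_params: int,
    dtype_bytes: int,
    max_batch: int,
    max_seq_len: int,
    n_layers: int,
    n_kv_heads: int,
    head_dim: int,
    gpu_vram_gb: int,
    activation_overhead_gb: float = 4.0,
) -> int:
    # Weights: one dtype-sized value per parameter.
    weight_bytes = n_params * dtype_bytes
    # KV cache: K and V (factor of 2) per token, per layer, per KV head.
    kv_bytes = (
        2 * max_batch * max_seq_len * n_layers * n_kv_heads * head_dim * dtype_bytes
    )
    # Fixed activation/workspace overhead, given in GB.
    overhead_bytes = activation_overhead_gb * 1e9
    total_bytes = weight_bytes + kv_bytes + overhead_bytes
    gpu_bytes = gpu_vram_gb * 1e9
    return math.ceil(total_bytes / gpu_bytes)
```

Plugging in illustrative values for the worked example (a 7B fp16 model; 32 layers, 8 KV heads, and head_dim 128 are assumed here to make the KV cache come out near 2 GB): `min_gpus_to_serve(7_000_000_000, 2, 4, 4096, 32, 8, 128, 24)` gives 14 GB + ~2.1 GB + 4 GB ≈ 20 GB, which fits on one 24 GB GPU.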