TorchedUp

119. Minimum GPUs to Serve

Hard

Given a model and a target maximum batch size and sequence length, compute the minimum number of GPUs needed so that the total memory (weights + KV cache + overhead) fits.

Signature: def min_gpus_to_serve(n_params: int, dtype_bytes: int, max_batch: int, max_seq_len: int, n_layers: int, n_kv_heads: int, head_dim: int, gpu_vram_gb: int, activation_overhead_gb: float = 4.0) -> int

Memory components:

  • Weights: n_params * dtype_bytes
  • KV cache (fp16, 2 bytes): 2 * max_batch * max_seq_len * n_layers * n_kv_heads * head_dim * 2
  • Overhead (activation/runtime): activation_overhead_gb * 1e9

Total bytes = sum of the three. GPU bytes = gpu_vram_gb * 1e9.

Return math.ceil(total_bytes / gpu_bytes).
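The steps above translate fairly directly into code. The following is a minimal sketch that follows the stated formulas term by term; it is not the site's reference solution, and it assumes decimal gigabytes (1 GB = 1e9 bytes) and an fp16 KV cache, as the statement specifies. The local variable names (`weight_bytes`, `kv_bytes`, and so on) are illustrative.

```python
import math


def min_gpus_to_serve(
    n_params: int,
    dtype_bytes: int,
    max_batch: int,
    max_seq_len: int,
    n_layers: int,
    n_kv_heads: int,
    head_dim: int,
    gpu_vram_gb: int,
    activation_overhead_gb: float = 4.0,
) -> int:
    # Model weights: one value of dtype_bytes per parameter.
    weight_bytes = n_params * dtype_bytes
    # KV cache: leading 2 = keys and values, trailing 2 = fp16 bytes per element.
    kv_bytes = 2 * max_batch * max_seq_len * n_layers * n_kv_heads * head_dim * 2
    # Fixed activation/runtime overhead in decimal gigabytes.
    overhead_bytes = activation_overhead_gb * 1e9

    total_bytes = weight_bytes + kv_bytes + overhead_bytes
    gpu_bytes = gpu_vram_gb * 1e9
    # Round up: a fractional GPU still requires a whole device.
    return math.ceil(total_bytes / gpu_bytes)
```

Note that this counts aggregate memory only; it does not model per-GPU replication of overhead or any sharding constraints, which the problem statement does not ask for.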

Example: a 7B-parameter model in fp16 (14 GB of weights), B=4 and S=4096 with a KV cache of ~2 GB, overhead 4 GB, on 24 GB GPUs: total ~20 GB -> 1 GPU.
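To check the arithmetic in the example, the breakdown can be computed directly. The example does not state the model shape, so the values below (32 layers, 8 KV heads, head dimension 128) are assumptions chosen because they yield a KV cache of roughly 2 GB; other shapes would give different numbers.

```python
import math

# Values from the example: 7B parameters in fp16, B=4, S=4096, 24 GB GPUs.
n_params, dtype_bytes = 7_000_000_000, 2
max_batch, max_seq_len = 4, 4096
gpu_vram_gb, activation_overhead_gb = 24, 4.0

# ASSUMED shape (not given in the problem): chosen so the KV cache is ~2 GB.
n_layers, n_kv_heads, head_dim = 32, 8, 128

weights_gb = n_params * dtype_bytes / 1e9
# Leading 2 = keys and values, trailing 2 = fp16 bytes per element.
kv_gb = 2 * max_batch * max_seq_len * n_layers * n_kv_heads * head_dim * 2 / 1e9
total_gb = weights_gb + kv_gb + activation_overhead_gb
n_gpus = math.ceil(total_gb / gpu_vram_gb)

print(f"weights={weights_gb:.2f} GB, kv={kv_gb:.2f} GB, "
      f"total={total_gb:.2f} GB, gpus={n_gpus}")
# prints: weights=14.00 GB, kv=2.15 GB, total=20.15 GB, gpus=1
```

Under these assumptions the total is about 84% of a single 24 GB GPU, so one device suffices.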

Math

$$n = \left\lceil \frac{M_W + M_{KV} + M_{overhead}}{V_{GPU}} \right\rceil$$


NumPy

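import math
import numpy as np

def min_gpus_to_serve(n_params: int, dtype_bytes: int, max_batch: int, max_seq_len: int, n_layers: int, n_kv_heads: int, head_dim: int, gpu_vram_gb: int, activation_overhead_gb: float = 4.0) -> int:
    # Starter stub: sum the three memory components and return math.ceil(total_bytes / gpu_bytes).
    pass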
