TorchedUp · Problems

Total Inference Memory (Medium)

Given the three main components of inference memory, return the total in gigabytes (using the GB = 1e9 bytes convention).

Signature: def inference_memory_gb(n_params: int, dtype_bytes: int, kv_cache_bytes: int, activation_buffer_gb: float = 1.0) -> float

Formula:

total_gb = (n_params * dtype_bytes + kv_cache_bytes) / 1e9 + activation_buffer_gb

The three terms:

  1. Weights: n_params * dtype_bytes
  2. KV cache: precomputed as kv_cache_bytes
  3. Activation/runtime buffer: a fixed overhead (CUDA context, framework workspace, attention scratch).

Example: 7B params in fp16 (2 bytes) with 0 KV and 1 GB buffer -> (7e9 * 2 + 0) / 1e9 + 1.0 = 15.0 GB.
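A minimal sketch of the requested function, directly transcribing the formula above (the function name and parameters follow the given signature):

```python
def inference_memory_gb(n_params: int, dtype_bytes: int, kv_cache_bytes: int,
                        activation_buffer_gb: float = 1.0) -> float:
    """Total inference memory in GB, using the GB = 1e9 bytes convention."""
    # Weights (n_params * dtype_bytes) plus KV cache, converted from bytes
    # to GB, plus the fixed activation/runtime buffer already given in GB.
    return (n_params * dtype_bytes + kv_cache_bytes) / 1e9 + activation_buffer_gb

# The worked example: 7B params in fp16 (2 bytes), no KV cache, default 1 GB buffer.
print(inference_memory_gb(7_000_000_000, 2, 0))  # → 15.0
```

Note that only the weight and KV terms are divided by 1e9; the activation buffer is already expressed in GB, so adding it before dividing would be a bug.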

Math

Python (numpy)

Test Results

  1. 7B fp16, no KV
  2. 13B fp16 with 2 GB KV
  3. 70B int4 (Premium)