TorchedUp · Problems

Total Inference Memory (Medium)

Given the three main components of inference memory, return the total in gigabytes (using the GB = 1e9 bytes convention).

Signature: def inference_memory_gb(n_params: int, dtype_bytes: int, kv_cache_bytes: int, activation_buffer_gb: float = 1.0) -> float

Formula:

total_gb = (n_params * dtype_bytes + kv_cache_bytes) / 1e9 + activation_buffer_gb

The three terms:

  1. Weights: n_params * dtype_bytes
  2. KV cache: precomputed as kv_cache_bytes
  3. Activation/runtime buffer: a fixed overhead (CUDA context, framework workspace, attention scratch).

Example: 7B params in fp16 (2 bytes) with 0 KV and 1 GB buffer -> (7e9 * 2 + 0) / 1e9 + 1.0 = 15.0 GB.
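A minimal sketch of the requested function, directly transcribing the formula above (the function name and parameters follow the given signature):

```python
def inference_memory_gb(n_params: int, dtype_bytes: int, kv_cache_bytes: int,
                        activation_buffer_gb: float = 1.0) -> float:
    """Total inference memory in GB, using the GB = 1e9 bytes convention."""
    # Weights (n_params * dtype_bytes) plus KV cache, converted from bytes
    # to GB, plus the fixed activation/runtime buffer already given in GB.
    return (n_params * dtype_bytes + kv_cache_bytes) / 1e9 + activation_buffer_gb

# The worked example: 7B params in fp16 (2 bytes), no KV cache, default 1 GB buffer.
print(inference_memory_gb(7_000_000_000, 2, 0))  # → 15.0
```

Note that only the weight and KV terms are divided by 1e9; the activation buffer is already expressed in GB, so adding it before dividing would be a bug.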

Math

Python (numpy)

Test Results

  1. 7B fp16, no KV
  2. 13B fp16 with 2 GB KV
  3. 70B int4 (Premium)