Find the largest batch size that still meets a per-token latency SLA.
Signature: def optimal_batch_for_sla(latency_intercept_ms: float, latency_slope_ms: float, max_latency_ms: float, max_batch: int) -> int
Latency model: l(b) = latency_intercept_ms + latency_slope_ms * b. It is non-decreasing in b (a larger batch means slower per-token decode once b is large enough).
Return the maximum integer b in [1, max_batch] such that l(b) <= max_latency_ms. If no b satisfies the SLA, return 0. Use a linear search.
Example:
intercept=5, slope=0.5, max_latency=10ms, max_batch=32 → largest b with 5 + 0.5b <= 10 is b=10.
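A minimal sketch of the linear-search solution described above, relying on the stated assumption that l(b) is non-decreasing in b (so the scan can stop at the first batch size that violates the SLA):

```python
def optimal_batch_for_sla(latency_intercept_ms: float, latency_slope_ms: float,
                          max_latency_ms: float, max_batch: int) -> int:
    # Linearly scan batch sizes 1..max_batch, remembering the last one
    # whose modeled latency stays within the SLA.
    best = 0
    for b in range(1, max_batch + 1):
        if latency_intercept_ms + latency_slope_ms * b <= max_latency_ms:
            best = b
        else:
            # l(b) is non-decreasing in b, so every larger b also fails.
            break
    return best

# Worked example from the statement: 5 + 0.5*10 = 10 <= 10, so b=10.
print(optimal_batch_for_sla(5, 0.5, 10, 32))  # → 10
```

Because the model is monotone, a binary search over [1, max_batch] would also work, but the statement asks for a linear search, which is already O(max_batch).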