Find the largest batch size that still meets a per-token latency SLA.
Signature: def optimal_batch_for_sla(latency_intercept_ms: float, latency_slope_ms: float, max_latency_ms: float, max_batch: int) -> int
Latency model: l(b) = latency_intercept_ms + latency_slope_ms * b. It is non-decreasing in b (a larger batch means slower per-token decode once b is large enough).
Return the maximum integer b in [1, max_batch] such that l(b) <= max_latency_ms. If no b satisfies the SLA, return 0. Use a linear search.
Example:
intercept=5, slope=0.5, max_latency=10ms, max_batch=32 → largest b with 5 + 0.5b <= 10 is b=10.
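A minimal sketch of the linear-search solution described above, relying on the stated assumption that l(b) is non-decreasing in b (so the scan can stop at the first batch size that violates the SLA):

```python
def optimal_batch_for_sla(latency_intercept_ms: float, latency_slope_ms: float,
                          max_latency_ms: float, max_batch: int) -> int:
    # Linearly scan batch sizes 1..max_batch, remembering the last one
    # whose modeled latency stays within the SLA.
    best = 0
    for b in range(1, max_batch + 1):
        if latency_intercept_ms + latency_slope_ms * b <= max_latency_ms:
            best = b
        else:
            # l(b) is non-decreasing in b, so every larger b also fails.
            break
    return best

# Worked example from the statement: 5 + 0.5*10 = 10 <= 10, so b=10.
print(optimal_batch_for_sla(5, 0.5, 10, 32))  # → 10
```

Because the model is monotone, a binary search over [1, max_batch] would also work, but the statement asks for a linear search, which is already O(max_batch).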