Implement a continuous batching scheduler that packs variable-length requests into batches without exceeding a token budget.
This is the core scheduling algorithm behind high-throughput LLM serving systems like vLLM and TGI. Unlike static batching (fixed batch size), continuous batching packs as many requests as possible into each batch based on their token lengths.
Signature: def continuous_batching_scheduler(requests, max_batch_tokens)
requests: list of dicts, each with:
'id': unique request identifier (int)'tokens': number of tokens for this request (int)max_batch_tokens: maximum total tokens allowed per batch (int)Process requests in order. For each request:
current_tokens + req.tokens ≤ max_batch_tokens), add it.Important: A request that is too large to share (its tokens > max_batch_tokens) still gets its own batch — add it alone even if it exceeds the limit.
batch = []
budget = 0
for req in requests:
if budget + req.tokens > max_batch_tokens and batch is not empty:
yield batch # flush
batch = [], budget = 0
batch.append(req.id)
budget += req.tokens
if batch: yield batch # flush remainder
Static batching waits for a full fixed-size batch, wasting GPU cycles when requests vary in length. Continuous batching fills the GPU's compute budget dynamically — a 100-token request and a 200-token request fit together in a 300-token budget, while a 400-token request gets its own batch. Real systems (vLLM, TGI) use more sophisticated variants with preemption and priority queues, but greedy first-fit is the foundation.
Asked at
import numpy as np
def continuous_batching_scheduler(...):
pass
Premium problem
Free accounts include problems #1–20. Upgrade to unlock the editor, hidden test cases, and reference solutions for every problem.
Already premium?