Continuous Batching Scheduler

Implement a continuous batching scheduler that packs variable-length requests into batches without exceeding a token budget.

This is the core scheduling algorithm behind high-throughput LLM serving systems like vLLM and TGI. Unlike static batching (fixed batch size), continuous batching packs as many requests as possible into each batch based on their token lengths.

Signature: def continuous_batching_scheduler(requests, max_batch_tokens)

requests: list of dicts, each with:
- 'id': unique request identifier (int)
- 'tokens': number of tokens for this request (int)
max_batch_tokens: maximum total tokens allowed per batch (int)
Returns: list of batches, each batch is a list of request ids

Algorithm: Greedy First-Fit

Process requests in order. For each request:

If it fits in the current batch (current_tokens + req.tokens ≤ max_batch_tokens), add it.
Otherwise, flush the current batch (save it), start a new batch with this request.
After all requests, flush any remaining batch.

Important: A request that is too large to share (its tokens > max_batch_tokens) still gets its own batch — add it alone even if it exceeds the limit.

batch = []
budget = 0
for req in requests:
    if budget + req.tokens > max_batch_tokens and batch is not empty:
        yield batch  # flush
        batch = [], budget = 0
    batch.append(req.id)
    budget += req.tokens
if batch: yield batch  # flush remainder

Why This Matters

Static batching waits for a full fixed-size batch, wasting GPU cycles when requests vary in length. Continuous batching fills the GPU's compute budget dynamically — a 100-token request and a 200-token request fit together in a 300-token budget, while a 400-token request gets its own batch. Real systems (vLLM, TGI) use more sophisticated variants with preemption and priority queues, but greedy first-fit is the foundation.

Asked at

batch = [] budget = 0 for req in requests: if budget + req.tokens > max_batch_tokens and batch is not empty: yield batch # flush batch = [], budget = 0 batch.append(req.id) budget += req.tokens if batch: yield batch # flush remainder

34. Continuous Batching Scheduler

Algorithm: Greedy First-Fit

Why This Matters

34. Continuous Batching Scheduler

Algorithm: Greedy First-Fit

Why This Matters