
Chunked Prefill

Implement chunked prefill: split a long prompt into fixed-size chunks so prefill tokens can be interleaved with decode tokens in the same batch.

Signature: def chunked_prefill(tokens, chunk_size)

  • tokens: list of token IDs (the full prompt)
  • chunk_size: maximum number of tokens per chunk
  • Returns: list of chunks, each a list of token IDs

Algorithm

Split tokens into contiguous chunks of at most chunk_size tokens:

def chunked_prefill(tokens, chunk_size):
    # Walk the prompt in strides of chunk_size; slicing past the end is safe.
    chunks = []
    for i in range(0, len(tokens), chunk_size):
        chunks.append(tokens[i : i + chunk_size])
    return chunks

The last chunk may be smaller than chunk_size.
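
For example, splitting a 7-token prompt into chunks of at most 3 leaves a final chunk of length 1:

tokens = list(range(7))            # [0, 1, 2, 3, 4, 5, 6]
print(chunked_prefill(tokens, 3))  # [[0, 1, 2], [3, 4, 5], [6]]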


Why Chunked Prefill?

In standard LLM serving:

  • Prefill: compute the KV cache for the whole prompt (all tokens processed in parallel; compute-bound)
  • Decode: generate one token at a time (memory-bandwidth bound)

Without chunking, a long prefill monopolizes the GPU and causes head-of-line blocking for decode requests, which spikes their latency. With chunked prefill, a 2048-token prompt becomes 8 chunks of 256 tokens each, and decode requests can run between chunks, which substantially reduces p99 latency.
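
A minimal sketch of how a scheduler could interleave the chunks with decode work. The names prefill_queue, decode_queue, serve_step, and the ("prefill"/"decode", payload) batch entries are illustrative assumptions only, not the actual Sarathi-Serve or vLLM scheduler API:

# Hypothetical scheduler step: names and structures are illustrative, not a real serving API.
def serve_step(prefill_queue, decode_queue, chunk_size):
    batch = []
    # Admit at most one prefill chunk per step instead of the whole prompt...
    if prefill_queue:
        prompt = prefill_queue[0]
        chunk, rest = prompt[:chunk_size], prompt[chunk_size:]
        batch.append(("prefill", chunk))
        if rest:
            prefill_queue[0] = rest      # remainder waits for later steps
        else:
            prefill_queue.pop(0)         # prompt fully prefilled
    # ...so every waiting decode request still gets its one token this step.
    for request_id in decode_queue:
        batch.append(("decode", request_id))
    return batch

prefills = [list(range(8))]
decodes = ["req-A", "req-B"]
print(serve_step(prefills, decodes, 3))
# [('prefill', [0, 1, 2]), ('decode', 'req-A'), ('decode', 'req-B')]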

This is the scheduling primitive used in Sarathi-Serve and vLLM v0.3+.

Test Results

  • 8 tokens, chunk_size=3 → [[0,1,2],[3,4,5],[6,7]]
  • tokens shorter than chunk_size → single chunk
  • chunk_size=2 → pairs
  • single token sequence (Premium)
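
The unlocked cases can be checked with a few asserts. The inputs below are illustrative values consistent with the descriptions above, not the site's exact test data, and the premium case is omitted:

assert chunked_prefill(list(range(8)), 3) == [[0, 1, 2], [3, 4, 5], [6, 7]]
assert chunked_prefill([42], 4) == [[42]]                    # shorter than chunk_size
assert chunked_prefill([1, 2, 3, 4], 2) == [[1, 2], [3, 4]]  # pairs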