Implement chunked prefill: split a long prompt into fixed-size chunks so prefill tokens can be interleaved with decode tokens in the same batch.
Signature: def chunked_prefill(tokens, chunk_size)
tokens: list of token IDs (the full prompt)
chunk_size: maximum number of tokens per chunk

Split tokens into contiguous chunks of at most chunk_size tokens:
def chunked_prefill(tokens, chunk_size):
    """Split a prompt into contiguous chunks of at most chunk_size tokens."""
    chunks = []
    for i in range(0, len(tokens), chunk_size):
        # each slice is a contiguous window of the prompt; the final slice may be shorter
        chunks.append(tokens[i : i + chunk_size])
    return chunks
The last chunk may be smaller than chunk_size.
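For instance, a quick check of that boundary behavior (the values follow directly from the slicing above):

chunks = chunked_prefill(list(range(10)), 4)
# -> [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]; the final chunk holds the 2-token remainder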
In standard LLM serving:
Without chunking, a long prefill monopolizes the GPU, causing head-of-line blocking for decode requests and high decode latency. With chunked prefill, a 2048-token prompt with chunk_size=256 becomes 8 chunks of 256 tokens each, and decode requests can run between chunks, sharply reducing p99 latency.
This is the scheduling primitive used in Sarathi-Serve and vLLM v0.3+.
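To illustrate how the chunks become a scheduling primitive, here is a minimal sketch of one way to interleave them with decode work, assuming a fixed per-step token budget and the chunked_prefill function above. The names schedule_step, token_budget, and the placeholder decode requests are illustrative assumptions, not the actual vLLM or Sarathi-Serve API.

from collections import deque

def schedule_step(decode_requests, prefill_chunks, token_budget):
    """Pick the work for one forward pass: all pending decodes plus at most one prefill chunk."""
    decode_tokens = len(decode_requests)          # one new token per active decode request
    remaining = token_budget - decode_tokens
    chunk = None
    if prefill_chunks and len(prefill_chunks[0]) <= remaining:
        chunk = prefill_chunks.popleft()          # prefill fills the leftover budget
    return decode_requests, chunk

# Example: a 2048-token prompt chunked into 256-token pieces is drained one
# chunk per step while 32 decode requests keep making progress every step.
prompt = list(range(2048))
chunks = deque(chunked_prefill(prompt, 256))
decodes = [f"req{i}" for i in range(32)]
while chunks:
    batch_decodes, batch_chunk = schedule_step(decodes, chunks, token_budget=512)
    # ...run the model on batch_decodes plus batch_chunk here...

The key design choice in this sketch is that decode tokens are admitted first and prefill only consumes whatever budget is left, which is what prevents a long prompt from blocking in-flight generations.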