
Chunked Prefill

Implement chunked prefill: split a long prompt into fixed-size chunks so prefill tokens can be interleaved with decode tokens in the same batch.

Signature: def chunked_prefill(tokens, chunk_size)

  • tokens: list of token IDs (the full prompt)
  • chunk_size: maximum number of tokens per chunk
  • Returns: list of chunks, each a list of token IDs

Algorithm

Split tokens into contiguous chunks of at most chunk_size tokens:

def chunked_prefill(tokens, chunk_size):
    # Walk the prompt in strides of chunk_size; slicing past the end is safe.
    chunks = []
    for i in range(0, len(tokens), chunk_size):
        chunks.append(tokens[i : i + chunk_size])
    return chunks

The last chunk may be smaller than chunk_size.
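
For example, splitting a 7-token prompt into chunks of at most 3 leaves a final chunk of length 1:

tokens = list(range(7))            # [0, 1, 2, 3, 4, 5, 6]
print(chunked_prefill(tokens, 3))  # [[0, 1, 2], [3, 4, 5], [6]]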


Why Chunked Prefill?

In standard LLM serving:

  • Prefill: compute the KV cache for the whole prompt (all tokens processed in parallel; compute-bound)
  • Decode: generate one token at a time (memory-bandwidth bound)

Without chunking, a long prefill monopolizes the GPU and causes head-of-line blocking for decode requests, which spikes their latency. With chunked prefill, a 2048-token prompt becomes 8 chunks of 256 tokens each, and decode requests can run between chunks, which substantially reduces p99 latency.
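
A minimal sketch of how a scheduler could interleave the chunks with decode work. The names prefill_queue, decode_queue, serve_step, and the ("prefill"/"decode", payload) batch entries are illustrative assumptions only, not the actual Sarathi-Serve or vLLM scheduler API:

# Hypothetical scheduler step: names and structures are illustrative, not a real serving API.
def serve_step(prefill_queue, decode_queue, chunk_size):
    batch = []
    # Admit at most one prefill chunk per step instead of the whole prompt...
    if prefill_queue:
        prompt = prefill_queue[0]
        chunk, rest = prompt[:chunk_size], prompt[chunk_size:]
        batch.append(("prefill", chunk))
        if rest:
            prefill_queue[0] = rest      # remainder waits for later steps
        else:
            prefill_queue.pop(0)         # prompt fully prefilled
    # ...so every waiting decode request still gets its one token this step.
    for request_id in decode_queue:
        batch.append(("decode", request_id))
    return batch

prefills = [list(range(8))]
decodes = ["req-A", "req-B"]
print(serve_step(prefills, decodes, 3))
# [('prefill', [0, 1, 2]), ('decode', 'req-A'), ('decode', 'req-B')]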

This is the scheduling primitive used in Sarathi-Serve and vLLM v0.3+.

Test Results

  • 8 tokens, chunk_size=3 → [[0,1,2],[3,4,5],[6,7]]
  • tokens shorter than chunk_size → single chunk
  • chunk_size=2 → pairs
  • single token sequence (Premium)
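
The unlocked cases can be checked with a few asserts. The inputs below are illustrative values consistent with the descriptions above, not the site's exact test data, and the premium case is omitted:

assert chunked_prefill(list(range(8)), 3) == [[0, 1, 2], [3, 4, 5], [6, 7]]
assert chunked_prefill([42], 4) == [[42]]                    # shorter than chunk_size
assert chunked_prefill([1, 2, 3, 4], 2) == [[1, 2], [3, 4]]  # pairs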