Implement a PD disaggregation router that assigns incoming LLM requests to separate prefill pools and decode pools based on request state.
Signature: def pd_router(requests, n_prefill_workers, n_decode_workers)
requests: list of dicts, each with:
    'id': unique request ID
    'phase': 'prefill' or 'decode'
    'prompt_tokens': number of prompt tokens
n_prefill_workers: number of prefill workers
n_decode_workers: number of decode workers
Returns a dict mapping worker names 'prefill_0', 'prefill_1', ..., 'prefill_{n-1}' and 'decode_0', 'decode_1', ..., 'decode_{n-1}' to lists of assigned request IDs.

prefill_workers = [f'prefill_{i}' for i in range(n_prefill_workers)]
decode_workers = [f'decode_{i}' for i in range(n_decode_workers)]
# Start every worker with an empty assignment list:
worker_assignment = {w: [] for w in prefill_workers + decode_workers}

# Assign requests round-robin within each pool:
prefill_requests = [r for r in requests if r['phase'] == 'prefill']
decode_requests = [r for r in requests if r['phase'] == 'decode']
for i, req in enumerate(prefill_requests):
    worker = prefill_workers[i % n_prefill_workers]
    worker_assignment[worker].append(req['id'])
for i, req in enumerate(decode_requests):
    worker = decode_workers[i % n_decode_workers]
    worker_assignment[worker].append(req['id'])
return worker_assignment
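Assembling the fragments above into one complete function, here is a runnable sketch with a small worked example (the request data and pool sizes are illustrative, not from the original):

```python
def pd_router(requests, n_prefill_workers, n_decode_workers):
    """Route requests round-robin to separate prefill and decode worker pools."""
    prefill_workers = [f'prefill_{i}' for i in range(n_prefill_workers)]
    decode_workers = [f'decode_{i}' for i in range(n_decode_workers)]
    worker_assignment = {w: [] for w in prefill_workers + decode_workers}

    # Split by phase, then assign round-robin within each pool.
    prefill_requests = [r for r in requests if r['phase'] == 'prefill']
    decode_requests = [r for r in requests if r['phase'] == 'decode']
    for i, req in enumerate(prefill_requests):
        worker_assignment[prefill_workers[i % n_prefill_workers]].append(req['id'])
    for i, req in enumerate(decode_requests):
        worker_assignment[decode_workers[i % n_decode_workers]].append(req['id'])
    return worker_assignment

# Hypothetical workload: three prefill requests over two prefill workers,
# one decode request over a single decode worker.
reqs = [
    {'id': 'a', 'phase': 'prefill', 'prompt_tokens': 128},
    {'id': 'b', 'phase': 'decode',  'prompt_tokens': 64},
    {'id': 'c', 'phase': 'prefill', 'prompt_tokens': 256},
    {'id': 'd', 'phase': 'prefill', 'prompt_tokens': 32},
]
print(pd_router(reqs, 2, 1))
# {'prefill_0': ['a', 'd'], 'prefill_1': ['c'], 'decode_0': ['b']}
```

Note that round-robin balances request counts, not load: a production router would typically weight prefill assignments by 'prompt_tokens', since prefill cost scales with prompt length.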
Prefill (prompt processing) is compute-bound — it benefits from tensor parallelism and high FLOPs. Decode (token generation) is memory-bandwidth bound — it benefits from batching and faster KV cache access. By separating them onto specialized hardware, systems like Mooncake, DistServe, and Splitwise achieve 2-3× better throughput.