Implement a PD disaggregation router that assigns incoming LLM requests to separate prefill pools and decode pools based on request state.
Signature: def pd_router(requests, n_prefill_workers, n_decode_workers)
requests: list of dicts, each with:
    'id': unique request ID
    'phase': 'prefill' or 'decode'
    'prompt_tokens': number of prompt tokens
n_prefill_workers: number of prefill workers
n_decode_workers: number of decode workers
Returns a dict mapping worker names 'prefill_0', 'prefill_1', ..., 'prefill_{n-1}' and 'decode_0', 'decode_1', ..., 'decode_{n-1}' to lists of assigned request IDs.

prefill_workers = [f'prefill_{i}' for i in range(n_prefill_workers)]
decode_workers = [f'decode_{i}' for i in range(n_decode_workers)]
# Start every worker with an empty assignment list:
worker_assignment = {w: [] for w in prefill_workers + decode_workers}

# Assign requests round-robin within each pool:
prefill_requests = [r for r in requests if r['phase'] == 'prefill']
decode_requests = [r for r in requests if r['phase'] == 'decode']
for i, req in enumerate(prefill_requests):
    worker = prefill_workers[i % n_prefill_workers]
    worker_assignment[worker].append(req['id'])
for i, req in enumerate(decode_requests):
    worker = decode_workers[i % n_decode_workers]
    worker_assignment[worker].append(req['id'])
return worker_assignment
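Assembling the fragments above into one complete function, here is a runnable sketch with a small worked example (the request data and pool sizes are illustrative, not from the original):

```python
def pd_router(requests, n_prefill_workers, n_decode_workers):
    """Route requests round-robin to separate prefill and decode worker pools."""
    prefill_workers = [f'prefill_{i}' for i in range(n_prefill_workers)]
    decode_workers = [f'decode_{i}' for i in range(n_decode_workers)]
    worker_assignment = {w: [] for w in prefill_workers + decode_workers}

    # Split by phase, then assign round-robin within each pool.
    prefill_requests = [r for r in requests if r['phase'] == 'prefill']
    decode_requests = [r for r in requests if r['phase'] == 'decode']
    for i, req in enumerate(prefill_requests):
        worker_assignment[prefill_workers[i % n_prefill_workers]].append(req['id'])
    for i, req in enumerate(decode_requests):
        worker_assignment[decode_workers[i % n_decode_workers]].append(req['id'])
    return worker_assignment

# Hypothetical workload: three prefill requests over two prefill workers,
# one decode request over a single decode worker.
reqs = [
    {'id': 'a', 'phase': 'prefill', 'prompt_tokens': 128},
    {'id': 'b', 'phase': 'decode',  'prompt_tokens': 64},
    {'id': 'c', 'phase': 'prefill', 'prompt_tokens': 256},
    {'id': 'd', 'phase': 'prefill', 'prompt_tokens': 32},
]
print(pd_router(reqs, 2, 1))
# {'prefill_0': ['a', 'd'], 'prefill_1': ['c'], 'decode_0': ['b']}
```

Note that round-robin balances request counts, not load: a production router would typically weight prefill assignments by 'prompt_tokens', since prefill cost scales with prompt length.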
Prefill (prompt processing) is compute-bound — it benefits from tensor parallelism and high FLOPs. Decode (token generation) is memory-bandwidth bound — it benefits from batching and faster KV cache access. By separating them onto specialized hardware, systems like Mooncake, DistServe, and Splitwise achieve 2-3× better throughput.