PD Disaggregation (Prefill/Decode Separation)

Difficulty: Hard

Implement a PD disaggregation router that assigns incoming LLM requests to separate prefill and decode worker pools based on each request's phase.

Signature: def pd_router(requests, n_prefill_workers, n_decode_workers)

  • requests: list of dicts, each with:
    • 'id': unique request ID
    • 'phase': 'prefill' or 'decode'
    • 'prompt_tokens': number of prompt tokens
  • n_prefill_workers: number of prefill workers
  • n_decode_workers: number of decode workers
  • Returns: dict mapping worker IDs to lists of request IDs
    • 'prefill_0', 'prefill_1', ..., 'prefill_{n-1}'
    • 'decode_0', 'decode_1', ..., 'decode_{n-1}'
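
For concreteness, here is a hypothetical call showing the expected shapes. The request values are invented for illustration, and it assumes (as the Returns description above implies) that idle workers still appear as keys with empty lists:

# Hypothetical input; field values are made up:
requests = [{'id': 'a', 'phase': 'prefill', 'prompt_tokens': 128}]

# Assumed result shape: every worker is a key, idle workers map to []:
pd_router(requests, n_prefill_workers=1, n_decode_workers=1)
# → {'prefill_0': ['a'], 'decode_0': []}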

Routing Strategy: Round-Robin within Pool

# Build the two worker pools:
prefill_workers = [f'prefill_{i}' for i in range(n_prefill_workers)]
decode_workers  = [f'decode_{i}'  for i in range(n_decode_workers)]

# Every worker gets an entry, even if it ends up with no requests:
worker_assignment = {w: [] for w in prefill_workers + decode_workers}

# Split requests by phase, preserving arrival order:
prefill_requests = [r for r in requests if r['phase'] == 'prefill']
decode_requests  = [r for r in requests if r['phase'] == 'decode']

# Assign requests round-robin within each pool:
for i, req in enumerate(prefill_requests):
    worker = prefill_workers[i % n_prefill_workers]
    worker_assignment[worker].append(req['id'])

for i, req in enumerate(decode_requests):
    worker = decode_workers[i % n_decode_workers]
    worker_assignment[worker].append(req['id'])
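
Putting the pieces together, a minimal sketch of the full routine might look like the following. This is one possible implementation of the strategy above, not a graded reference solution, and the example request data is made up since the actual test inputs are not shown:

def pd_router(requests, n_prefill_workers, n_decode_workers):
    """Route each request to its phase's pool, round-robin within the pool."""
    pools = {
        'prefill': [f'prefill_{i}' for i in range(n_prefill_workers)],
        'decode':  [f'decode_{i}'  for i in range(n_decode_workers)],
    }
    assignment = {w: [] for workers in pools.values() for w in workers}
    for phase, workers in pools.items():
        # Requests keep their arrival order within each phase:
        for i, req in enumerate(r for r in requests if r['phase'] == phase):
            assignment[workers[i % len(workers)]].append(req['id'])
    return assignment

# Mirrors the first test case (4 mixed requests, 2 prefill / 1 decode);
# 'prompt_tokens' values are invented for illustration:
reqs = [
    {'id': 0, 'phase': 'prefill', 'prompt_tokens': 512},
    {'id': 1, 'phase': 'decode',  'prompt_tokens': 256},
    {'id': 2, 'phase': 'prefill', 'prompt_tokens': 64},
    {'id': 3, 'phase': 'decode',  'prompt_tokens': 128},
]
print(pd_router(reqs, 2, 1))
# {'prefill_0': [0], 'prefill_1': [2], 'decode_0': [1, 3]}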

Why PD Disaggregation?

Prefill (prompt processing) is compute-bound — it benefits from tensor parallelism and high FLOPs. Decode (token generation) is memory-bandwidth bound — it benefits from batching and faster KV cache access. By separating them onto specialized hardware, systems like Mooncake, DistServe, and Splitwise achieve 2-3× better throughput.
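
A back-of-envelope arithmetic-intensity estimate makes the contrast concrete. All numbers below are illustrative assumptions (a 7B-parameter fp16 model and a 2048-token prompt), not measurements from the systems named above:

# Rough arithmetic intensity (FLOPs per byte of weight traffic) per phase.
# Assumptions: 7B params, fp16 weights, 2048-token prompt, and the standard
# ~2 FLOPs per parameter per token estimate for a transformer forward pass.
P = 7e9           # parameter count (assumed)
BYTES = 2         # bytes per fp16 parameter
T = 2048          # prompt tokens processed in one prefill pass (assumed)

flops_per_token = 2 * P
weight_bytes = P * BYTES            # weights stream from HBM once per pass

prefill_intensity = flops_per_token * T / weight_bytes   # ~2048 FLOPs/byte
decode_intensity  = flops_per_token * 1 / weight_bytes   # ~1 FLOP/byte
# (Decode also re-reads the KV cache each step, making it even more
# bandwidth-bound than this estimate suggests.)
print(prefill_intensity, decode_intensity)

Since modern accelerators need on the order of a hundred or more FLOPs per byte of memory traffic to stay compute-bound, prefill easily saturates compute while decode sits far below that roofline, which is exactly the asymmetry PD disaggregation exploits.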

Language: Python (numpy)

Test Results

• 4 requests mixed, 2 prefill workers, 1 decode worker
• all prefill requests, 2 workers → round-robin
• single request, single worker of each type