TorchedUp
LearnBetaProblemsSystem DesignSoonPremium
TorchedUp
LearnBetaProblemsSystem DesignSoonPremium
←

120. Recommend Parallelism Strategy

Hard

Given a model and a fleet, return a basic recommendation for how to split the work: tensor parallel (TP), data parallel (DP), and ZeRO stage.

Signature: def recommend_parallelism(n_params: int, n_gpus: int, gpu_vram_gb: int, training: bool) -> dict

Return a dict with keys 'tp', 'dp', 'zero_stage' (the latter is 0 for inference).

Heuristic:

  1. Inference (training=False):

    • Compute weight bytes assuming fp16 (2 bytes/param).
    • Pick the smallest tp in {1, 2, 4, 8} that satisfies weight_bytes / tp <= gpu_vram_gb * 1e9. If none works, pick tp = 8.
    • tp must also divide n_gpus; if not, fall back to tp = min(8, n_gpus).
    • dp = n_gpus // tp.
    • zero_stage = 0.
  2. Training (training=True):

    • tp = min(8, n_gpus).
    • dp = n_gpus // tp.
    • With TP, the per-rank model state at 16 bytes/param mixed-precision is 16 * N / tp. If that fits in gpu_vram_gb * 1e9 after leaving 8 GB headroom for activations, use zero_stage = 2; otherwise zero_stage = 3.

Return the dict.

Math

tp⋅dp=W,stage∈{0,2,3}

Asked at

NumPy

import numpy as np

 

def recommend_parallelism(...):

    pass

🔒

Premium problem

Free accounts include problems #1–20. Upgrade to unlock the editor, hidden test cases, and reference solutions for every problem.

Upgrade to PremiumBack to problems

Already premium?