
Recommend Parallelism Strategy

Given a model and a GPU fleet, return a basic recommendation for how to split the work: a tensor-parallel degree (TP), a data-parallel degree (DP), and a ZeRO stage.

Signature: def recommend_parallelism(n_params: int, n_gpus: int, gpu_vram_gb: int, training: bool) -> dict

Return a dict with keys 'tp', 'dp', and 'zero_stage' ('zero_stage' is 0 for inference).
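
For example, working through the heuristic below, serving a 7B-parameter model on 8x 24 GB GPUs needs only 7e9 * 2 bytes = 14 GB of fp16 weights per replica, which fits on a single GPU, so the expected answer would be {'tp': 1, 'dp': 8, 'zero_stage': 0}.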

Heuristic:

  1. Inference (training=False):
    • Compute the weight bytes assuming fp16 (2 bytes/param): weight_bytes = 2 * n_params.
    • Pick the smallest tp in {1, 2, 4, 8} that satisfies weight_bytes / tp <= gpu_vram_gb * 1e9. If none works, pick tp = 8.
    • tp must also divide n_gpus; if it does not, fall back to tp = min(8, n_gpus).
    • dp = n_gpus // tp.
    • zero_stage = 0.
  2. Training (training=True):
    • tp = min(8, n_gpus).
    • dp = n_gpus // tp.
    • With TP, the per-rank model state at 16 bytes/param (mixed precision) is 16 * n_params / tp. If that fits in gpu_vram_gb * 1e9 after leaving 8 GB of headroom for activations, use zero_stage = 2; otherwise use zero_stage = 3.

Return the dict.
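
A minimal sketch of this heuristic in plain Python (my own reading of the statement, not an official solution; in particular, the "8 GB headroom" is taken as 8e9 bytes to match the 1e9-bytes-per-GB convention used above):

    def recommend_parallelism(n_params: int, n_gpus: int, gpu_vram_gb: int, training: bool) -> dict:
        vram_bytes = gpu_vram_gb * 1e9

        if not training:
            # Inference: fp16 weights, 2 bytes per parameter.
            weight_bytes = 2 * n_params
            tp = 8  # fallback if no candidate fits
            for candidate in (1, 2, 4, 8):
                if weight_bytes / candidate <= vram_bytes:
                    tp = candidate
                    break
            # tp must also divide the fleet evenly; otherwise fall back.
            if n_gpus % tp != 0:
                tp = min(8, n_gpus)
            return {'tp': tp, 'dp': n_gpus // tp, 'zero_stage': 0}

        # Training: shard the model over up to 8 GPUs with TP.
        tp = min(8, n_gpus)
        dp = n_gpus // tp
        # Mixed-precision model state: 16 bytes/param, split across tp ranks.
        per_rank_state = 16 * n_params / tp
        headroom = 8e9  # reserve 8 GB for activations
        zero_stage = 2 if per_rank_state <= vram_bytes - headroom else 3
        return {'tp': tp, 'dp': dp, 'zero_stage': zero_stage}

Under this reading, "inference 70B on 8x 24GB" would give {'tp': 8, 'dp': 1, 'zero_stage': 0} (140 GB of fp16 weights needs all eight GPUs), and "training 7B on 16x 80GB" would give {'tp': 8, 'dp': 2, 'zero_stage': 2}, since 16 * 7e9 / 8 = 14 GB of per-rank model state fits easily within 80 GB minus the 8 GB headroom.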

Language: Python (numpy)

Test Results

  • inference 7B on 8x 24GB
  • inference 70B on 8x 24GB
  • training 7B on 16x 80GB
  • training 70B on 16x 40GB (Premium)