Given the wall-clock time the GPU spends per step and the time it takes one dataloader worker to produce one batch, compute the milliseconds the GPU is idle per step waiting on data.
Signature: def compute_throughput_gap(gpu_step_ms: float, dataloader_batch_ms: float, num_workers: int) -> float
With num_workers workers running in parallel, the effective per-batch dataloader time is effective = dataloader_batch_ms / num_workers. The idle time per step is max(0, effective - gpu_step_ms), so the result is 0 whenever the dataloader keeps up with the GPU.
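The formula above can be sketched in Python as follows. This is a minimal sketch rather than a reference solution. The check that num_workers is positive is an assumption added here to avoid division by zero, since the problem statement does not say how invalid input should be handled.

```python
def compute_throughput_gap(gpu_step_ms: float, dataloader_batch_ms: float, num_workers: int) -> float:
    # Assumption (not in the problem statement): reject non-positive worker counts
    # so the division below is always defined.
    if num_workers <= 0:
        raise ValueError("num_workers must be a positive integer")

    # Effective time to produce one batch when the workers run in parallel.
    effective_batch_ms = dataloader_batch_ms / num_workers

    # The GPU idles only when data arrives more slowly than it is consumed.
    return max(0.0, effective_batch_ms - gpu_step_ms)


# 480 ms per batch across 4 workers gives 120 ms effective, 20 ms slower than the GPU step.
print(compute_throughput_gap(100.0, 480.0, 4))  # 20.0
# 200 ms per batch across 4 workers gives 50 ms effective, so the GPU never waits.
print(compute_throughput_gap(100.0, 200.0, 4))  # 0.0
```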