In a pipelined training step, gradient AllReduce can run concurrently with backward compute. Compute the per-step time with and without overlap.
Signature: def pipeline_step_time(compute_ms: float, comm_ms: float, overlap: bool) -> float
If overlap is True, the step takes max(compute_ms, comm_ms): whichever is the bottleneck. If overlap is False, the step takes compute_ms + comm_ms: strictly sequential.
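A minimal sketch of the function described by the signature above, directly encoding the two cases:

```python
def pipeline_step_time(compute_ms: float, comm_ms: float, overlap: bool) -> float:
    """Per-step time of a pipelined training step.

    With overlap, gradient AllReduce runs concurrently with backward
    compute, so the step takes as long as the slower of the two.
    Without overlap, the two phases run back to back.
    """
    if overlap:
        return max(compute_ms, comm_ms)
    return compute_ms + comm_ms
```

For instance, with 12 ms of compute and 8 ms of communication, overlap hides the communication entirely (12 ms per step), while the sequential case pays for both (20 ms per step).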