In a pipelined training step, gradient AllReduce can run concurrently with backward compute. Compute the per-step time with and without overlap.
Signature: def pipeline_step_time(compute_ms: float, comm_ms: float, overlap: bool) -> float
If overlap is True, the step takes max(compute_ms, comm_ms): whichever is the bottleneck. If overlap is False, the step takes compute_ms + comm_ms: strictly sequential.
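A minimal sketch of the function described by the signature above, directly encoding the two cases:

```python
def pipeline_step_time(compute_ms: float, comm_ms: float, overlap: bool) -> float:
    """Per-step time of a pipelined training step.

    With overlap, gradient AllReduce runs concurrently with backward
    compute, so the step takes as long as the slower of the two.
    Without overlap, the two phases run back to back.
    """
    if overlap:
        return max(compute_ms, comm_ms)
    return compute_ms + comm_ms
```

For instance, with 12 ms of compute and 8 ms of communication, overlap hides the communication entirely (12 ms per step), while the sequential case pays for both (20 ms per step).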