Each CUDA kernel launch carries a few microseconds of fixed overhead. For chains of tiny pointwise ops, this launch overhead can dominate the actual GPU work. Fusing n small kernels into one pays the launch cost once instead of n times.
Signature: def kernel_launch_breakeven(per_kernel_overhead_us: float, fused_kernel_us: float, n_ops: int) -> list
- Unfused time: per_kernel_overhead_us * n_ops (microseconds)
- Fused time: fused_kernel_us (microseconds)
- Return [unfused_us, fused_us, speedup] (floats)
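A minimal sketch of one possible solution, following the spec above (unfused cost modeled purely as launch overhead, speedup as the ratio of the two):

```python
def kernel_launch_breakeven(per_kernel_overhead_us: float,
                            fused_kernel_us: float,
                            n_ops: int) -> list:
    """Compare n separate kernel launches against one fused kernel.

    Returns [unfused_us, fused_us, speedup] as floats.
    """
    # n tiny kernels: pay the fixed launch overhead n times.
    unfused_us = per_kernel_overhead_us * n_ops
    # One fused kernel: its measured time already includes the single launch.
    fused_us = fused_kernel_us
    # Speedup from fusion is the ratio of total times.
    speedup = unfused_us / fused_us
    return [float(unfused_us), float(fused_us), float(speedup)]
```

For example, with 5 µs of launch overhead per kernel, 8 fused ops, and a 10 µs fused kernel, the unfused chain costs 40 µs for a 4x speedup.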