Each CUDA kernel launch carries a few microseconds of fixed overhead. For chains of tiny pointwise ops, this launch overhead can dominate the actual GPU work. Fusing n small kernels into one pays the launch cost once instead of n times.
Signature: def kernel_launch_breakeven(per_kernel_overhead_us: float, fused_kernel_us: float, n_ops: int) -> list
- Unfused time: per_kernel_overhead_us * n_ops (microseconds)
- Fused time: fused_kernel_us (microseconds)
- Return [unfused_us, fused_us, speedup] (floats)
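A minimal sketch of one possible solution, following the spec above (unfused cost modeled purely as launch overhead, speedup as the ratio of the two):

```python
def kernel_launch_breakeven(per_kernel_overhead_us: float,
                            fused_kernel_us: float,
                            n_ops: int) -> list:
    """Compare n separate kernel launches against one fused kernel.

    Returns [unfused_us, fused_us, speedup] as floats.
    """
    # n tiny kernels: pay the fixed launch overhead n times.
    unfused_us = per_kernel_overhead_us * n_ops
    # One fused kernel: its measured time already includes the single launch.
    fused_us = fused_kernel_us
    # Speedup from fusion is the ratio of total times.
    speedup = unfused_us / fused_us
    return [float(unfused_us), float(fused_us), float(speedup)]
```

For example, with 5 µs of launch overhead per kernel, 8 fused ops, and a 10 µs fused kernel, the unfused chain costs 40 µs for a 4x speedup.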