
Kernel Launch Overhead Speedup (Easy)

Each CUDA kernel launch carries a few microseconds of fixed overhead. For chains of tiny pointwise ops, this overhead can dominate the actual GPU work. Fusing n small kernels into one pays the launch cost once instead of n times.
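
As a rough illustration (this sketch assumes a CUDA-capable GPU and PyTorch 2.x for torch.compile; absolute numbers vary by hardware and driver), you can observe launch overhead by timing an eager chain of tiny pointwise ops against a compiled version that fuses them:

    import torch

    def chain(x):
        # Ten tiny pointwise ops; run eagerly, each is a separate kernel launch.
        for _ in range(10):
            x = x.mul(1.0001).add(0.0001)
        return x

    fused = torch.compile(chain)  # fuses the pointwise chain into fewer kernels

    x = torch.randn(1024, device="cuda")
    fused(x)  # warm-up call triggers compilation

    iters = 1000
    for label, fn in [("eager", chain), ("fused", fused)]:
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        torch.cuda.synchronize()
        start.record()
        for _ in range(iters):
            fn(x)
        end.record()
        torch.cuda.synchronize()
        total_ms = start.elapsed_time(end)  # milliseconds across all iters
        print(f"{label}: {total_ms * 1000 / iters:.1f} us per call")

On a tensor this small the eager chain is dominated by launch overhead, so the fused version typically runs several times faster.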

Signature: def kernel_launch_breakeven(per_kernel_overhead_us: float, fused_kernel_us: float, n_ops: int) -> list

  • Unfused total time: per_kernel_overhead_us * n_ops (microseconds)
  • Fused total time: fused_kernel_us (microseconds)
  • Speedup = unfused / fused (or 0.0 if fused is 0)

Return [unfused_us, fused_us, speedup] (floats).
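
A minimal reference sketch of the function described above; it follows the three bullets directly:

    def kernel_launch_breakeven(per_kernel_overhead_us: float,
                                fused_kernel_us: float,
                                n_ops: int) -> list:
        # Unfused: pay the fixed launch overhead once per op.
        unfused_us = per_kernel_overhead_us * n_ops
        # Fused: a single kernel, a single launch.
        fused_us = fused_kernel_us
        # Speedup = unfused / fused, guarding the fused == 0 case.
        speedup = unfused_us / fused_us if fused_us != 0 else 0.0
        return [float(unfused_us), float(fused_us), float(speedup)]

For the first sample test (10 ops at 5us each, fused 8us) this returns [50.0, 8.0, 6.25]: the chain pays 50us in launch overhead alone, while the fused kernel finishes in 8us.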

Topic: Math
Language: Python (numpy)

Test Results

  • 10 ops, 5us each, fused 8us
  • no fusion benefit
  • big chain (Premium)