In an attention block, P = softmax(scores) is followed by P = dropout(P). Unfused, dropout reads P from HBM and writes a masked copy back. Fused, dropout is applied in the same pass that writes softmax's output, saving one full (N, D) read and one full (N, D) write of HBM traffic.
Signature: def softmax_dropout_fusion_bytes(N: int, D: int, dtype_bytes: int) -> list
unfused_bytes = 4 * N * D * dtype_bytes
fused_bytes = 2 * N * D * dtype_bytes
savings_bytes = unfused_bytes - fused_bytes
Return [unfused_bytes, fused_bytes, savings_bytes] (all ints).
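A minimal sketch, following the formulas above. The function name and signature come from the problem statement; the pass-by-pass accounting in the docstring and the example values below are illustrative assumptions, not part of the original spec.

```python
def softmax_dropout_fusion_bytes(N: int, D: int, dtype_bytes: int) -> list:
    """HBM traffic for the (N, D) probability matrix, in bytes.

    Assumed accounting: unfused, softmax reads scores and writes P, then
    dropout reads P and writes the masked copy -> 4 passes over the tensor.
    Fused, dropout is applied before P leaves on-chip memory, so only
    softmax's read and write remain -> 2 passes.
    """
    unfused_bytes = 4 * N * D * dtype_bytes
    fused_bytes = 2 * N * D * dtype_bytes
    savings_bytes = unfused_bytes - fused_bytes
    return [unfused_bytes, fused_bytes, savings_bytes]


# Illustrative example: a 4096 x 4096 attention matrix in fp16 (2 bytes).
# Unfused moves 128 MiB, fused moves 64 MiB, saving 64 MiB per call.
print(softmax_dropout_fusion_bytes(4096, 4096, 2))
# [134217728, 67108864, 67108864]
```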