Softmax + Dropout Fusion

In an attention block, P = softmax(scores) is followed by P = dropout(P). Unfused, softmax writes P to HBM, then dropout reads P back and writes a masked copy. Fused, dropout is applied in the same pass that computes softmax, so the intermediate P never goes to HBM. That saves one write and one read of a full (N, D) tensor.

Signature: def softmax_dropout_fusion_bytes(N: int, D: int, dtype_bytes: int) -> list

Unfused: softmax read+write + dropout read+write = 4 * N * D * dtype_bytes
Fused: read scores once, write final output once = 2 * N * D * dtype_bytes
Savings: unfused - fused = 2 * N * D * dtype_bytes

Return [unfused_bytes, fused_bytes, savings_bytes] (all ints).
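
A minimal sketch of one possible solution, following the two formulas above. The helper name tensor_bytes is an illustrative choice, not part of the problem statement, and the model counts only the (N, D) tensor traffic as the formulas do (it assumes no extra bytes for a dropout mask or RNG state).

```python
def softmax_dropout_fusion_bytes(N: int, D: int, dtype_bytes: int) -> list:
    # Size in bytes of one full (N, D) tensor (illustrative helper variable).
    tensor_bytes = N * D * dtype_bytes

    # Unfused: softmax reads scores and writes P, then dropout reads P
    # and writes the masked copy, so four full passes over HBM.
    unfused_bytes = 4 * tensor_bytes

    # Fused: scores are read once and the final output is written once.
    fused_bytes = 2 * tensor_bytes

    # The difference is the intermediate P that never goes to HBM.
    savings_bytes = unfused_bytes - fused_bytes

    return [unfused_bytes, fused_bytes, savings_bytes]


if __name__ == "__main__":
    # Example: 1024 x 1024 scores in a 2-byte dtype such as fp16.
    print(softmax_dropout_fusion_bytes(1024, 1024, 2))
```

Under this model the fused kernel always moves half the bytes of the unfused pair, independent of N, D, or dtype.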
