In an attention block, P = softmax(scores) is followed by P = dropout(P). Both tensors are shape (N, D). Unfused, the softmax kernel reads scores and writes P to HBM, then dropout reads P back and writes the masked copy — two full HBM round-trips. Fused, dropout is applied in the same pass that wrote softmax's output, so we read scores once and write the final P once.
Signature: def softmax_dropout_fusion_bytes(N: int, D: int, dtype_bytes: int) -> list
Count DRAM traffic in elements (multiply by dtype_bytes for bytes) for both versions and return [unfused_bytes, fused_bytes, savings_bytes] (all ints).
Math
Asked at
import numpy as np
def softmax_dropout_fusion_bytes(...):
pass
Premium problem
Free accounts include problems #1–20. Upgrade to unlock the editor, hidden test cases, and reference solutions for every problem.
Already premium?