Flash Attention (Tiled)

Standard attention computes the full N×N attention matrix, requiring O(N²) memory. Flash Attention rewrites the computation using tiling (processing blocks of queries against blocks of keys/values) and maintains running softmax statistics, so it produces the same output, up to floating-point rounding, with memory that grows as O(N) rather than O(N²).

Signature: def flash_attention(Q: np.ndarray, K: np.ndarray, V: np.ndarray, block_size: int = 2) -> np.ndarray

You must NOT materialize the full N×N attention matrix. Instead, iterate over blocks of the key/value sequence and incrementally accumulate the attention output, using an online softmax that tracks the running per-query max and normalizing constant. The math reference summarises the update; the algorithmic details are up to you.
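
To make the "running per-query max and normalizing constant" concrete, here is a minimal sketch of an online softmax over a single row of scores, consumed one block at a time. The helper name `online_softmax_stats` and the sample scores are illustrative assumptions, not part of the problem statement.

```python
import numpy as np

def online_softmax_stats(scores, block_size=2):
    """Return (max, normalizer) of `scores` while seeing one block at a time.

    Hypothetical helper for illustration only.
    """
    m = -np.inf  # running max over the blocks seen so far
    l = 0.0      # running sum of exp(score - m) over the blocks seen so far
    for start in range(0, len(scores), block_size):
        block = scores[start:start + block_size]
        m_new = max(m, float(block.max()))
        # Rescale the old sum to the new max, then add this block's terms.
        l = l * np.exp(m - m_new) + float(np.exp(block - m_new).sum())
        m = m_new
    return m, l

scores = np.array([1.0, 3.0, 2.0, 5.0, 4.0])
m, l = online_softmax_stats(scores)

# The softmax weights can then be recovered block by block as exp(s - m) / l.
weights = np.exp(scores - m) / l
```

The rescaling factor exp(m - m_new) is what lets earlier blocks be corrected after a larger score shows up later, without revisiting them.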

Math

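Online softmax update, for query row i and the keys j in the current block:

S_{ij} = \frac{Q_i K_j^T}{d}

m_i^{new} = \max(m_i^{prev}, \max_j S_{ij})

P_{ij} = \exp(S_{ij} - m_i^{new})

O_i \leftarrow \frac{O_i \cdot e^{m_i^{prev} - m_i^{new}} \cdot l_i + \sum_j P_{ij} V_j}{e^{m_i^{prev} - m_i^{new}} \cdot l_i + \sum_j P_{ij}}

Here l_i is the running normalizing constant for query i. After each block, l_i is set to the denominator of the O_i update and m_i^{prev} is set to m_i^{new}.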
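
The update above can be sketched in NumPy as follows. This is one possible arrangement, not a reference solution: it tiles both queries and keys/values, so the largest score tile held at any time is block_size × block_size. The `scale` argument is a hypothetical addition to the stated signature. The math reference writes the score denominator as d, while conventional scaled dot-product attention uses sqrt(d); the sketch defaults to 1/sqrt(d), and passing `scale=1.0 / d` follows the reference as written.

```python
import numpy as np

def flash_attention(Q, K, V, block_size=2, scale=None):
    """Tiled attention with an online softmax (illustrative sketch)."""
    n_q, d = Q.shape
    n_k = K.shape[0]
    if scale is None:
        scale = 1.0 / np.sqrt(d)  # assumption: conventional sqrt(d) scaling
    out = np.zeros((n_q, V.shape[1]), dtype=np.float64)

    for qs in range(0, n_q, block_size):
        Q_blk = Q[qs:qs + block_size]
        rows = Q_blk.shape[0]
        O = np.zeros((rows, V.shape[1]))  # running (already normalized) output
        m = np.full(rows, -np.inf)        # running per-query max
        l = np.zeros(rows)                # running per-query normalizer

        for ks in range(0, n_k, block_size):
            K_blk = K[ks:ks + block_size]
            V_blk = V[ks:ks + block_size]

            S = (Q_blk @ K_blk.T) * scale       # scores for this tile only
            m_new = np.maximum(m, S.max(axis=1))
            P = np.exp(S - m_new[:, None])
            alpha = np.exp(m - m_new)           # rescales the old statistics

            l_new = alpha * l + P.sum(axis=1)   # denominator of the O update
            O = (O * (alpha * l)[:, None] + P @ V_blk) / l_new[:, None]
            m, l = m_new, l_new

        out[qs:qs + rows] = O
    return out

# Small usage example with arbitrary random inputs (N=5 is deliberately not a
# multiple of block_size, so the last tile is partial).
rng = np.random.default_rng(0)
Q = rng.standard_normal((5, 4))
K = rng.standard_normal((5, 4))
V = rng.standard_normal((5, 3))
out = flash_attention(Q, K, V, block_size=2)
```

On the first key block, m is -inf and l is 0, so alpha evaluates to 0 and the update reduces to an ordinary softmax over that block. Keeping O normalized after every block matches the form of the update in the math reference; an alternative is to accumulate the unnormalized numerator and divide by l once at the end.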
