Explanation
The key idea that's easy to miss: recomputation is a feature, not a cost
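Counterintuitively, FlashAttention does not store the N×N attention matrix for the backward pass. It keeps only the output and the per-row softmax normalization statistics, then recomputes each block of attention scores from Q, K, and V when the gradients are needed. That trades some extra FLOPs for far fewer slow HBM accesses, and since attention is memory-bound, the trade is a net win. Once that clicks, the whole design makes sense.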
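To make the recomputation idea concrete, here is a minimal NumPy sketch. It assumes single-head attention with no masking or dropout, and the function names (`forward`, `backward_recompute`, `backward_stored`) are illustrative, not FlashAttention's actual API. The real kernel also tiles the queries and keeps each tile in on-chip SRAM; this sketch loops only over key/value blocks and shows the bookkeeping, not the memory behaviour.

```python
import numpy as np

def forward(Q, K, V):
    """Attention forward that keeps only O and the per-row log-sum-exp."""
    S = Q @ K.T / np.sqrt(Q.shape[-1])
    m = S.max(axis=-1, keepdims=True)
    lse = m + np.log(np.exp(S - m).sum(axis=-1, keepdims=True))
    O = np.exp(S - lse) @ V
    return O, lse  # the N x N probabilities are discarded; lse is N x 1

def backward_recompute(Q, K, V, O, lse, dO, block=2):
    """Backward pass that rebuilds the probabilities one key block at a time."""
    scale = 1.0 / np.sqrt(Q.shape[-1])
    dQ, dK, dV = np.zeros_like(Q), np.zeros_like(K), np.zeros_like(V)
    # rowsum(dO * O) equals rowsum(P * dP), so the full P is never needed.
    D = (dO * O).sum(axis=-1, keepdims=True)
    for start in range(0, K.shape[0], block):
        sl = slice(start, start + block)
        Pj = np.exp(Q @ K[sl].T * scale - lse)  # recomputed, not loaded
        dV[sl] = Pj.T @ dO
        dSj = Pj * (dO @ V[sl].T - D)           # softmax backward for this block
        dQ += dSj @ K[sl] * scale
        dK[sl] = dSj.T @ Q * scale
    return dQ, dK, dV

def backward_stored(Q, K, V, dO):
    """Reference backward that materializes the full N x N matrix."""
    scale = 1.0 / np.sqrt(Q.shape[-1])
    S = Q @ K.T * scale
    P = np.exp(S - S.max(axis=-1, keepdims=True))
    P /= P.sum(axis=-1, keepdims=True)
    dP = dO @ V.T
    dS = P * (dP - (P * dP).sum(axis=-1, keepdims=True))
    return dS @ K * scale, dS.T @ Q * scale, P.T @ dO

rng = np.random.default_rng(0)
Q, K, V, dO = (rng.standard_normal((6, 4)) for _ in range(4))
O, lse = forward(Q, K, V)
recomputed = backward_recompute(Q, K, V, O, lse, dO, block=4)
reference = backward_stored(Q, K, V, dO)
print(all(np.allclose(a, b) for a, b in zip(recomputed, reference)))
```

Both backward passes should agree to floating-point tolerance. The difference is what each one has to remember: the recomputing version carries only `O` and `lse` from the forward pass, while the reference version needs the full probability matrix.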