Question
Is the novelty algorithmic, hardware-aware, or both?
The attention output is unchanged (it's exact, not an approximation), so the contribution is really about the memory hierarchy: tiling + recomputation to avoid HBM round-trips. Is it fair to call this an 'algorithm' paper, or is the right framing 'systems / IO-aware implementation'? Curious how people categorize it.
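
For anyone who wants something concrete to argue over, here is a minimal pure-Python sketch of the tiling half of the idea, for a single query vector. It assumes the usual running-max / running-sum rescaling (often called online softmax) as the way blocks get combined. The function names, `block_size`, and the toy layout are all made up for illustration; this is not the paper's kernel, and it says nothing about where data actually lives on a GPU.

```python
import math


def naive_attention(q, K, V):
    """Reference version: compute every score first, then softmax, then mix V."""
    scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in K]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    d_v = len(V[0])
    return [sum(w * v[j] for w, v in zip(weights, V)) / total for j in range(d_v)]


def tiled_attention(q, K, V, block_size=2):
    """Same output, but K and V are visited one block at a time.

    Between blocks we keep only a running max, a running normalizer,
    and an unnormalized output accumulator (illustrative names).
    """
    running_max = -math.inf
    running_sum = 0.0
    acc = [0.0] * len(V[0])
    for start in range(0, len(K), block_size):
        k_blk = K[start:start + block_size]
        v_blk = V[start:start + block_size]
        scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in k_blk]
        new_max = max(running_max, max(scores))
        # Rescale everything accumulated under the old max to the new max.
        scale = math.exp(running_max - new_max)
        running_sum *= scale
        acc = [a * scale for a in acc]
        for s, v in zip(scores, v_blk):
            w = math.exp(s - new_max)
            running_sum += w
            acc = [a + w * vj for a, vj in zip(acc, v)]
        running_max = new_max
    return [a / running_sum for a in acc]
```

The two functions should agree up to floating-point rounding for any block size, which is the "exact" part. The sketch covers only the forward pass; the recomputation part (rebuilding score blocks during the backward pass instead of storing them) is not shown.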