Reproduction
Reproduction note: the speedup is real but very sequence-length dependent
Reproduced on an A100. The wall-clock gains grow with sequence length (huge past ~2k tokens) but are modest for short sequences, where you're not HBM-bound. The biggest gotcha was that naive PyTorch baselines also vary a lot with dtype (bf16 vs fp16), so make sure the baseline and the fast kernel are timed at the same dtype and sequence length before quoting a speedup multiple.
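For anyone wanting to check this themselves, here is a minimal sketch of the kind of harness I mean, not the exact script I ran. The names `time_fn` and `speedups` are hypothetical helpers; the only assumption about PyTorch is that you pass `torch.cuda.synchronize` as `sync` when timing GPU work, so queued kernels are counted in the measurement.

```python
import statistics
import time


def time_fn(fn, warmup=3, iters=10, sync=None):
    """Return the median wall-clock seconds per call of fn().

    sync is an optional no-argument callable run before each clock read.
    For CUDA work, pass torch.cuda.synchronize (assumption: PyTorch on GPU).
    """
    def clock():
        if sync is not None:
            sync()
        return time.perf_counter()

    # Warmup calls are not timed (allocator, autotuning, caches).
    for _ in range(warmup):
        fn()

    samples = []
    for _ in range(iters):
        start = clock()
        fn()
        samples.append(clock() - start)
    return statistics.median(samples)


def speedups(baseline, candidate):
    """Compute baseline/candidate time ratios per (seq_len, dtype) key.

    Only keys measured in both dicts are compared, so a bf16 baseline is
    never divided by an fp16 candidate (or vice versa).
    """
    shared = sorted(set(baseline) & set(candidate))
    return {key: baseline[key] / candidate[key] for key in shared}
```

Fill two dicts keyed by `(seq_len, dtype_name)` with `time_fn` results, one for the naive baseline and one for the fast kernel, then call `speedups` on them. Reporting the ratio per key, rather than a single headline number, makes the sequence-length dependence visible.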