Reproduction

Reproduction note: the speedup is real but very sequence-length dependent

yukis · 12 days ago

Reproduced on an A100. The wall-clock gains grow with sequence length (huge past ~2k tokens) but are modest for short sequences, where you're not HBM-bound. The biggest gotcha was that naive PyTorch baselines also vary a lot with dtype (bf16 vs fp16), so make sure the baseline is fair before quoting a speedup multiple.

1 Reply

tbecker · 12 days ago

Good callout. A lot of reported 'X times faster' numbers don't hold the baseline's dtype and implementation fixed. Reporting tokens/sec at a fixed config is much more honest.
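
For anyone who wants to report numbers this way, here is a rough sketch of what "tokens/sec at a fixed config" could look like. It is not from the paper or from either commenter's setup: `BenchConfig`, `time_fn`, `tokens_per_sec` and `speedup` are hypothetical helper names, the dtype and implementation fields are plain labels, and the timing loop uses only the Python standard library with an optional `sync` hook for GPU work.

```python
import time
from dataclasses import dataclass
from statistics import median
from typing import Callable, Optional


@dataclass(frozen=True)
class BenchConfig:
    """Hypothetical record of everything held fixed for one measurement."""
    batch_size: int
    seq_len: int
    dtype: str  # label only in this sketch, e.g. "bf16" or "fp16"
    impl: str   # label only in this sketch, e.g. "naive" or "fused"


def time_fn(fn: Callable[[], object], warmup: int = 3, iters: int = 10,
            sync: Optional[Callable[[], None]] = None) -> float:
    """Return the median wall-clock seconds per call of fn.

    sync is an optional hook called after each timed call so that
    asynchronous work (for example queued GPU kernels) is included.
    """
    for _ in range(warmup):
        fn()
    if sync is not None:
        sync()  # drain warmup work before timing starts
    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        if sync is not None:
            sync()
        samples.append(time.perf_counter() - start)
    return median(samples)


def tokens_per_sec(cfg: BenchConfig, seconds_per_iter: float) -> float:
    """Tokens processed per second for one pass over a (batch, seq_len) input."""
    if seconds_per_iter <= 0:
        raise ValueError("seconds_per_iter must be positive")
    return cfg.batch_size * cfg.seq_len / seconds_per_iter


def speedup(base: BenchConfig, base_s: float,
            cand: BenchConfig, cand_s: float) -> float:
    """Speedup of cand over base; refuses to compare mismatched configs."""
    same_shape = (base.batch_size, base.seq_len) == (cand.batch_size, cand.seq_len)
    if not same_shape or base.dtype != cand.dtype:
        raise ValueError("baseline and candidate must share shape and dtype")
    return base_s / cand_s
```

On a GPU you would likely pass `sync=torch.cuda.synchronize` so queued kernels are counted in each sample, and repeat the measurement per sequence length, since the gains discussed above depend heavily on it.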