Reproduction

Reproduction note: the speedup is real but very sequence-length dependent

yukis · Tokyo Tech · about 2 months ago

Reproduced on an A100. The wall-clock gains grow with sequence length: they are huge past ~2k tokens but modest for short sequences, where you're not HBM-bound. The biggest gotcha was that naive PyTorch baselines also vary a lot with dtype (bf16 vs fp16), so make sure the baseline is fair before quoting a speedup multiple.

ML Systems

1 Reply

tbecker · about 2 months ago

Good callout. A lot of reported 'X times faster' numbers don't hold the baseline's dtype and implementation fixed. Reporting tokens/sec at a fixed config is much more honest.
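For anyone who wants to report it that way, here is a rough sketch of what "tokens/sec at a fixed config" could look like. `BenchConfig`, `tokens_per_sec`, and `step_fn` are hypothetical names made up for illustration, not part of any library, and the `dtype` field is only a label here. It assumes `step_fn` runs one full forward pass at exactly the stated config and blocks until the work is done (on a GPU that would mean calling `torch.cuda.synchronize()` before returning).

```python
import statistics
import time
from dataclasses import dataclass


@dataclass(frozen=True)
class BenchConfig:
    # Everything held fixed across the implementations being compared.
    batch_size: int
    seq_len: int
    dtype: str  # e.g. "bf16" or "fp16"; recorded for reporting only


def tokens_per_sec(step_fn, cfg, warmup=3, iters=10):
    """Median throughput of step_fn, in tokens/sec, at the given config."""
    # Warm-up runs are discarded so one-time setup cost is not measured.
    for _ in range(warmup):
        step_fn()
    times = []
    for _ in range(iters):
        start = time.perf_counter()
        step_fn()  # assumed to block until the step has finished
        times.append(time.perf_counter() - start)
    tokens_per_step = cfg.batch_size * cfg.seq_len
    return tokens_per_step / statistics.median(times)


if __name__ == "__main__":
    cfg = BenchConfig(batch_size=8, seq_len=2048, dtype="bf16")
    # Stand-in workload; replace with the real baseline or optimized step.
    rate = tokens_per_sec(lambda: time.sleep(0.01), cfg)
    print(f"{cfg}: {rate:,.0f} tokens/sec")
```

Run the same function over the baseline and the optimized kernel with an identical `BenchConfig`, and report both numbers alongside the config rather than only their ratio. The median is used so a single slow iteration does not skew the result.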