Critique

Is the Transformer comparison matched on data and compute?

tbecker · Independent · about 2 months ago

The headline is parity-or-better with Transformers at linear cost, but for a fair read I'd want the baselines trained on identical tokens with tuned hyperparameters, plus throughput measured on the same hardware. Some of the speed claims depend heavily on the hardware-aware scan implementation — worth separating the architecture win from the kernel win.
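To illustrate what separating the two could look like, here is a minimal sketch, not the paper's benchmark. It times two implementations of the same toy recurrence h_t = a * h_{t-1} + x_t on the same machine, so any throughput gap is purely an implementation effect. The function names, the recurrence, and the input sizes are all assumptions chosen for illustration.

```python
import time
from itertools import accumulate
from statistics import median


def scan_loop(xs, a):
    """Reference 'kernel': explicit loop for h_t = a * h_{t-1} + x_t, h_0 = 0."""
    h, out = 0.0, []
    for x in xs:
        h = a * h + x
        out.append(h)
    return out


def scan_accumulate(xs, a):
    """Alternative 'kernel': the same recurrence via itertools.accumulate."""
    # initial=0.0 seeds h_0; the [1:] drops that seed from the output.
    return list(accumulate(xs, lambda h, x: a * h + x, initial=0.0))[1:]


def tokens_per_second(fn, xs, a, warmup=2, repeats=5):
    """Median-of-repeats throughput for fn on a fixed input (hypothetical harness)."""
    for _ in range(warmup):  # warm-up runs are not timed
        fn(xs, a)
    times = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(xs, a)
        times.append(time.perf_counter() - start)
    return len(xs) / median(times)


# Illustrative input only; not data from the paper.
xs = [0.001 * i for i in range(50_000)]

# Same math, so the outputs must match before comparing speed.
assert scan_loop(xs, 0.9) == scan_accumulate(xs, 0.9)

for fn in (scan_loop, scan_accumulate):
    print(fn.__name__, round(tokens_per_second(fn, xs, 0.9)), "tokens/s")
```

The same pattern extends to the real comparison: hold the architecture fixed and swap only the kernel to get the kernel effect, then hold kernel quality fixed and swap the architecture to get the architecture effect.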

ML Systems
