Critique

Is the Transformer comparison matched on data and compute?

tbecker · 13 days ago

The headline claim is parity-or-better with Transformers at linear cost. For a fair read, I'd want the baselines trained on identical tokens with tuned hyperparameters, and throughput measured on the same hardware. Some of the speed claims depend heavily on the hardware-aware scan implementation, so it's worth separating the architecture win from the kernel win.
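
One way to make that separation concrete is to measure throughput in three configurations on the same hardware, batch size, and sequence length: the Transformer baseline, the new architecture with a plain reference scan, and the new architecture with the fused kernel. The sketch below only does the arithmetic on those three measurements. The function name, argument names, and the three-way setup are my own assumptions for illustration, not something taken from the paper.

```python
def decompose_speedup(baseline_tps, reference_tps, fused_tps):
    """Split an end-to-end throughput gain into two multiplicative factors.

    All arguments are tokens/second, assumed to be measured on the same
    hardware with the same batch size and sequence length (hypothetical setup):
      baseline_tps  - Transformer baseline
      reference_tps - new architecture with an unfused reference scan
      fused_tps     - new architecture with the hardware-aware kernel
    """
    if min(baseline_tps, reference_tps, fused_tps) <= 0:
        raise ValueError("throughputs must be positive")

    # Gain from switching models while both run without the custom kernel.
    architecture = reference_tps / baseline_tps
    # Gain from the custom kernel with the model held fixed.
    kernel = fused_tps / reference_tps

    return {
        "architecture": architecture,
        "kernel": kernel,
        "total": architecture * kernel,  # equals fused_tps / baseline_tps
    }
```

If the "architecture" factor comes out near or below 1 while "total" is large, most of the reported speedup is coming from the kernel, which is exactly the distinction I'd like the comparison to spell out.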
