Critique
Is the Transformer comparison matched on data and compute?
The headline claim is parity-or-better with Transformers at linear cost. For a fair read, I'd want the baselines trained on the same tokens with tuned hyperparameters, and throughput measured on the same hardware. Some of the speed claims depend heavily on the hardware-aware scan implementation, so it is worth separating the architecture win from the kernel win.
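One way to make the architecture-versus-kernel split concrete is to measure three configurations on the same hardware and factor the total speedup. The sketch below is only an illustration of that bookkeeping, not the authors' benchmark: the function names `tokens_per_second` and `decompose_speedup` are hypothetical, and it assumes you can run the new architecture both with a plain (unfused) scan and with the hardware-aware kernel.

```python
import time
from typing import Callable, Dict


def tokens_per_second(step: Callable[[], None], tokens_per_step: int,
                      warmup: int = 3, iters: int = 10) -> float:
    """Run `step` a few times untimed, then time `iters` calls.

    Note: on a GPU, `step` itself would need to block until the device
    finishes (for example by synchronizing inside it), otherwise the
    wall-clock time only measures kernel launch overhead.
    """
    for _ in range(warmup):
        step()
    start = time.perf_counter()
    for _ in range(iters):
        step()
    elapsed = time.perf_counter() - start
    return tokens_per_step * iters / elapsed


def decompose_speedup(baseline_tps: float, naive_tps: float,
                      fused_tps: float) -> Dict[str, float]:
    """Factor the total speedup over the baseline into two parts.

    baseline_tps: Transformer baseline throughput (tokens/s)
    naive_tps:    new architecture with a plain, unfused scan
    fused_tps:    new architecture with the hardware-aware kernel
    """
    return {
        "architecture": naive_tps / baseline_tps,  # model change alone
        "kernel": fused_tps / naive_tps,           # implementation alone
        "total": fused_tps / baseline_tps,         # product of the two
    }


if __name__ == "__main__":
    # Placeholder throughputs chosen only to show the arithmetic;
    # they are NOT measurements from any paper or system.
    parts = decompose_speedup(baseline_tps=100.0, naive_tps=150.0,
                              fused_tps=450.0)
    for name, factor in parts.items():
        print(f"{name}: {factor:.2f}x")
```

For the comparison to be fair, the baseline should also use its best available kernel (for example, a fused attention implementation). Otherwise the "architecture" factor absorbs a kernel advantage on one side only.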