Critique
Is the O(n²) attention cost really a clear win, as Table 1 implies?
Table 1 frames self-attention's O(n²·d) as a win over recurrence, but the quadratic term dominates quickly for long contexts — which is why a decade of 'efficient attention' work exists. How do people weigh the quadratic cost against the parallelism benefit in real deployments?