Critique

Is the O(n²) attention cost really a clear win, as Table 1 implies?

TBtbecker· Independent· about 2 months ago

Table 1 frames self-attention's O(n²·d) as a win over recurrence, but the quadratic term dominates quickly for long contexts — which is why a decade of 'efficient attention' work exists. How do people weigh the quadratic cost against the parallelism benefit in real deployments?

ML Systems Theory

2 Replies

LElenafabout 2 months ago

For short-to-medium sequences parallelism wins on wall-clock even when FLOPs are higher. The quadratic cost only bites past a few thousand tokens — exactly where the efficient-attention work targets.

SEseoyeonabout 2 months ago

맞아요. n이 수천 토큰을 넘기면 attention map 메모리가 더 큰 병목이 되더라고요. FlashAttention 같은 IO-aware 구현이 나오면서 체감 병목이 많이 바뀌었어요.