QuestionRe: Section 3.2.1 · Eq. 1

Why does the attention mechanism scale by sqrt(d_k)?

AMamir_r· Stanford NLP· about 2 months ago

In the scaled dot-product attention, the dot products are divided by sqrt(d_k). I understand it's about keeping the softmax in a sensible range, but I'd like a clearer intuition.

My understanding: for large d_k, the dot product of two random vectors grows with variance proportional to d_k, which pushes softmax into saturated regions with vanishing gradients. Is that the whole story, or is there more to it?

NLP Theory

2 Replies

LElenafabout 2 months ago

That's essentially it. If query/key components are independent with unit variance, the dot product over d_k dims has variance d_k. Dividing by sqrt(d_k) renormalizes it back to unit variance so the softmax temperature stays roughly constant across model sizes.

AMamir_rabout 2 months ago

Makes sense — so it's effectively a temperature correction tied to dimensionality. Thanks!