Why does the attention mechanism scale by sqrt(d_k)?
In the scaled dot-product attention, the dot products are divided by sqrt(d_k). I understand it's about keeping the softmax in a sensible range, but I'd like a clearer intuition.
My understanding: for large d_k, the dot product of two random vectors grows with variance proportional to d_k, which pushes softmax into saturated regions with vanishing gradients. Is that the whole story, or is there more to it?