Explanation
Re: Fig. 3
How should we interpret the multi-head attention visualization?
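People often read a lot into attention heatmaps. I think it's worth being careful: attention weights show which tokens each head weights most heavily, not necessarily which tokens causally drive the model's output. A heavily attended token can still contribute little if its value vector is small or later layers discount it. Curious how others interpret Figure 3-style visualizations.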
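To make the weights-versus-causal-use distinction concrete, here is a toy single-head sketch in NumPy. It is not the model behind Figure 3: the vectors `q`, `K`, `V` and the helper names `attention_weights` and `value_ablation_effect` are made up for illustration, and zeroing a token's value vector is only one of several ways to probe causal influence.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D array.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attention_weights(q, K):
    # Scaled dot-product attention weights for one query over all keys.
    return softmax(K @ q / np.sqrt(q.shape[0]))

def value_ablation_effect(weights, V):
    # L2 change in the head output when each token's value vector is zeroed.
    # (Hypothetical helper: a crude proxy for causal contribution.)
    out = weights @ V
    effects = []
    for i in range(len(weights)):
        V_abl = V.copy()
        V_abl[i] = 0.0
        effects.append(np.linalg.norm(out - weights @ V_abl))
    return np.array(effects)

# Toy inputs (assumed, not taken from any real model).
q = np.array([1.0, 0.0])
K = np.array([[4.0, 0.0],   # token 0: key strongly aligned with the query
              [1.0, 0.0],   # token 1
              [0.0, 1.0]])  # token 2
V = np.array([[0.0, 0.0],   # token 0: carries no value information
              [3.0, 1.0],
              [0.5, 2.0]])

w = attention_weights(q, K)
effect = value_ablation_effect(w, V)

print("attention weights:", np.round(w, 3))
print("ablation effect:  ", np.round(effect, 3))
```

In this setup token 0 would be the brightest cell in a heatmap, yet zeroing its value vector leaves the head output unchanged, while token 1 has a much smaller weight and the largest effect. Real models are messier, but it shows why a heatmap alone doesn't settle what the model is using.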