CritiqueRe: Table 1
Do zero-shot numbers overstate real-world robustness?
Zero-shot accuracy is impressive, but prompt engineering and dataset overlap could inflate it. A fairer protocol would fix prompts in advance and audit train/test overlap. Has anyone quantified the prompt-selection effect?