Question
Why does masked language modeling help downstream tasks so much?
MLM forces the model to use bidirectional context to fill in masked tokens. Is the downstream benefit mainly from bidirectionality, from the sheer scale of self-supervised pretraining, or from the specific masking recipe (15%, 80/10/10)? Would love pointers to ablations that isolate these.