Question

Why does masked language modeling help downstream tasks so much?

jihoon · about 2 months ago

MLM forces the model to use bidirectional context to fill in masked tokens. Is the downstream benefit mainly from bidirectionality, from the sheer scale of self-supervised pretraining, or from the specific masking recipe (15%, 80/10/10)? Would love pointers to ablations that isolate these.

NLP

1 Reply

Accepted answer

amir_r · about 2 months ago

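Bidirectionality is the main conceptual change vs. left-to-right LMs, but later work (RoBERTa) showed that the original BERT was significantly undertrained: much of the improvement came from training longer on more data with larger batches and from dropping NSP (next sentence prediction). So the benefit looks like 'scale + objective' more than the exact masking ratio.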
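
For anyone unsure what the "15%, 80/10/10" recipe in the question refers to, here is a minimal sketch of BERT-style masking in plain Python. The function name `mask_tokens`, its parameters, and the example token ids are assumptions made for illustration, not code from any library. It also simplifies things: special tokens are not excluded from masking, and the random replacement is drawn uniformly from the whole vocabulary.

```python
import random

# Label for positions that do not contribute to the loss. -100 matches the
# default ignore_index of PyTorch's CrossEntropyLoss; any sentinel would work.
IGNORE_INDEX = -100


def mask_tokens(token_ids, vocab_size, mask_id, mask_prob=0.15, seed=None):
    """Hypothetical helper: apply the 15% / 80-10-10 masking recipe.

    Returns (inputs, labels). labels holds the original token at selected
    positions and IGNORE_INDEX everywhere else.
    """
    rng = random.Random(seed)
    inputs = list(token_ids)
    labels = [IGNORE_INDEX] * len(token_ids)

    for i, tok in enumerate(token_ids):
        # Select each position for prediction with probability mask_prob.
        if rng.random() >= mask_prob:
            continue
        labels[i] = tok

        roll = rng.random()
        if roll < 0.8:
            inputs[i] = mask_id                    # 80%: replace with [MASK]
        elif roll < 0.9:
            inputs[i] = rng.randrange(vocab_size)  # 10%: random token
        # remaining 10%: keep the original token unchanged

    return inputs, labels


if __name__ == "__main__":
    # Illustrative values only: a fake 20-token sequence, a 30522-token
    # vocabulary, and 103 standing in for the [MASK] id.
    tokens = list(range(1000, 1020))
    inputs, labels = mask_tokens(tokens, vocab_size=30522, mask_id=103, seed=0)
    print(inputs)
    print(labels)
```

Calling this once per sequence before training gives static masking. Re-running it every time a sequence is fed to the model gives dynamic masking, which is one of the changes RoBERTa made on top of the longer training mentioned above.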