Question

Why does masked language modeling help downstream tasks so much?

jihoon · 11 days ago

MLM forces the model to use bidirectional context to fill in masked tokens. Is the downstream benefit mainly from bidirectionality, from the sheer scale of self-supervised pretraining, or from the specific masking recipe (15%, 80/10/10)? Would love pointers to ablations that isolate these.

1 Reply

Accepted answer

amir_r · 11 days ago

Bidirectionality is the conceptual change versus left-to-right LMs, but later work (RoBERTa) showed that a lot of BERT's gains came from training longer on more data and dropping next-sentence prediction (NSP). So the benefit is more 'scale + objective' than the exact masking ratio.
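
For anyone unfamiliar with the recipe in the question (15%, 80/10/10), here is a rough sketch of what it does to one token sequence. This is an illustration, not BERT's actual preprocessing code: `MASK_ID`, `VOCAB_SIZE`, and the function name `mask_tokens` are assumptions, each position is selected independently with probability 0.15 (an approximation of picking 15% of tokens), and special tokens are not excluded.

```python
import random

MASK_ID = 103        # assumed id of the [MASK] token (hypothetical)
VOCAB_SIZE = 30522   # assumed vocabulary size (hypothetical)
IGNORE_INDEX = -100  # label for positions that do not count toward the loss
                     # (borrowed from PyTorch's default ignore_index)


def mask_tokens(token_ids, mask_prob=0.15, rng=None):
    """Return (inputs, labels) after applying the 15% / 80-10-10 recipe."""
    rng = rng or random.Random()
    inputs = list(token_ids)
    labels = [IGNORE_INDEX] * len(token_ids)

    for i, tok in enumerate(token_ids):
        # Select roughly 15% of positions as prediction targets.
        if rng.random() >= mask_prob:
            continue
        labels[i] = tok  # the model must recover the original token here

        r = rng.random()
        if r < 0.8:
            inputs[i] = MASK_ID                    # 80%: replace with [MASK]
        elif r < 0.9:
            inputs[i] = rng.randrange(VOCAB_SIZE)  # 10%: replace with a random token
        # remaining 10%: leave the token unchanged

    return inputs, labels
```

Calling `mask_tokens(ids)` afresh every time a sequence is sampled gives a new mask pattern on each pass, whereas calling it once during preprocessing fixes the pattern. That makes the masking ratio and the 80/10/10 split easy knobs to vary in an ablation.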