Explanation
What BERT actually changed vs earlier contextual embeddings
ELMo produced contextual embeddings too, but they came from a comparatively shallow bidirectional LSTM and were mostly used as frozen features fed into a separate task-specific model. BERT's shift is fine-tuning the whole deep Transformer end-to-end with only a small task head on top, so the representation and the task adaptation are learned together. That is the part that made it a default starting point.
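One way to make the contrast concrete is to count which parameters actually get updated on the downstream task in each regime. The sketch below is only bookkeeping, not a training loop: the names `ParamGroup`, `feature_based`, `fine_tuning`, `trainable_fraction`, and `share_of` are hypothetical, and the sizes are illustrative assumptions (roughly 110M parameters for a BERT-base-sized encoder, a hidden size of 768, two output labels, and an arbitrary 5M-parameter task model for the feature-based case, with the same encoder size reused there for simplicity).

```python
from dataclasses import dataclass


@dataclass
class ParamGroup:
    name: str
    num_params: int
    trainable: bool  # does this group receive gradient updates on the task?


def feature_based(encoder_params, task_model_params):
    # ELMo-style use: the pretrained encoder stays frozen and its outputs
    # are fed as features into a separately designed task model.
    return [
        ParamGroup("pretrained_encoder", encoder_params, trainable=False),
        ParamGroup("task_model", task_model_params, trainable=True),
    ]


def fine_tuning(encoder_params, head_params):
    # BERT-style use: the whole pretrained encoder is updated together
    # with a small task head stacked on top of it.
    return [
        ParamGroup("pretrained_encoder", encoder_params, trainable=True),
        ParamGroup("task_head", head_params, trainable=True),
    ]


def trainable_fraction(groups):
    # Fraction of all parameters that are updated during task training.
    total = sum(g.num_params for g in groups)
    updated = sum(g.num_params for g in groups if g.trainable)
    return updated / total


def share_of(groups, name):
    # Fraction of all parameters that live in the named group.
    total = sum(g.num_params for g in groups)
    return sum(g.num_params for g in groups if g.name == name) / total


# Illustrative sizes only (assumptions, not measurements).
ENCODER_PARAMS = 110_000_000      # roughly a BERT-base-sized encoder
HIDDEN, NUM_LABELS = 768, 2       # assumed hidden size and label count
HEAD_PARAMS = HIDDEN * NUM_LABELS + NUM_LABELS  # one linear layer: weights + biases
TASK_MODEL_PARAMS = 5_000_000     # hypothetical task-specific model

elmo_style = feature_based(ENCODER_PARAMS, TASK_MODEL_PARAMS)
bert_style = fine_tuning(ENCODER_PARAMS, HEAD_PARAMS)

print("feature-based, fraction updated:", trainable_fraction(elmo_style))
print("fine-tuning, fraction updated:  ", trainable_fraction(bert_style))
print("fine-tuning, head share:        ", share_of(bert_style, "task_head"))
```

Under these assumptions the fine-tuning setup updates every parameter while the task head is a negligible share of the model, whereas the feature-based setup leaves the large pretrained encoder untouched and puts all of the task-specific learning into the separate model.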