- Optimized training algorithms: Adaptive optimizers such as Adam and RMSProp have improved the efficiency and speed of LLM training (see the optimizer sketch after this list).
- Data augmentation: Techniques such as paraphrasing and noise injection squeeze more training signal out of smaller datasets, which has helped researchers train larger language models without proportionally more data (a minimal augmentation sketch also follows this list).
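To make the optimizer point concrete, here is a minimal sketch using PyTorch. The tiny model, learning rates, and random data are placeholders chosen only to show how an adaptive optimizer plugs into a training loop; they are not any particular LLM recipe.

```python
import torch
import torch.nn as nn

# Toy stand-in for a language model; sizes are illustrative only.
model = nn.Sequential(
    nn.Embedding(1000, 64),   # vocab of 1000, embedding dim 64
    nn.Flatten(),             # flatten the 8-token context window
    nn.Linear(64 * 8, 1000),  # predict the next token over the vocab
)

# Adam keeps running estimates of the gradient's first and second moments,
# giving each parameter its own effective step size.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, betas=(0.9, 0.999))
# torch.optim.RMSprop(model.parameters(), lr=1e-3, alpha=0.99) is a drop-in alternative.

loss_fn = nn.CrossEntropyLoss()

for step in range(100):
    # Random token sequences and next-token targets as dummy data.
    tokens = torch.randint(0, 1000, (32, 8))
    targets = torch.randint(0, 1000, (32,))

    logits = model(tokens)
    loss = loss_fn(logits, targets)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()  # adaptive per-parameter update
```

Swapping the optimizer line between Adam and RMSprop is the only change needed to compare the two; everything else in the loop stays the same.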
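And here is a minimal sketch of noise-injection augmentation in plain Python. The `noise_inject` helper, its drop/replace rates, and the toy vocabulary are illustrative assumptions, not a specific published recipe; paraphrasing-based augmentation would typically rely on a separate paraphrase model and is not shown.

```python
import random

def noise_inject(tokens, vocab, p_drop=0.1, p_replace=0.1, seed=None):
    """Token-level noise injection: randomly drop or replace tokens.

    A simple illustration of the idea; real pipelines tune these rates
    and often combine noise injection with model-based paraphrasing.
    """
    rng = random.Random(seed)
    augmented = []
    for tok in tokens:
        r = rng.random()
        if r < p_drop:
            continue                              # drop the token
        if r < p_drop + p_replace:
            augmented.append(rng.choice(vocab))   # replace with a random vocab token
        else:
            augmented.append(tok)                 # keep the token unchanged
    return augmented

# Produce several noisy variants of one sentence to enlarge a small dataset.
vocab = ["the", "a", "model", "language", "large", "trains", "data", "fast"]
sentence = "the large language model trains fast".split()
for i in range(3):
    print(" ".join(noise_inject(sentence, vocab, seed=i)))
```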
For example, the original BERT models were pretrained in about four days on Cloud TPU pods, while RoBERTa was pretrained in roughly a day, albeit on a much larger cluster of around 1,000 V100 GPUs. Better optimizers, refined training recipes, and larger hardware clusters together cut wall-clock pretraining time, which has made it practical to train larger and more complex language models.