2. Quality Degradation Over Generations
- Compounding errors: A 2023 study by researchers at Rice University and Stanford found that over-reliance on synthetic data can lead to models with decreasing quality or diversity over successive generations of training.
- Sampling bias: when synthetic samples under-represent parts of the real-world distribution (rare classes, tail events), each retraining round narrows the model's coverage further, so diversity worsens after multiple generations of training.
- Mitigation strategy: the study suggests that mixing real-world data into each generation's training set helps counteract this degradation.
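The degradation-and-mitigation dynamic above can be illustrated with a toy simulation (this is an illustrative sketch, not the study's actual experiment): the "model" is just a Gaussian refit on its own samples each generation. Refitting purely on synthetic samples lets estimation error compound, shrinking the learned variance (a loss of diversity), while mixing fresh real data back in anchors the fit. All names and parameters here are invented for the example.

```python
import random
import statistics

def sample(mean, std, n):
    """Draw n samples from the current 'model' (a Gaussian)."""
    return [random.gauss(mean, std) for _ in range(n)]

def fit(data):
    """'Training' = fitting a Gaussian (mean, std) to the data."""
    return statistics.fmean(data), statistics.pstdev(data)

def run_loop(generations=200, n=50, real_fraction=0.0):
    """Repeatedly refit the model on its own samples; optionally
    mix a fraction of fresh real N(0, 1) data each generation."""
    mean, std = fit(sample(0.0, 1.0, n))  # generation 0: real data
    k = int(n * real_fraction)            # real samples per generation
    for _ in range(generations):
        data = sample(mean, std, n - k) + sample(0.0, 1.0, k)
        mean, std = fit(data)
    return std  # learned spread; lower than 1.0 = diversity lost

random.seed(0)
trials = 100  # average over trials to smooth out run-to-run noise
synthetic_only = statistics.fmean(run_loop() for _ in range(trials))
half_real = statistics.fmean(
    run_loop(real_fraction=0.5) for _ in range(trials)
)
print(f"avg std, fully synthetic loop: {synthetic_only:.3f}")
print(f"avg std, 50% real data mixed in: {half_real:.3f}")
```

The fully synthetic loop's learned standard deviation drifts well below the true value of 1.0, while the loop that mixes real data back in stays close to it, mirroring the study's observation that real-world data counteracts generational collapse.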