In a previous study, Nvidia researchers demonstrated the effectiveness of combining pruning with classical knowledge distillation. They started with the Nemotron 15B model and progressively pruned and distilled it down to an 8-billion parameter model. They then performed a light retraining procedure using model distillation with the original model as the teacher and the pruned model as the student. Finally, they repeated the process with the 8B model as the starting point to create a smaller 4B model.
This approach resulted in a 16% improvement in performance on the popular MMLU benchmark compared to training a 4-billion parameter model from scratch. Impressively, the entire process required 40X fewer tokens than training the model from scratch. The model’s performance was comparable to Mistral 7B, Gemma 7B, and Llama-3 8B, which were trained on trillions of tokens.