The power of pruning and distillation
Pruning and distillation are two key techniques for creating smaller, more efficient language models. Pruning involves removing less important components of a model: “depth pruning” removes entire layers, while “width pruning” drops specific elements such as neurons and attention heads.
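As a concrete illustration, here is a minimal sketch of depth pruning, assuming a Hugging Face `transformers` Llama-style model whose decoder blocks live in `model.model.layers`; the attribute names and the choice of which layers to keep are illustrative assumptions, and a real pipeline would first estimate layer importance before deciding what to drop.

```python
# A minimal sketch of depth pruning, assuming a Llama-style model from the
# Hugging Face `transformers` library, where the decoder blocks are stored
# in `model.model.layers`. The layer indices below are purely illustrative.
import torch.nn as nn

def depth_prune(model, layers_to_keep):
    """Keep only the decoder layers whose indices are in `layers_to_keep`."""
    kept = nn.ModuleList(
        layer for i, layer in enumerate(model.model.layers) if i in layers_to_keep
    )
    model.model.layers = kept
    # Keep the config consistent with the new, shallower architecture.
    model.config.num_hidden_layers = len(kept)
    # Note: a production implementation would also re-index any per-layer
    # attributes (e.g. those used for KV caching) and re-estimate importance.
    return model

# Hypothetical usage (model name and layer selection are assumptions):
# from transformers import AutoModelForCausalLM
# model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")
# model = depth_prune(model, layers_to_keep=set(range(0, 32, 2)))
```

Width pruning works analogously, except that instead of dropping whole layers it removes rows and columns of weight matrices, corresponding to individual neurons or attention heads.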
Model distillation is a technique that transfers knowledge and capabilities from a large model—often called the “teacher model”—to a smaller, simpler “student model.” There are two main approaches to distillation. The first is finetuning on synthetic data, where the student is trained on prompts and responses generated by the teacher. The second is classical knowledge distillation, where the student is trained to match not only the teacher’s outputs but also its internal activations, such as the output logits and intermediate hidden states.
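To make the second approach concrete, here is a minimal sketch of a classical knowledge-distillation loss in PyTorch, assuming the student and teacher share a vocabulary; the temperature `T`, the mixing weights `alpha` and `beta`, and the optional hidden-state term are illustrative choices rather than a prescribed recipe.

```python
# A minimal sketch of a classical knowledge-distillation loss in PyTorch.
# The student learns from the teacher's softened output distribution (KL term),
# from the ground-truth labels (cross-entropy term), and optionally from the
# teacher's intermediate hidden states (MSE term). `T`, `alpha`, and `beta`
# are illustrative hyperparameters.
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      student_hidden=None, teacher_hidden=None,
                      T=2.0, alpha=0.5, beta=0.1):
    # Soft-target loss: KL divergence between temperature-scaled distributions.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)

    # Hard-target loss: ordinary cross-entropy against the true labels.
    hard = F.cross_entropy(
        student_logits.view(-1, student_logits.size(-1)), labels.view(-1)
    )

    loss = alpha * soft + (1.0 - alpha) * hard

    # Optional activation matching: pull the student's intermediate hidden
    # states toward the teacher's (assumes matching shapes, or that a
    # projection has already been applied).
    if student_hidden is not None and teacher_hidden is not None:
        loss = loss + beta * F.mse_loss(student_hidden, teacher_hidden)

    return loss
```

In practice, pruning and distillation are often combined: the pruned model serves as the student, and the original model as the teacher that restores much of the lost accuracy.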