Part 1/6:
Quantization: Compressing Language Models for Efficient Inference
Reducing Model Size and Resource Requirements
Large language models like NVIDIA's 70.6-billion-parameter Llama 3.1 Nemotron take up a significant amount of storage. The original model files can exceed 30 GB each. However, quantized versions of these models can dramatically reduce the file size and resource requirements.
For example, Llama 3.1 Nemotron has a quantized 4-bit version that takes up only about 37.4 GB, split across 8 files of roughly 5 GB each. The parameter count is unchanged; each parameter simply occupies fewer bits. Quantization is the technique of mapping the original high-precision weights and activations to a smaller, lower-precision data type.
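To make the mapping concrete, here is a minimal sketch of symmetric linear quantization using NumPy. It compresses float32 values to int8 (the same idea extends to 4-bit); the function names and the example array are illustrative, not from any particular library.

```python
import numpy as np

def quantize_int8(weights):
    # Symmetric linear quantization: pick a scale so the largest
    # absolute weight maps to 127, then round every weight to the
    # nearest int8 step. Storage drops from 4 bytes to 1 byte per value.
    scale = np.max(np.abs(weights)) / 127.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    # Recover an approximation of the original weights at inference time.
    return q.astype(np.float32) * scale

w = np.array([0.5, -1.2, 0.03, 0.9], dtype=np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
```

Each dequantized value differs from the original by at most half a quantization step (`scale / 2`), which is why aggressive quantization trades a small amount of accuracy for a large reduction in memory.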