Part 2/6:
To understand quantization, we need to look at how the internal weights of language models are typically stored. These weights are often represented using floating-point numbers, which can efficiently represent a wide range of values in a compact binary format.
Floating-point numbers use a sign bit, an exponent, and a mantissa (or significand) to encode a value. This allows them to represent both large and small numbers with reasonable precision. The standard 32-bit floating-point format (IEEE 754 single-precision) uses 1 bit for the sign, 8 bits for the exponent, and 23 bits for the mantissa.
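As a minimal sketch of that layout, the Python snippet below (the helper name `float32_bits` is just for illustration) packs a value into its 32-bit representation and masks out the three fields:

```python
import struct

def float32_bits(value: float) -> tuple[int, int, int]:
    """Split a float into its IEEE 754 single-precision fields."""
    # Pack as a 32-bit float, then reinterpret the same 4 bytes as an unsigned int.
    [as_int] = struct.unpack(">I", struct.pack(">f", value))
    sign = (as_int >> 31) & 0x1        # 1 bit
    exponent = (as_int >> 23) & 0xFF   # 8 bits (biased by 127)
    mantissa = as_int & 0x7FFFFF       # 23 bits (fraction, with an implicit leading 1)
    return sign, exponent, mantissa

print(float32_bits(-0.15625))
# (1, 124, 2097152): sign = 1, biased exponent 124 (i.e. 2^-3),
# and fraction bits encoding 0.25, so the value is -1.25 * 2^-3.
```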
However, storing all the weights of a large language model in 32-bit floating-point format quickly adds up to tens of gigabytes for the weights alone. This is where quantization comes in.
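To get a feel for the scale, here is a rough back-of-the-envelope sketch; the 7-billion-parameter figure is an illustrative assumption, not a reference to any specific model:

```python
def model_size_gb(num_params: float, bits_per_weight: int) -> float:
    """Approximate storage needed for the weights alone."""
    return num_params * bits_per_weight / 8 / 1e9

# A hypothetical 7-billion-parameter model at common precisions.
for bits in (32, 16, 8, 4):
    print(f"{bits:>2} bits per weight: ~{model_size_gb(7e9, bits):.1f} GB")
# 32 bits per weight: ~28.0 GB
# 16 bits per weight: ~14.0 GB
#  8 bits per weight: ~7.0 GB
#  4 bits per weight: ~3.5 GB
```

Halving the number of bits per weight halves the storage, which is exactly the lever quantization pulls.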