Part 2/6:
To understand quantization, we need to look at how the internal weights of language models are typically stored. These weights are often represented using floating-point numbers, which can efficiently represent a wide range of values in a compact binary format.
Floating-point numbers use a sign bit, an exponent, and a mantissa (or significand) to encode a value. This allows them to represent both large and small numbers with reasonable precision. The standard 32-bit floating-point format (IEEE 754 single-precision) uses 1 bit for the sign, 8 bits for the exponent, and 23 bits for the mantissa.
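As a minimal sketch of that layout, the Python snippet below (the helper name `float32_bits` is just for illustration) packs a value into its 32-bit representation and masks out the three fields:

```python
import struct

def float32_bits(value: float) -> tuple[int, int, int]:
    """Split a float into its IEEE 754 single-precision fields."""
    # Pack as a 32-bit float, then reinterpret the same 4 bytes as an unsigned int.
    [as_int] = struct.unpack(">I", struct.pack(">f", value))
    sign = (as_int >> 31) & 0x1        # 1 bit
    exponent = (as_int >> 23) & 0xFF   # 8 bits (biased by 127)
    mantissa = as_int & 0x7FFFFF       # 23 bits (fraction, with an implicit leading 1)
    return sign, exponent, mantissa

print(float32_bits(-0.15625))
# (1, 124, 2097152): sign = 1, biased exponent 124 (i.e. 2^-3),
# and fraction bits encoding 0.25, so the value is -1.25 * 2^-3.
```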
However, storing all the weights of a large language model in 32-bit floating-point format quickly adds up to tens of gigabytes for the weights alone. This is where quantization comes in.
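To get a feel for the scale, here is a rough back-of-the-envelope sketch; the 7-billion-parameter figure is an illustrative assumption, not a reference to any specific model:

```python
def model_size_gb(num_params: float, bits_per_weight: int) -> float:
    """Approximate storage needed for the weights alone."""
    return num_params * bits_per_weight / 8 / 1e9

# A hypothetical 7-billion-parameter model at common precisions.
for bits in (32, 16, 8, 4):
    print(f"{bits:>2} bits per weight: ~{model_size_gb(7e9, bits):.1f} GB")
# 32 bits per weight: ~28.0 GB
# 16 bits per weight: ~14.0 GB
#  8 bits per weight: ~7.0 GB
#  4 bits per weight: ~3.5 GB
```

Halving the number of bits per weight halves the storage, which is exactly the lever quantization pulls.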