Most models today are trained at 16-bit, or “half precision,” and then “post-train quantized” to 8-bit precision. That is, certain model components (e.g., the parameters) are converted to a lower-precision format at the cost of some accuracy. Think of it like doing the math to a few decimal places and then rounding off to the nearest tenth, which often gives you the best of both worlds.
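To make the rounding analogy concrete, here is a minimal sketch of one common scheme, symmetric 8-bit integer quantization with a single scale per tensor. The function names and the sample weights are made up for illustration, and real toolchains typically use more elaborate variants (per-channel scales, calibration data, and so on).

```python
def quantize_int8(values):
    """Map floats to signed 8-bit integers using one shared scale (symmetric scheme)."""
    max_abs = max(abs(v) for v in values)
    # Guard against an all-zero input, where any scale would do.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    # Round to the nearest integer step and clamp to the int8 range used here.
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate floats from the stored integers."""
    return [q * scale for q in quantized]


# Hypothetical weights, chosen only for illustration.
weights = [0.8213, -0.4471, 0.0312, -1.2704, 0.5566]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# The accuracy cost: each value can be off by up to half a quantization step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print("int8 values:", q)
print("scale:", scale)
print("largest rounding error:", max_error)
```

Each weight now fits in one byte instead of two, and the price is a rounding error of at most half a step (`scale / 2`) per value.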
Hardware vendors like Nvidia are pushing for lower precision for quantized model inference. The company’s new Blackwell chip supports 4-bit precision, specifically a data type called FP4; Nvidia has pitched this as a boon for memory- and power-constrained data centers.
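The memory argument comes down to simple arithmetic, sketched below for a hypothetical 8-billion-parameter model. The parameter count is an assumption chosen for illustration, and the estimate covers only the raw weights: it ignores quantization scale factors, activations, and other runtime memory.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Approximate storage for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


# Hypothetical model size, not a figure from any specific model.
NUM_PARAMS = 8_000_000_000

for bits in (16, 8, 4):
    print(f"{bits}-bit weights: {weight_memory_gb(NUM_PARAMS, bits):.1f} GB")
```

Halving the bits per parameter halves the weight storage, which is the appeal for memory-constrained deployments.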