The effects are already manifesting. A few months ago, developers and academics reported that quantizing Meta’s Llama 3 model tended to be “more harmful” compared to other models, potentially due to the way it was trained.
“In my opinion, the number one cost for everyone in AI is and will continue to be inference, and our work shows one important way to reduce it will not work forever,” Tanishq Kumar, a Harvard mathematics student and the first author on the paper, told TechCrunch.