TechCrunch’s Kyle Wiggers dug into this problem last month and spoke to Sheridan Feucht, a PhD student at Northeastern University studying LLM interpretability.
“It’s kind of hard to get around the question of what exactly a ‘word’ should be for a language model, and even if we got human experts to agree on a perfect token vocabulary, models would probably still find it useful to ‘chunk’ things even further,” Feucht told TechCrunch. “My guess would be that there’s no such thing as a perfect tokenizer due to this kind of fuzziness.”
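Feucht's "chunking" point is easy to see in practice. Here's a minimal sketch, assuming OpenAI's `tiktoken` library and its `cl100k_base` encoding (the vocabulary used by GPT-4-era models), that shows how the same tokenizer assigns different numbers of sub-word tokens to different words, with boundaries driven by corpus statistics rather than any linguistic notion of a word:

```python
# A minimal sketch using tiktoken's cl100k_base encoding to show
# how words get "chunked" into sub-word tokens.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

for word in ["cat", "tokenization", "antidisestablishmentarianism"]:
    token_ids = enc.encode(word)
    # decode_single_token_bytes reveals the raw byte chunk behind each token ID
    chunks = [enc.decode_single_token_bytes(t) for t in token_ids]
    print(f"{word!r} -> {len(token_ids)} token(s): {chunks}")
```

Because the vocabulary is learned from frequency statistics, the boundaries it picks are exactly the kind of fuzzy, somewhat arbitrary chunks Feucht describes, and a different training corpus would yield different splits.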