This efficiency is evident in benchmarks: with a computational budget of 30B tokens, TokenFormer achieved a perplexity of 11.77, compared to 13.34 for a Transformer trained from scratch (lower perplexity means better language modeling).
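For reference, perplexity is just the exponential of the average per-token negative log-likelihood. A minimal sketch, using illustrative loss values rather than TokenFormer's actual per-token losses:

```python
import math

def perplexity(neg_log_likelihoods):
    """Perplexity is exp of the mean per-token negative
    log-likelihood (in nats), so lower is better."""
    return math.exp(sum(neg_log_likelihoods) / len(neg_log_likelihoods))

# Illustrative: an average loss of ~2.47 nats per token corresponds to
# a perplexity near 11.8, while ~2.59 nats gives roughly 13.3.
print(perplexity([2.47, 2.46, 2.48]))  # ~11.8
print(perplexity([2.59, 2.60, 2.58]))  # ~13.3
```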