TokenFormer cuts training costs drastically: compared with retraining a traditional Transformer from scratch, scaling up reportedly needs only about one-tenth of the compute budget. For example, the authors grow a model from 124M to 1.4B parameters incrementally while matching the performance of a Transformer trained from scratch at that size.
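The rough idea is that TokenFormer replaces the fixed linear projections with attention over learnable "parameter tokens", so scaling up means appending new parameter tokens while reusing the old ones instead of retraining everything. Here is a minimal PyTorch sketch of that idea; the class and method names (`Pattention`, `grow`) are my own placeholders, not the paper's code, and the normalization is simplified:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Pattention(nn.Module):
    """Token-parameter attention (sketch): input tokens attend to learnable
    key/value parameter tokens instead of passing through a fixed linear layer."""
    def __init__(self, dim: int, num_param_tokens: int):
        super().__init__()
        self.param_keys = nn.Parameter(torch.randn(num_param_tokens, dim) * 0.02)
        self.param_values = nn.Parameter(torch.randn(num_param_tokens, dim) * 0.02)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, seq, dim)
        scores = x @ self.param_keys.t()                  # attend to parameter tokens
        weights = F.softmax(scores / x.shape[-1] ** 0.5, dim=-1)
        return weights @ self.param_values

    def grow(self, extra_tokens: int) -> None:
        """Scale the layer by appending zero-initialized parameter tokens;
        the existing parameters are kept unchanged, so prior training is reused."""
        dim = self.param_keys.shape[1]
        new_k = torch.zeros(extra_tokens, dim)
        new_v = torch.zeros(extra_tokens, dim)
        self.param_keys = nn.Parameter(torch.cat([self.param_keys.data, new_k]))
        self.param_values = nn.Parameter(torch.cat([self.param_values.data, new_v]))

# Usage sketch: start small, then enlarge the layer and continue training
layer = Pattention(dim=256, num_param_tokens=512)
layer.grow(extra_tokens=1024)   # model capacity increases without a restart
```

Because the new tokens start at zero, the grown layer initially computes the same function as before, which is why continued training is much cheaper than starting over at the larger size.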