For example, Meta trained Llama 3 on a set of 15 trillion tokens. (Tokens represent bits of raw data; 1 million tokens is equal to about 750,000 words.) The previous generation, Llama 2, was trained on “only” 2 trillion tokens.
Evidence suggests that scaling up eventually provides diminishing returns; Anthropic and Google reportedly recently trained enormous models that fell short of internal benchmark expectations. But there’s little sign that the industry is ready to meaningfully move away from these entrenched scaling approaches.