Dziri’s team thought that maybe the LLMs simply hadn’t seen enough examples in their training data, so they fine-tuned GPT-3 on 1.8 million examples of multiplying two numbers. Then, when they showed it new problems, the LLM aced them — but only if they were sufficiently similar to what it had seen during training. For example, the training data included the multiplication of two three-digit numbers, and of a two-digit number with a four-digit number, but when the model was asked to multiply a four-digit number with a three-digit number, it succeeded only 2% of the time. “If they are truly reasoning and understanding certain tasks, they should get the implicit algorithm,” Dziri said. That’s not what her team saw. “That raises a lot of questions about how LLMs perform tasks and whether they’re doing true reasoning.”
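To make the kind of test described here concrete, below is a rough sketch (not the team's actual code) of how a digit-length generalization check can be scored: problems are grouped by the operands' digit counts, and only exact matches with the true product count as correct. The `toy_predict` stand-in is hypothetical; a real evaluation would query the fine-tuned model instead.

```python
import random

def make_problems(d1: int, d2: int, n: int = 100, seed: int = 0):
    """Generate n multiplication problems with d1-digit and d2-digit operands."""
    rng = random.Random(seed)
    lo1, hi1 = 10 ** (d1 - 1), 10 ** d1 - 1
    lo2, hi2 = 10 ** (d2 - 1), 10 ** d2 - 1
    return [(rng.randint(lo1, hi1), rng.randint(lo2, hi2)) for _ in range(n)]

def exact_match_accuracy(predict, problems):
    """Fraction of problems where the predicted answer equals the true product."""
    correct = sum(1 for a, b in problems if predict(a, b) == a * b)
    return correct / len(problems)

def toy_predict(a: int, b: int) -> int:
    # Placeholder predictor (always correct); swap in a call to the model under test.
    return a * b

if __name__ == "__main__":
    # Digit-length pairs like those seen in training, plus a held-out combination.
    seen = [(3, 3), (2, 4)]
    held_out = [(4, 3)]
    for d1, d2 in seen + held_out:
        acc = exact_match_accuracy(toy_predict, make_problems(d1, d2))
        print(f"{d1}-digit x {d2}-digit: exact-match accuracy {acc:.0%}")
```

The point of scoring by digit-length bucket is that it separates memorized patterns (combinations seen in training) from the kind of generalization a learned multiplication algorithm would provide.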