RE: LeoThread 2025-02-01 10:54

Take basic multiplication. Standard LLMs, such as ChatGPT and GPT-4, fail badly at it. In early 2023, when Dziri’s team asked GPT-4 to multiply two three-digit numbers, it initially succeeded only 59% of the time. When it multiplied two four-digit numbers, accuracy fell to just 4%.
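
For anyone who wants to try this kind of test themselves, here is a rough sketch of how exact-match accuracy on n-digit multiplication could be measured. It is not the team's actual setup: `ask_model` is a hypothetical stand-in for whatever function sends the question to an LLM and returns its reply as text, and the trial count, seed, and comma-stripping are my own assumptions.

```python
import random
from typing import Callable


def multiplication_accuracy(
    ask_model: Callable[[int, int], str],  # hypothetical: returns the model's answer as text
    digits: int,
    trials: int = 100,
    seed: int = 0,
) -> float:
    """Fraction of random digits-by-digits products the model answers exactly."""
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    low, high = 10 ** (digits - 1), 10 ** digits - 1  # e.g. 100..999 for digits=3
    correct = 0
    for _ in range(trials):
        a, b = rng.randint(low, high), rng.randint(low, high)
        reply = ask_model(a, b)
        # Exact match only; tolerate surrounding whitespace and thousands separators.
        if reply.strip().replace(",", "") == str(a * b):
            correct += 1
    return correct / trials


def perfect_stub(a: int, b: int) -> str:
    """Stand-in 'model' that always multiplies correctly (for checking the harness)."""
    return str(a * b)
```

Calling `multiplication_accuracy(perfect_stub, 3)` checks the harness itself; swapping in a real model call for `ask_model` and comparing `digits=3` against `digits=4` would give numbers comparable in spirit to the ones quoted here.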

The team also tested the LLMs on tasks like Einstein’s riddle, where they again had limited success. GPT-4 always got the right answer when the puzzle involved two houses with two attributes per house. But accuracy fell to 10% when the puzzle grew to four houses with four attributes per house. For the original version in Life International (five houses, each with five attributes), the success rate was 0%.