In an evaluation designed to test an AI agent’s ability to help with airline booking tasks, like modifying a flight reservation, the new 3.5 Sonnet managed to complete less than half of the tasks successfully. In a separate test involving tasks like initiating a return, 3.5 Sonnet failed roughly a third of the time.
You are viewing a single comment's thread from: