RE: LeoThread 2024-10-22 21:22

in LeoFinance · 4 months ago

Current results are telling: leading LLMs, including OpenAI's latest models and Anthropic's Claude 3.5 Sonnet, achieve only a 21% success rate on the ARC public leaderboard. Even with more sophisticated approaches reaching 50%, these scores remain well below the human performance level of over 90%.

Scale/CAIS Initiative

The Humanity's Last Exam project takes a different approach by:

  • Crowdsourcing test questions from a broad coalition of experts
  • Keeping winning questions private so AI systems cannot "study" for the test
  • Creating a more dynamic and unpredictable evaluation framework