Current results are telling: leading LLMs, including OpenAI's latest models and Anthropic's Claude 3.5 Sonnet, achieve only around 21% success rates on the ARC public leaderboard. Even the most sophisticated approaches reach roughly 50%, still well below human performance of over 90%.
Scale AI/CAIS Initiative
The Humanity's Last Exam project, a collaboration between Scale AI and the Center for AI Safety (CAIS), takes a different approach by:
- Crowdsourcing test questions from a broad coalition of domain experts
- Keeping winning questions private to prevent AI systems from "studying" for the test
- Creating a more dynamic and unpredictable evaluation framework