Current results are telling: leading LLMs, including OpenAI's latest models and Anthropic's Claude 3.5 Sonnet, achieve only around 21% success rates on the ARC public leaderboard. Even the most sophisticated approaches reach roughly 50%, still well below human performance of over 90%.
Scale AI/CAIS Initiative
The Humanity's Last Exam project, a collaboration between Scale AI and the Center for AI Safety (CAIS), takes a different approach by:
- Crowdsourcing test questions from a broad coalition of domain experts
- Keeping winning questions private to prevent AI systems from "studying" for the test
- Creating a more dynamic and unpredictable evaluation framework