AI Has a Secret: We’re Still Not Sure How to Test for Human Levels of Intelligence
We need to know when machines are getting close to human-level reasoning, with all the safety, ethical, and moral questions this raises.
Two of San Francisco’s leading players in artificial intelligence have challenged the public to come up with questions capable of testing the capabilities of large language models (LLMs) like Google Gemini and OpenAI’s o1. Scale AI, which specializes in preparing the vast tracts of data on which the LLMs are trained, teamed up with the Center for AI Safety (CAIS) to launch the initiative, Humanity’s Last Exam.
The Challenge of Testing AI: A New Frontier in Intelligence Assessment
The project, which offers $5,000 prizes for the top 50 selected questions, aims to create new ways to evaluate advanced AI systems, particularly as traditional testing methods become increasingly inadequate.
The Current Testing Dilemma
The challenge facing AI evaluation is multifaceted. Modern large language models such as Google Gemini and OpenAI's latest offerings already excel at conventional tests in fields ranging from intelligence to law. However, this success raises a crucial question: are these achievements meaningful if the AI systems may have already encountered the test content during their training?
The problem is set to intensify. According to Epoch AI's projections, by 2028, AI systems will have effectively processed all human-written content. This milestone presents a fundamental challenge in continuing to assess AI capabilities accurately.
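To make the contamination worry concrete, here is a minimal sketch of one common heuristic: checking how many of a benchmark question's word n-grams already appear in a training corpus. The function names and sample strings are illustrative assumptions, not part of any real evaluation pipeline.

```python
# Minimal sketch of a contamination heuristic: measure how many of a benchmark
# question's word n-grams already occur in a training corpus. All names and
# sample texts below are illustrative only.

def ngrams(text: str, n: int = 8) -> set:
    """Return the set of word n-grams in a lowercased text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def overlap_ratio(question: str, corpus: str, n: int = 8) -> float:
    """Fraction of the question's n-grams that also occur in the corpus."""
    question_grams = ngrams(question, n)
    if not question_grams:
        return 0.0
    corpus_grams = ngrams(corpus, n)
    return len(question_grams & corpus_grams) / len(question_grams)

if __name__ == "__main__":
    corpus = "the quick brown fox jumps over the lazy dog near the quiet river bank at dawn"
    question = "the quick brown fox jumps over the lazy dog near the quiet river"
    # A high ratio suggests the question was likely seen during training.
    print(f"overlap: {overlap_ratio(question, corpus):.2f}")
```

A high overlap score does not prove a model memorized the answer, but it flags questions whose wording a model has plausibly already seen, which is exactly the doubt hanging over today's benchmark results.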
Emerging Complications
Several key issues complicate the testing landscape:
Data Collection Evolution: Some experts advocate for "embodied AI" solutions, where systems learn through real-world interactions. Tesla's autonomous vehicles and Meta's Ray-Ban smart glasses exemplify this approach, collecting real-world data through sensors and cameras.
Intelligence Definition: The fundamental challenge of defining and measuring intelligence, particularly Artificial General Intelligence (AGI), remains. Traditional IQ tests have long been criticized for their narrow scope, and AI faces similar limitations in its evaluation metrics.
New Testing Approaches
The field is seeing innovative attempts to create more comprehensive testing methods:
The ARC Solution
François Chollet's "abstraction and reasoning corpus" (ARC) represents a notable advance in AI testing. Unlike traditional benchmarks, ARC tests an AI's ability to infer an abstract rule from a handful of example grids and apply it to a puzzle it has never seen before, rather than rewarding knowledge that may have been memorized during training.
Current results are telling: leading LLMs such as OpenAI's latest models and Anthropic's Claude 3.5 Sonnet score only around 21% on the ARC public leaderboard. Even with more sophisticated approaches reaching 50%, these scores remain well below human performance of over 90%.
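To make the benchmark concrete, here is a rough sketch of how an ARC-style task is structured and scored: each task provides a few demonstration input/output grid pairs, and a solver must produce the exact output grid for a held-out test input. The toy task and the hand-written rule below are illustrative assumptions, not real ARC data; a genuine solver would have to infer the transformation from the demonstration pairs rather than hard-coding it.

```python
# Rough sketch of an ARC-style task: a few demonstration input/output grids,
# plus a held-out test pair that must be matched cell-for-cell.
# The toy task and the trivial solver are illustrative, not real ARC data.

toy_task = {
    "train": [
        {"input": [[0, 1], [1, 0]], "output": [[1, 0], [0, 1]]},
        {"input": [[2, 0], [0, 2]], "output": [[0, 2], [2, 0]]},
    ],
    "test": [
        {"input": [[3, 0], [0, 3]], "output": [[0, 3], [3, 0]]},
    ],
}

def swap_rows_solver(grid):
    """A hand-written rule for this toy task: reverse the order of the rows.
    A real ARC solver must discover such a rule from the train pairs alone."""
    return grid[::-1]

def score(task, solver) -> float:
    """Fraction of test pairs where the predicted grid matches exactly."""
    pairs = task["test"]
    correct = sum(solver(p["input"]) == p["output"] for p in pairs)
    return correct / len(pairs)

print(score(toy_task, swap_rows_solver))  # 1.0 on this toy task
```

The exact-match scoring is what makes ARC hard for language models: partial credit and plausible-sounding answers count for nothing, so memorized text is of little help.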
Scale/CAIS Initiative
The Humanity's Last Exam project takes a different approach: it crowdsources questions from the public, with cash prizes for the best submissions, puts them through peer review, and withholds at least some questions from publication so that future models cannot simply absorb the answers during training.
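The key design choice implied by that approach can be sketched as follows: keep a private answer key off the public internet and grade model outputs against it only at evaluation time. The grading function below is a hypothetical illustration under that assumption, not the project's actual harness.

```python
# Hypothetical sketch of grading against a privately held answer key, so the
# answers never appear in public training data. Not the project's actual harness.

from dataclasses import dataclass

@dataclass
class Item:
    question: str
    answer: str  # kept private, never published

def normalize(text: str) -> str:
    return " ".join(text.strip().lower().split())

def grade(items, model_answers) -> float:
    """Exact-match accuracy of model answers against the private key."""
    correct = sum(
        normalize(model_answers.get(item.question, "")) == normalize(item.answer)
        for item in items
    )
    return correct / len(items) if items else 0.0

# Example usage with made-up items:
private_set = [Item("What is 2 + 2?", "4"), Item("Capital of France?", "Paris")]
print(grade(private_set, {"What is 2 + 2?": "4", "Capital of France?": "paris"}))  # 1.0
```

Keeping the key private trades public reproducibility for confidence that a high score reflects reasoning rather than recall, which is the whole point of the exercise.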
Looking Ahead
As AI systems continue to advance, the challenge of testing them becomes increasingly complex. The field must not only develop ways to measure current AI capabilities but also prepare to test potentially superintelligent systems, a challenge that pushes the boundaries of our current understanding of intelligence assessment.
The ongoing efforts to create new testing methodologies reflect a crucial understanding: as AI systems approach and potentially surpass human-level reasoning, we need robust ways to evaluate their capabilities, with significant implications for safety, ethics, and governance in the AI age.