
ARC-AGI-3 Just Proved the Uncomfortable Truth -- Best AI Scores 0.37%, Humans Get 100%

The week's most sobering news. ARC-AGI-3 benchmark results show GPT-5.4, Gemini 3.1 Pro, and Claude Opus 4.6 all scoring below 1%. Untrained humans? 100%. What does this say about AI intelligence?

[Figure: ARC-AGI-3 benchmark results comparison. Source: ARC Prize]

0.37%

One number defined AI this week. That is the score the best AI system, Google's Gemini 3.1 Pro, achieved on ARC-AGI-3, the newest benchmark designed to test genuine intelligence. Untrained humans who had never seen the test before? They scored 100%.

Let that sink in. The gap is not incremental. It is a chasm that suggests everything we have been measuring about AI "intelligence" might be measuring the wrong thing entirely.

The Backstory

ARC-AGI was created by Francois Chollet, the Keras creator who wrote the influential 2019 paper "On the Measure of Intelligence." His core argument is deceptively simple: real intelligence is the ability to adapt to novel situations, not the ability to regurgitate patterns from training data.

ARC-AGI-1 launched in 2019 as a set of colorful grid puzzles requiring pattern recognition and transformation. AI systems started near 0%, but by late 2024, OpenAI's o3 model hit 75.7%, prompting breathless headlines about approaching AGI.

But Chollet's team kept raising the bar. ARC-AGI-2 brought harder static puzzles. The real paradigm shift arrived this week with ARC-AGI-3.

| Version | Format | Best AI Score | Human Score | Key Difference |
| --- | --- | --- | --- | --- |
| ARC-AGI-1 (2019) | Static grid puzzles | 75.7% (o3) | 96% | Pattern recognition |
| ARC-AGI-2 (2025) | Harder static puzzles | 30%+ | 95%+ | Complex reasoning |
| ARC-AGI-3 (2026) | Interactive environments | 0.37% (Gemini 3.1 Pro) | 100% | Explore + learn + adapt |

Here is the deal: ARC-AGI-3 completely changed the format from static puzzles to interactive environments. Previous versions showed an image and asked for an answer. ARC-AGI-3 requires the AI to interact with an environment, form hypotheses, run experiments, observe results, and adjust its strategy accordingly.

Think of it this way. Previous versions were like a multiple-choice exam. ARC-AGI-3 is like walking into a lab you have never seen before and having to design experiments from scratch.
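
To make the shift concrete, here is a minimal sketch of what that loop looks like. The env/agent interface (reset, step, choose, update) is a hypothetical RL-style API, not the benchmark's actual one; the point is the shape of the interaction.

```python
# Minimal sketch of the ARC-AGI-3 interaction pattern. The env/agent API
# here (reset, step, choose, update) is assumed for illustration.

def play_episode(env, agent, max_steps=1000):
    """Run one interactive episode: the agent must learn the rules as it goes."""
    obs = env.reset()          # initial grid; the rules are unknown
    reward = 0.0
    for _ in range(max_steps):
        action = agent.choose(obs)                   # pick an experiment
        obs_next, reward, done = env.step(action)    # observe the result
        agent.update(obs, action, obs_next, reward)  # revise hypotheses
        obs = obs_next
        if done:
            break
    return reward
```

Nothing in this loop can be answered in one forward pass; whatever the agent believed at step one has to change as evidence accumulates.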

Breaking Down the Results

Every Frontier Model Failed

The results are not just bad. They are uniformly catastrophic across every major AI lab.

| Model | Score | Developer |
| --- | --- | --- |
| Gemini 3.1 Pro | 0.37% | Google |
| GPT-5.4 | 0.26% | OpenAI |
| Claude Opus 4.6 | 0.25% | Anthropic |
| Untrained human | 100% | -- |

The gap between 0.37% and 100% is not a "needs improvement" situation. It signals that a fundamentally different approach is needed.

Structured Search Crushed LLMs

Here is where it gets interesting. During the preview phase, a CNN-based agent using explicit graph search, state tracking, and systematic exploration outperformed every frontier language model by more than 12 percentage points. The best preview-phase score was 12.58%.

This is direct evidence against the "just scale it up" thesis. A system that methodically explores and tracks state beat models with trillions of parameters.
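
For intuition, here is a stripped-down sketch of that style of agent: plain breadth-first search with an explicit visited set. The environment, its simulate method, and the action list are all assumptions for illustration, and the CNN the preview-phase agent used to prune actions is omitted.

```python
from collections import deque

def hash_state(grid):
    # ARC grids are small, so a tuple-of-tuples is a cheap hashable key.
    return tuple(map(tuple, grid))

def bfs_solve(env, actions, max_depth=20):
    """Systematically explore action sequences, never revisiting a state."""
    start = env.reset()
    frontier = deque([(start, [])])      # (state, action sequence so far)
    visited = {hash_state(start)}        # explicit state tracking
    while frontier:
        state, path = frontier.popleft()
        if len(path) >= max_depth:
            continue
        for action in actions:           # systematic, not sampled
            nxt, done = env.simulate(state, action)  # assumed simulator hook
            key = hash_state(nxt)
            if key in visited:
                continue                 # skip states already explored
            visited.add(key)
            if done:                     # goal reached: return the plan
                return path + [action]
            frontier.append((nxt, path + [action]))
    return None                          # exhausted the search budget
```

Nothing in this agent is learned at scale. The visited set and the exhaustive action loop are exactly the "state tracking" and "systematic exploration" the LLMs lacked.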

The Prize Money

ARC Prize 2026 has put up $2 million in total prizes. The ARC-AGI-3 track alone carries $850,000, with a $700,000 grand prize for the first agent to score 100%. Nobody has come close.

The Bigger Picture

This week offered a stark contrast. On one side, Reflection AI was reportedly seeking a $25 billion valuation. On the other, the most basic test of intelligence -- adapting to new situations -- showed AI scoring 0.37%.

This collides head-on with the funding frenzy that has defined AI since 2024. Venture capital has poured hundreds of billions into AI startups. In February 2026 alone, global startup funding hit $189 billion, a record, with 83% going to just three AI companies.

But ARC-AGI-3 makes one thing clear: the current LLM paradigm -- bigger models, more data, more compute -- may not lead to genuine intelligence.

This is the context that makes Yann LeCun's AMI Labs so interesting. He raised $1.03 billion, the largest European seed round ever, to build "world models" based on his JEPA (Joint Embedding Predictive Architecture) framework. LeCun has been arguing for years that LLMs are fundamentally limited, and ARC-AGI-3 just handed him powerful ammunition.

What This Means for You

If you build with AI, three takeaways matter.

First, LLMs are not general intelligence. They excel at pattern matching, text generation, and code writing, but they cannot learn and adapt in novel environments. Design your AI products with this limitation in mind.

Second, agent design needs rethinking. Simply connecting an LLM to tools is not enough. The results show that structured exploration, state tracking, and hypothesis testing -- architectural choices rather than scale choices -- can outperform raw model capability.
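
As a toy illustration (the rules and numbers here are invented for the example), hypothesis testing can be as simple as keeping a list of candidate rules and discarding whatever the latest observation contradicts:

```python
# Toy hypothesis elimination. All rules and values are invented for the
# example; the pattern, not the content, is the point.

def eliminate(hypotheses, state, action, observed_next):
    """Keep only the candidate rules consistent with what actually happened."""
    return [h for h in hypotheses if h(state, action) == observed_next]

# Three guesses about what action "A" does to a counter.
hypotheses = [
    lambda s, a: s + 1,   # "A increments"
    lambda s, a: s * 2,   # "A doubles"
    lambda s, a: s,       # "A does nothing"
]

# One experiment: from state 3, action "A" produced 4.
hypotheses = eliminate(hypotheses, 3, "A", 4)
print(len(hypotheses))    # 1 -- only "A increments" survives
```

Each experiment shrinks the hypothesis space; the architecture, not the parameter count, is doing the work.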

Third, be skeptical of benchmarks. A 90% score on MMLU does not mean an AI is smart. Being good at tests and being genuinely intelligent are very different things.

There is a $700,000 grand prize waiting for whoever cracks this. When an AI can solve what any untrained human solves at 100%, that will be a real step toward AGI. We are not there yet.
