
ARC-AGI-3 Just Proved the Uncomfortable Truth -- Best AI Scores 0.37%, Humans Get 100%

The week's most sobering news. ARC-AGI-3 benchmark results show GPT-5.4, Gemini 3.1 Pro, and Claude Opus 4.6 all scoring below 1%. Untrained humans? 100%. What does this say about AI intelligence?

[Figure: ARC-AGI-3 benchmark results comparison. Source: ARC Prize]

0.37%

One number defined AI this week. That is the score the best AI system, Google's Gemini 3.1 Pro, achieved on ARC-AGI-3, the newest benchmark designed to test genuine intelligence. Untrained humans who had never seen the test before? They scored 100%.

Let that sink in. The gap is not incremental. It is a chasm that suggests everything we have been measuring about AI "intelligence" might be measuring the wrong thing entirely.

The Backstory

ARC-AGI was created by Francois Chollet, the Keras creator who wrote the influential 2019 paper "On the Measure of Intelligence." His core argument is deceptively simple: real intelligence is the ability to adapt to novel situations, not the ability to regurgitate patterns from training data.

ARC-AGI-1 launched in 2019 as a set of colorful grid puzzles requiring pattern recognition and transformation. AI systems started near 0%, but by late 2024, OpenAI's o3 model hit 75.7%, prompting breathless headlines about approaching AGI.

But Chollet's team kept raising the bar. ARC-AGI-2 brought harder static puzzles. The real paradigm shift arrived this week with ARC-AGI-3.

| Version | Format | Best AI Score | Human Score | Key Difference |
| --- | --- | --- | --- | --- |
| ARC-AGI-1 (2019) | Static grid puzzles | 75.7% (o3) | 96% | Pattern recognition |
| ARC-AGI-2 (2025) | Harder static puzzles | 30%+ | 95%+ | Complex reasoning |
| ARC-AGI-3 (2026) | Interactive environments | 0.37% (Gemini 3.1 Pro) | 100% | Explore + learn + adapt |

Here is the deal: ARC-AGI-3 completely changed the format from static puzzles to interactive environments. Previous versions showed an image and asked for an answer. ARC-AGI-3 requires the AI to interact with an environment, form hypotheses, run experiments, observe results, and adjust its strategy accordingly.

Think of it this way. Previous versions were like a multiple-choice exam. ARC-AGI-3 is like walking into a lab you have never seen before and having to design experiments from scratch.
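
To make the shift concrete, here is a minimal sketch of what that loop looks like. The env/agent interface (reset, step, choose, update) is a hypothetical RL-style API, not the benchmark's actual one; the point is the shape of the interaction.

```python
# Minimal sketch of the ARC-AGI-3 interaction pattern. The env/agent API
# here (reset, step, choose, update) is assumed for illustration.

def play_episode(env, agent, max_steps=1000):
    """Run one interactive episode: the agent must learn the rules as it goes."""
    obs = env.reset()          # initial grid; the rules are unknown
    reward = 0.0
    for _ in range(max_steps):
        action = agent.choose(obs)                   # pick an experiment
        obs_next, reward, done = env.step(action)    # observe the result
        agent.update(obs, action, obs_next, reward)  # revise hypotheses
        obs = obs_next
        if done:
            break
    return reward
```

Nothing in this loop can be answered in one forward pass; whatever the agent believed at step one has to change as evidence accumulates.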

Breaking Down the Results

Every Frontier Model Failed

The results are not just bad. They are uniformly catastrophic across every major AI lab.

| Model | Score | Developer |
| --- | --- | --- |
| Gemini 3.1 Pro | 0.37% | Google |
| GPT-5.4 | 0.26% | OpenAI |
| Claude Opus 4.6 | 0.25% | Anthropic |
| Untrained human | 100% | -- |

The gap between 0.37% and 100% is not a "needs improvement" situation. It signals that a fundamentally different approach is needed.

Structured Search Crushed LLMs

Here is where it gets interesting. During the preview phase, a CNN-based agent using explicit graph search, state tracking, and systematic exploration outperformed every frontier language model by more than 12 percentage points. The best preview-phase score was 12.58%.

This is direct evidence against the "just scale it up" thesis. A system that methodically explores and tracks state beat models with trillions of parameters.
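
For intuition, here is a stripped-down sketch of that style of agent: plain breadth-first search with an explicit visited set. The environment, its simulate method, and the action list are all assumptions for illustration, and the CNN the preview-phase agent used to prune actions is omitted.

```python
from collections import deque

def hash_state(grid):
    # ARC grids are small, so a tuple-of-tuples is a cheap hashable key.
    return tuple(map(tuple, grid))

def bfs_solve(env, actions, max_depth=20):
    """Systematically explore action sequences, never revisiting a state."""
    start = env.reset()
    frontier = deque([(start, [])])      # (state, action sequence so far)
    visited = {hash_state(start)}        # explicit state tracking
    while frontier:
        state, path = frontier.popleft()
        if len(path) >= max_depth:
            continue
        for action in actions:           # systematic, not sampled
            nxt, done = env.simulate(state, action)  # assumed simulator hook
            key = hash_state(nxt)
            if key in visited:
                continue                 # skip states already explored
            visited.add(key)
            if done:                     # goal reached: return the plan
                return path + [action]
            frontier.append((nxt, path + [action]))
    return None                          # exhausted the search budget
```

Nothing in this agent is learned at scale. The visited set and the exhaustive action loop are exactly the "state tracking" and "systematic exploration" the LLMs lacked.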

The Prize Money

ARC Prize 2026 has put up $2 million in total prizes. The ARC-AGI-3 track alone carries $850,000, with a $700,000 grand prize for the first agent to score 100%. Nobody has come close.

The Bigger Picture

This week offered a stark contrast. On one side, Reflection AI was reportedly seeking a $25 billion valuation. On the other, the most basic test of intelligence -- adapting to new situations -- showed AI scoring 0.37%.

This collides head-on with the funding frenzy that has defined AI since 2024. Venture capital has poured hundreds of billions into AI startups. In February 2026 alone, global startup funding hit $189 billion, a record, with 83% going to just three AI companies.

But ARC-AGI-3 makes one thing clear: the current LLM paradigm -- bigger models, more data, more compute -- may not lead to genuine intelligence.

This is the context that makes Yann LeCun's AMI Labs so interesting. He raised $1.03 billion, the largest European seed round ever, to build "world models" based on his JEPA (Joint Embedding Predictive Architecture) framework. LeCun has been arguing for years that LLMs are fundamentally limited, and ARC-AGI-3 just handed him powerful ammunition.

What This Means for You

If you build with AI, three takeaways matter.

First, LLMs are not general intelligence. They excel at pattern matching, text generation, and code writing, but they cannot learn and adapt in novel environments. Design your AI products with this limitation in mind.

Second, agent design needs rethinking. Simply connecting an LLM to tools is not enough. The results show that structured exploration, state tracking, and hypothesis testing -- architectural choices rather than scale choices -- can outperform raw model capability.
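
As a toy illustration (the rules and numbers here are invented for the example), hypothesis testing can be as simple as keeping a list of candidate rules and discarding whatever the latest observation contradicts:

```python
# Toy hypothesis elimination. All rules and values are invented for the
# example; the pattern, not the content, is the point.

def eliminate(hypotheses, state, action, observed_next):
    """Keep only the candidate rules consistent with what actually happened."""
    return [h for h in hypotheses if h(state, action) == observed_next]

# Three guesses about what action "A" does to a counter.
hypotheses = [
    lambda s, a: s + 1,   # "A increments"
    lambda s, a: s * 2,   # "A doubles"
    lambda s, a: s,       # "A does nothing"
]

# One experiment: from state 3, action "A" produced 4.
hypotheses = eliminate(hypotheses, 3, "A", 4)
print(len(hypotheses))    # 1 -- only "A increments" survives
```

Each experiment shrinks the hypothesis space; the architecture, not the parameter count, is doing the work.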

Third, be skeptical of benchmarks. A 90% score on MMLU does not mean an AI is smart. Being good at tests and being genuinely intelligent are very different things.

There is a $700,000 grand prize waiting for whoever cracks this. When an AI can solve what any untrained human solves at 100%, that will be a real step toward AGI. We are not there yet.
