
ARC-AGI-3 Drops and Every Frontier Model Scores Below 1%

The ARC Prize Foundation's new benchmark exposes a humbling gap: humans solve 100% of tasks while the best AI manages just 0.37%.

8 min read · ARC-AGI-3
ARC-AGI-3 benchmark results
Source: ARC Prize Foundation

The Benchmark That Humbled Every Frontier Model

On March 24, 2026, the ARC Prize Foundation released ARC-AGI-3—and the results were not what the AI industry expected. Every single frontier model tested scored below 1%. Gemini 3.1 Pro reached the top with just 0.37%. GPT-5.4 managed 0.26%. Claude Opus 4.6 hit 0.25%. Grok-4.20 scored 0.00%. Meanwhile, untrained humans solved 100% of the tasks.

Here's the deal: this isn't a statistical glitch or a metric that favors organic intelligence. This is a genuine, structural gap. And it raises an uncomfortable question for everyone betting on AI: if our best models can't come close to humans on abstract reasoning, how far are we really from genuine artificial general intelligence?

What Is ARC-AGI, and Why Does It Matter?

The Abstraction and Reasoning Corpus (ARC) is not a new concept. François Chollet, the AI researcher behind the benchmark, designed it to measure what he calls "intelligence"—not pattern recognition or memorization, but the ability to identify underlying patterns in abstract visual tasks and apply them to new, unseen scenarios.

Think of it like this: show a person a visual pattern they've never encountered, ask them to understand the rule, and then apply it to a different scenario. ARC tasks are designed around this exact principle. They're not about having seen something before. They're about reasoning from first principles.

The original ARC, released in 2019, was challenging but not brutal. By 2024–2025, models were starting to achieve decent performance on ARC-AGI-2, with Gemini reaching 77.1%. The AI research community thought the trajectory was clear: models were getting better at abstract reasoning.

Then came ARC-AGI-3, and that narrative collapsed.

The Numbers: A Stark Reality Check

Interactive Reasoning at the Heart of ARC-AGI-3

What changed with version 3? The ARC Prize Foundation leaned harder into interactive reasoning. Instead of just showing a static input–output pair and asking models to infer the pattern, ARC-AGI-3 tasks require a dialogue. Models make predictions, receive feedback, and must adjust their reasoning. This is closer to how humans actually solve novel problems: trial, observation, refinement.

Humans excel at this. We ask clarifying questions. We notice when our hypothesis doesn't fit the data. We adjust. Frontier AI models, trained on static next-token prediction, struggle when the task demands active reasoning loops.
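The predict–observe–revise cycle described above can be sketched in a few lines of code. Everything below is illustrative: the `MockTask` environment, its `feedback` method, and the tiny hypothesis space are invented for this example and do not reflect the actual ARC-AGI-3 harness.

```python
# A minimal sketch of an active reasoning loop: propose, get feedback, revise.
# The MockTask environment and its hypothesis space are invented for
# illustration -- they do not reflect the real ARC-AGI-3 test harness.

class MockTask:
    """Toy interactive task whose hidden rule is y = 2x."""

    def __init__(self):
        self._hidden = lambda x: 2 * x
        self.rounds = 0

    def feedback(self, x, prediction):
        """Report whether the prediction matches the hidden rule at x."""
        self.rounds += 1
        return prediction == self._hidden(x)


def solve_interactively(task, max_rounds=10):
    """Maintain a set of hypotheses; let feedback falsify them one by one."""
    candidates = [lambda x, k=k: k * x for k in range(1, 6)]  # guesses y = kx
    for x in range(1, max_rounds + 1):
        if len(candidates) == 1:
            break  # a single consistent hypothesis survives
        guess = candidates[0](x)
        if task.feedback(x, guess):
            # correct: keep only hypotheses that agree with the guess at x
            candidates = [c for c in candidates if c(x) == guess]
        else:
            # wrong: keep only hypotheses that disagree with the guess at x
            candidates = [c for c in candidates if c(x) != guess]
    # apply the surviving hypothesis to a novel input
    return candidates[0](100) if candidates else None


print(solve_interactively(MockTask()))  # -> 200 (survivor y = 2x applied to 100)
```

The structural difference from one-shot inference is the loop itself: the candidate set shrinks in response to feedback rather than being committed to in a single forward pass.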

Leaderboard Results

Model                                  Score        Percentage
Untrained Humans                       100/100      100.00%
Gemini 3.1 Pro                         0.37/100     0.37%
GPT-5.4                                0.26/100     0.26%
Claude Opus 4.6                        0.25/100     0.25%
Grok-4.20                              0/100        0.00%
Preview Best (Interactive Reasoning)   12.58/100    12.58%

The "Preview Best" result is interesting. During a controlled preview phase, systems optimized for the interactive reasoning format achieved 12.58%—still a vast distance from human performance, but roughly 34 times better than the launch leaderboard results. This suggests that adaptive, iterative approaches hold more promise than the static inference models have demonstrated so far.
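For the record, the multiplier comes straight from the leaderboard figures quoted above:

```python
# Ratio of the preview-phase best to the launch leaderboard leader.
launch_best = 0.37    # Gemini 3.1 Pro, launch leaderboard (%)
preview_best = 12.58  # best preview-phase interactive system (%)
print(round(preview_best / launch_best, 1))  # -> 34.0
```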

The Bigger Picture: What This Really Means

For years, AI researchers have debated whether benchmarks actually measure what they claim to measure. ARC-AGI-3 forces a reckoning. This isn't a narrow test of image classification or language fluency. It's a direct comparison of abstract reasoning ability—the kind of thinking that humans use to solve completely novel problems.

The 100% human baseline is not inflated. These are untrained people who've never seen these specific tasks before. They're reasoning from scratch, pattern-matching, making hypotheses, testing them. And they're doing it at a scale that models cannot currently approach.

"The gap between 0.37% and 100% is not a matter of tuning hyperparameters or adding more data. It represents a fundamental difference in how humans and machines approach novel abstract problems. The question is whether that gap is narrowable at all, or whether it points to something structurally different about human cognition."

This blockquote captures the existential concern: Are we building the right kind of models for AGI, or are we hitting a ceiling that architecture and scale alone cannot overcome?

The ARC Prize Foundation isn't being gratuitously harsh. They're offering $2 million to any team that can build a system matching untrained human performance. That's a real financial incentive. The foundation is betting that the problem is solvable—and also that it's genuinely hard. No one has claimed the prize yet.

What Needs to Change?

The interactive reasoning format hints at part of the answer. Humans don't solve novel problems in a single feedforward pass. We iterate. We make mistakes and learn from them. We ask questions. Current frontier models are optimized for one-shot inference—consume input, generate output. That architecture is fundamentally different from the recursive, error-correcting loops that human reasoning employs.

Some researchers are already exploring this territory. Reinforcement learning frameworks, chain-of-thought reasoning, and tool-use agents all represent attempts to move beyond pure next-token prediction. But ARC-AGI-3 suggests these incremental improvements may not be enough. We might need new model architectures, new training paradigms, or fundamentally different approaches to reasoning altogether.

The preview best result (12.58%) offers a ray of hope. It shows that when systems are designed with interactive reasoning in mind—with feedback loops and the ability to revise hypotheses—performance jumps dramatically. Whatever approach wins the $2 million prize will likely involve:

  • Iterative refinement of hypotheses rather than single-pass inference
  • Ability to incorporate feedback and adjust reasoning chains
  • Pattern recognition that generalizes to genuinely novel scenarios
  • Explicit reasoning about abstract rule structures rather than implicit pattern memorization
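To make the last bullet concrete, here's a hedged sketch of what "explicit reasoning about abstract rule structures" can look like: candidate rules are small named programs over grids, example pairs falsify the inconsistent ones, and a survivor is applied to a novel input. The three-rule inventory and the grid encoding are invented for illustration and are far simpler than real ARC tasks.

```python
# Represent candidate rules as explicit programs over 2-D grids, keep those
# consistent with every example pair, and apply a survivor to novel input.
# The rule inventory and grid encoding are illustrative, not ARC's spec.

def flip_h(grid):
    return [row[::-1] for row in grid]

def flip_v(grid):
    return grid[::-1]

def transpose(grid):
    return [list(col) for col in zip(*grid)]

RULES = {"flip_h": flip_h, "flip_v": flip_v, "transpose": transpose}

def induce_rules(examples):
    """Names of rules consistent with every (input, output) example pair."""
    return [name for name, fn in RULES.items()
            if all(fn(x) == y for x, y in examples)]

examples = [([[1, 2], [3, 4]], [[2, 1], [4, 3]])]  # output = horizontal flip
survivors = induce_rules(examples)                  # ["flip_h"]
prediction = RULES[survivors[0]]([[5, 6], [7, 8]])  # -> [[6, 5], [8, 7]]
print(survivors, prediction)
```

The point of the named-rule representation is that the induced rule generalizes to any grid, not just the grids seen during induction—memorizing the example pair alone would not produce `prediction`.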

Another crucial factor: the models tested on the leaderboard are primarily vision–language models. They take visual inputs, process them through transformer layers, and produce text outputs. Human reasoning isn't bound by a single modality. We blend vision, language, spatial reasoning, analogy, and more. It's possible that architectures able to flexibly combine these reasoning modes would perform better. But that's speculation. The results are what they are: 0.37% is 0.37%.

Context: Why ARC-AGI-2 Performance Doesn't Transfer

One natural question: why such a dramatic drop from ARC-AGI-2 to ARC-AGI-3? The benchmark didn't just get harder—it changed the game fundamentally. ARC-AGI-2 allowed models to study pairs of input–output examples and infer the rule. Gemini's 77.1% performance came from scaling up models and using diverse training data. That worked for static pattern inference.

ARC-AGI-3 introduces interactive feedback. Models can no longer rely on pattern matching across examples. They must reason about abstract rules in real time, revise their understanding based on feedback, and apply learned rules to completely novel scenarios in the same interaction. This isn't a 10% harder version of ARC-AGI-2. It's a different kind of problem entirely.

That's actually the point. The ARC Prize Foundation is not trying to create a winnable benchmark. They're trying to create a benchmark that measures something genuinely difficult—something that separates humans from machines on a fundamental capability rather than on speed or pattern recognition in a narrow domain.

The Prize and the Timeline

The $2 million prize pool sits unclaimed. To win, a team needs to build a system that reaches 100% on the test set without being trained on it. That's the public definition. What's unstated is the timeline ambiguity: how long does the prize remain open? Is this a benchmark for 2026, 2030, 2040?

Given the current state of the art, hitting 100% feels decades away, not years. But frontier AI moves fast. Unexpected breakthroughs happen. Scaling laws break. New architectures emerge. In 2020, if you'd told researchers that a language model could write functional code, most would have said "not for a decade." GPT-4 did it in three years.

Still, 0.37% to 100% is not a typical scaling problem. It's a ceiling crash.

What Happens Next?

ARC-AGI-3 is now the benchmark that matters for anyone seriously thinking about AGI. It's not perfect—no benchmark is—but it's harder to game than anything else currently available. You can't train on it without poisoning the test set. You can't brute-force it with scale alone. You have to actually build something that reasons.

In the weeks and months following the March 24 launch, expect:

  • Teams diving into the interactive reasoning space, building agents that can iterate and refine hypotheses
  • Researchers publishing papers on novel architectures designed specifically for abstract reasoning
  • Scaling laws being tested against this benchmark to see if they still hold
  • Potential debate about whether 0.37% is a real ceiling or just the starting point of a new scaling curve

The real question is philosophical: does this benchmark measure intelligence, or does it measure something else—something that humans happen to be good at but that future AI systems might solve through orthogonal methods we haven't imagined?

That debate is healthy. But the raw numbers are undeniable. Humans 100%, frontier models below 1%. The gap is real, and closing it will require more than incremental improvements to existing approaches.

