ARC-AGI-3: Why This Benchmark Is Different From Everything Else

Most AI benchmarks measure what a model has memorized. The ARC Prize series measures something different: whether an AI can reason about problems it has never seen before. And with ARC-AGI-3, that question gets substantially harder — the benchmark has evolved from grid puzzles to interactive agentic tasks in novel environments.

I've been participating in ARC Prize 2026 on Kaggle. Here's what I've learned about why ARC-AGI-3 is a meaningful step change, what makes it so hard, and where the frontier is.

A Quick History: ARC-AGI-1 and ARC-AGI-2

The original ARC (Abstraction and Reasoning Corpus), designed by François Chollet in 2019, presented models with grid transformation puzzles: given 2–5 input/output grid pairs as examples, figure out the transformation rule and apply it to a new input. The grids were small (up to 30×30), colors limited (10 values), and rules consistent — but the rules themselves could be anything (symmetry, counting, object manipulation, spatial reasoning).
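
To make the format concrete, here is a toy ARC-AGI-1-style task in Python. The grids and the mirroring rule are invented for illustration: a candidate rule counts as a solution only if it reproduces every train pair, and is then applied to the test input.

```python
# Toy ARC-style task: train pairs demonstrate a transformation
# (here, horizontal mirroring). Grids are lists of rows; cell
# values 0-9 stand for the 10 colors.

def mirror_horizontal(grid):
    """Candidate rule: flip each row left-to-right."""
    return [row[::-1] for row in grid]

train_pairs = [
    ([[1, 0], [2, 3]], [[0, 1], [3, 2]]),
    ([[5, 5, 0]], [[0, 5, 5]]),
]

def rule_fits(rule, pairs):
    """A rule is accepted only if it explains every train pair."""
    return all(rule(inp) == out for inp, out in pairs)

if rule_fits(mirror_horizontal, train_pairs):
    test_input = [[7, 0, 4]]
    print(mirror_horizontal(test_input))  # [[4, 0, 7]]
```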

Even GPT-4, trained on essentially the entire internet, struggled to score above 30% on the original ARC. ARC-AGI-2 refined the format, making tasks harder and evaluation stricter; models fared worse still.

ARC-AGI-3 is a fundamentally different kind of challenge.

What ARC-AGI-3 Actually Tests

ARC-AGI-3 moves from passive pattern recognition to interactive agentic reasoning. Instead of looking at grid examples and predicting a transformation, the agent must:

  1. Explore an unfamiliar environment that comes with no instructions.
  2. Infer the environment's rules and goal from the feedback its actions produce.
  3. Plan and act to achieve that goal.

This is much closer to what we mean when we talk about general intelligence: not "apply the learned rule" but "figure out what the rules even are, then use them." The evaluation is correspondingly stricter — partial credit is rare, and approximate reasoning scores zero.

Why This Is Hard for Current AI Systems

The difficulty isn't just increased task complexity. ARC-AGI-3 targets specific failure modes of current LLMs and RL agents: reliance on memorized patterns rather than on-the-fly rule inference, poor exploration under sparse feedback, and brittle planning over long horizons in genuinely novel settings.

The Landscape of Approaches

World Model-Based Agents

The leading approaches in the ARC Prize 2026 preview competition (held July–August 2025) used agents that explicitly build and maintain internal world models. Rather than mapping observations directly to actions, these agents maintain hypotheses about the environment's dynamics and update them as they explore.
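
A deliberately tiny Python sketch of that idea (the environment and hypothesis names are invented, not the competition API): keep a set of candidate transition functions and discard any that mispredict an observed transition.

```python
# Hypothesis pruning over environment dynamics. Each hypothesis is a
# candidate transition function for a 1-D position; observing real
# transitions eliminates the hypotheses that mispredict.

def moves_right(pos, action):
    return pos + 1 if action == "go" else pos

def moves_left(pos, action):
    return pos - 1 if action == "go" else pos

def stays_put(pos, action):
    return pos

hypotheses = {"right": moves_right, "left": moves_left, "stay": stays_put}

def prune(hypotheses, transitions):
    """Keep only hypotheses consistent with every observed transition."""
    return {
        name: h for name, h in hypotheses.items()
        if all(h(s, a) == s2 for s, a, s2 in transitions)
    }

# One observed transition: taking "go" at position 3 led to position 4.
surviving = prune(hypotheses, [(3, "go", 4)])
print(sorted(surviving))  # ['right']
```

A single informative observation here collapses three hypotheses to one; in practice the agent would choose actions precisely because they discriminate between surviving hypotheses.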

The winner of the preview competition was StochasticGoose (Tufa Labs) with 12.58% accuracy across both the public and the hidden private environments — a strong showing given how hard the tasks are, but leaving enormous room for improvement.

Test-Time Adaptation

Some approaches use test-time compute to adapt to each specific environment before attempting the task: spending inference budget on active exploration, building a compressed representation of the environment's rules, then using that representation to act efficiently.
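
A minimal sketch of that two-phase budget, using an invented toy environment rather than the real ARC-AGI-3 interface: spend part of the step budget exploring to record transitions, then plan over the learned model.

```python
import random
from collections import deque

class ToyEnv:
    """Five-cell corridor; 'L'/'R' move one cell, clamped to [0, 4]."""
    def __init__(self):
        self.pos = 0
    def step(self, action):
        self.pos = max(0, min(4, self.pos + (1 if action == "R" else -1)))
        return self.pos

env = ToyEnv()
rng = random.Random(0)
model = {}          # learned (state, action) -> next state
state = env.pos
for _ in range(50):  # exploration phase: spend budget recording dynamics
    action = rng.choice(["L", "R"])
    next_state = env.step(action)
    model[(state, action)] = next_state
    state = next_state

def plan(model, start, goal):
    """Breadth-first search over the learned transition model."""
    frontier, seen = deque([(start, [])]), {start}
    while frontier:
        s, path = frontier.popleft()
        if s == goal:
            return path
        for (s0, a), s1 in model.items():
            if s0 == s and s1 not in seen:
                seen.add(s1)
                frontier.append((s1, path + [a]))
    return None

path_to_goal = plan(model, 0, 4)  # exploitation phase
```

The learned `model` dict is the "compressed representation" in miniature: once it covers the reachable transitions, acting reduces to cheap graph search.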

Program Synthesis (Inherited from ARC-AGI-1/2)

Program synthesis approaches — which worked well on ARC-AGI-1/2 by searching for a program in a DSL that maps inputs to outputs — are less directly applicable to ARC-AGI-3's interactive setting, but the idea of maintaining a symbolic hypothesis about environment rules is still relevant.
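
As a toy illustration of the DSL-search idea (the primitives and task below are invented for the sketch): enumerate short compositions of grid operations and keep the first program that explains every train pair.

```python
from itertools import product

# A three-primitive "DSL" over grids (lists of rows).
PRIMITIVES = {
    "flip_h": lambda g: [row[::-1] for row in g],
    "flip_v": lambda g: g[::-1],
    "identity": lambda g: g,
}

def run(program, grid):
    """Apply a sequence of primitive names left to right."""
    for name in program:
        grid = PRIMITIVES[name](grid)
    return grid

def search(pairs, max_len=2):
    """Return the first program (tuple of primitive names) that maps
    every input grid to its output grid, shortest programs first."""
    for length in range(1, max_len + 1):
        for program in product(PRIMITIVES, repeat=length):
            if all(run(program, i) == o for i, o in pairs):
                return program
    return None

pairs = [([[1, 2], [3, 4]], [[4, 3], [2, 1]])]  # a 180-degree rotation
print(search(pairs))  # ('flip_h', 'flip_v')
```

Real ARC solvers use far richer DSLs and smarter-than-brute-force search, but the structure is the same: a symbolic hypothesis space plus a consistency check against the examples.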

My Submission: Starting From Zero

My current submission is a random agent baseline — proof that the submission pipeline works. It scores near zero, as expected. Getting the data format, API, and submission infrastructure right is step one.
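
The shape of such a baseline, with placeholder environment/API names rather than the real competition interface: sample a legal action each step until the episode ends, which exercises the action/observation plumbing end to end.

```python
import random

def random_agent(env, max_steps=1000, rng=None):
    """Take uniformly random legal actions until done or out of steps."""
    rng = rng or random.Random(0)
    obs = env.reset()
    for _ in range(max_steps):
        action = rng.choice(env.legal_actions(obs))
        obs, done = env.step(action)
        if done:
            break
    return obs

class DummyEnv:
    """Stand-in environment: the episode ends when the counter hits 3."""
    def reset(self):
        self.t = 0
        return self.t
    def legal_actions(self, obs):
        return ["noop", "act"]
    def step(self, action):
        self.t += 1
        return self.t, self.t >= 3

print(random_agent(DummyEnv()))  # 3
```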

My plan for improving:

  1. Structured exploration. Replace random actions with a principled exploration policy that maximizes information gained about the environment's rules per step.
  2. Hypothesis tracking. Maintain a set of candidate hypotheses about the environment dynamics and prune them as observations come in.
  3. LLM-guided reasoning. Use an LLM to interpret the accumulated observations and propose high-level strategies, grounding abstract reasoning in the actual environment state.
  4. Verification loops. Before committing to a goal-achieving action, simulate it against the current world model to check consistency.
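
Step 4 can be sketched as follows (all names are illustrative, not a real implementation): roll each candidate action through the current world model and commit only if the predicted outcome passes a consistency check.

```python
def verified_action(candidates, state, model, is_safe):
    """Return the first candidate whose predicted outcome passes the
    consistency check against the world model, else None."""
    for action in candidates:
        predicted = model(state, action)
        if is_safe(predicted):
            return action
    return None

# World model for a 1-D position; stepping below 0 would contradict
# the environment's observed bounds, so such actions are rejected.
model = lambda pos, a: pos + (1 if a == "R" else -1)
is_safe = lambda pos: pos >= 0

print(verified_action(["L", "R"], 0, model, is_safe))  # 'R'
```

The same check generalizes to multi-step rollouts: simulate the whole plan against the model and fall back to exploration when no candidate survives.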

Why ARC-AGI-3 Matters Beyond the Competition

The ARC Prize series has always been Chollet's counterpoint to benchmark saturation. Modern LLMs score near-human on MMLU, HellaSwag, ARC-Easy — but primarily by memorizing patterns from training data, not by reasoning about genuinely novel problems.

ARC-AGI-3 is specifically constructed to resist this. The interactive environments are novel by design. Success requires something closer to actual fluid intelligence: the ability to build accurate world models from sparse data, infer goals from context, and act effectively under uncertainty.

The $2M prize pool reflects the difficulty. For those of us working on inference-time reasoning, planning, and agentic AI, ARC-AGI-3 is the right stress test — it forces honest answers to the question of what "understanding" actually means in a computable sense.

My submission: Kaggle — ARC-AGI-3 Random Agent v0
Competition: ARC Prize 2026 on Kaggle
Updates as I improve the approach. Follow @Asg_Wolverine for progress.