Most AI benchmarks measure what a model has memorized. The ARC Prize series measures something different: whether an AI can reason about problems it has never seen before. And with ARC-AGI-3, that question gets substantially harder — the benchmark has evolved from grid puzzles to interactive agentic tasks in novel environments.
I've been participating in ARC Prize 2026 on Kaggle. Here's what I've learned about why ARC-AGI-3 is a meaningful step change, what makes it so hard, and where the frontier is.
A Quick History: ARC-AGI-1 and ARC-AGI-2
The original ARC (Abstraction and Reasoning Corpus), designed by François Chollet in 2019, presented models with grid transformation puzzles: given 2–5 input/output grid pairs as examples, figure out the transformation rule and apply it to a new input. The grids were small (up to 30×30), the colors limited (10 values), and the rules consistent, but the rules themselves could be anything: symmetry, counting, object manipulation, spatial reasoning.
Even GPT-4, trained on essentially the entire internet, struggled to score above 30% on the original ARC. ARC-AGI-2 refined the format with harder tasks and stricter evaluation, and frontier models fared worse still.
ARC-AGI-3 is a fundamentally different kind of challenge.
What ARC-AGI-3 Actually Tests
ARC-AGI-3 moves from passive pattern recognition to interactive agentic reasoning. Instead of looking at grid examples and predicting a transformation, the agent must:
- Explore novel environments it has never encountered before, with no pre-given examples of the rules.
- Acquire goals dynamically — the task objective isn't stated upfront; the agent has to infer it through interaction.
- Build world models on the fly from the consequences of its own actions.
- Learn and adapt continuously within a single episode, not just between training and test.
This is much closer to what we mean when we talk about general intelligence: not "apply the learned rule" but "figure out what the rules even are, then use them." The evaluation is correspondingly stricter — partial credit is rare, and approximate reasoning scores zero.
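As a minimal sketch, the loop those four requirements imply might look like the following. The `ToyEnv` and its hidden "reach 5" goal are invented for illustration; the real ARC-AGI-3 interface differs.

```python
import random

class ToyEnv:
    """Hypothetical stand-in environment: dynamics and goal are hidden,
    so the agent must discover both through interaction."""
    def __init__(self):
        self.state = 0

    def step(self, action):
        self.state += action                 # hidden dynamics
        return self.state, self.state >= 5   # hidden goal: reach 5

def run_episode(env, max_steps=100):
    history = []                             # the agent's self-generated evidence
    obs, done = env.state, False
    while len(history) < max_steps and not done:
        action = random.choice([0, 1])       # placeholder policy
        next_obs, done = env.step(action)
        history.append((obs, action, next_obs))  # fuel for a world model
        obs = next_obs
    return history, done
```

The `history` list is the key object: with no pre-given examples, those self-collected transitions are the only "training data" the agent will ever see.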
Why This Is Hard for Current AI Systems
The difficulty isn't just increased task complexity. ARC-AGI-3 targets specific failure modes of current LLMs and RL agents:
- No in-context examples. ARC-AGI-1/2 gave you demonstrations. ARC-AGI-3 gives you an environment to explore. The agent must generate its own "training data" through interaction.
- Compositional generalization under distribution shift. The environments are specifically constructed to be unlike anything in training. Pattern memorization fails by design.
- Credit assignment across long horizons. The goal may only become clear after dozens of exploratory steps. Standard RL with sparse rewards struggles here.
- Exact correctness. Like its predecessors, ARC-AGI-3 requires precise solutions. "Close enough" scores zero.
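The sparse-reward point is easy to see numerically. In the toy computation below (the numbers are chosen purely for illustration), only the final step of a 40-step episode is rewarded, so the discounted return reaching the first action is heavily diluted, and every early action receives the same signal whether or not it mattered:

```python
gamma = 0.99                       # standard discount factor
T = 40                             # episode length
rewards = [0.0] * (T - 1) + [1.0]  # sparse: signal only at the terminal step

# Discounted return G_t for each step, computed backwards
returns, G = [], 0.0
for r in reversed(rewards):
    G = r + gamma * G
    returns.append(G)
returns.reverse()

# The first action's return is gamma**39, about 0.68, and the same diluted
# signal flows to every step regardless of which action actually mattered.
```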
The Landscape of Approaches
World Model-Based Agents
The leading approaches in the ARC Prize 2026 preview competition (held July–August 2025) used agents that explicitly build and maintain internal world models. Rather than mapping observations directly to actions, these agents maintain hypotheses about the environment's dynamics and update them as they explore.
The winner of the preview competition was StochasticGoose (Tufa Labs) with 12.58% accuracy across both public and private hidden environments — a strong showing given how hard the tasks are, but leaving enormous room for improvement.
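In miniature, "maintain hypotheses and update them as you explore" can look like filtering a candidate set of dynamics models against each observed transition. The three candidate rules below are invented for illustration, not any competitor's actual model class:

```python
# Candidate dynamics hypotheses: how does action a transform state s?
HYPOTHESES = {
    "add": lambda s, a: s + a,
    "subtract": lambda s, a: s - a,
    "ignore": lambda s, a: s,
}

def prune(hypotheses, transition):
    """Keep only hypotheses consistent with one observed (s, a, s') triple."""
    s, a, s2 = transition
    return {name: f for name, f in hypotheses.items() if f(s, a) == s2}

# A single informative interaction can eliminate most candidates:
remaining = prune(HYPOTHESES, (3, 2, 5))   # only "add" predicts 3, 2 -> 5
```

Note that not every interaction is informative: observing the transition `(3, 0, 3)` rules out nothing, which is exactly why exploration policy matters.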
Test-Time Adaptation
Some approaches use test-time compute to adapt to each specific environment before attempting the task: spending inference budget on active exploration, building a compressed representation of the environment's rules, then using that representation to act efficiently.
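One way to sketch that split of inference budget, explore first and compress the rule before acting, assuming a simple additive environment. The `step_fn` signature and the averaging "compression" are my illustration, not any competitor's actual method:

```python
import random
from collections import defaultdict

def explore_and_compress(step_fn, obs, budget=60, actions=(0, 1, 2)):
    """Phase 1: spend the exploration budget probing the environment.
    The 'compressed representation' here is just the average observed
    effect of each action (adequate only for additive dynamics)."""
    deltas = defaultdict(list)
    for _ in range(budget):
        a = random.choice(actions)
        nxt = step_fn(obs, a)
        deltas[a].append(nxt - obs)   # record the consequence of a
        obs = nxt
    return {a: sum(d) / len(d) for a, d in deltas.items()}

# With dynamics s' = s + 2a, the learned model recovers effect 2a per action
random.seed(0)
model = explore_and_compress(lambda s, a: s + 2 * a, 0)
```

After the exploration phase, acting reduces to planning against `model` instead of spending further environment steps.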
Program Synthesis (Inherited from ARC-AGI-1/2)
Program synthesis approaches — which worked well on ARC-AGI-1/2 by searching for a program in a DSL that maps inputs to outputs — are less directly applicable to ARC-AGI-3's interactive setting, but the idea of maintaining a symbolic hypothesis about environment rules is still relevant.
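The classic ARC-AGI-1/2 recipe, searching a DSL for a program consistent with the example pairs, can be shown with a toy three-primitive DSL. The primitives and depth limit are illustrative; real solvers use far richer DSLs and smarter search:

```python
from itertools import product

# A toy DSL of grid transforms, far smaller than real ARC solvers use
PRIMS = {
    "flip_h": lambda g: [row[::-1] for row in g],
    "flip_v": lambda g: g[::-1],
    "transpose": lambda g: [list(r) for r in zip(*g)],
}

def synthesize(examples, max_depth=2):
    """Return the first composition of primitives (shortest first) that
    maps every example input to its output, or None if search fails."""
    for depth in range(1, max_depth + 1):
        for prog in product(PRIMS, repeat=depth):
            def run(grid, prog=prog):
                for name in prog:
                    grid = PRIMS[name](grid)
                return grid
            if all(run(i) == o for i, o in examples):
                return prog
    return None
```

A 90° clockwise rotation, for instance, is found as the composition `("flip_v", "transpose")`. The interactive analogue is exactly the hypothesis set above: a symbolic program standing in for the environment's rules.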
My Submission: Starting From Zero
My current submission is a random agent baseline — proof that the submission pipeline works. It scores near zero, as expected. Getting the data format, API, and submission infrastructure right is step one.
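For reference, the entire "intelligence" of that baseline fits in a few lines. The function signature is my simplification of an agent entry point, not the competition's actual interface:

```python
import random

def random_agent(observation, action_space):
    """v0 baseline: ignore the observation and sample a legal action.
    Its only job is to prove the submission pipeline runs end to end."""
    return random.choice(action_space)
```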
My plan for improving:
- Structured exploration. Replace random actions with a principled exploration policy that maximizes information gained about the environment's rules per step.
- Hypothesis tracking. Maintain a set of candidate hypotheses about the environment dynamics and prune them as observations come in.
- LLM-guided reasoning. Use an LLM to interpret the accumulated observations and propose high-level strategies, grounding abstract reasoning in the actual environment state.
- Verification loops. Before committing to a goal-achieving action, simulate it against the current world model to check consistency.
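The last two bullets compose naturally: the world model built from observations doubles as a simulator for vetting actions before committing. Every name below is illustrative:

```python
def verified_act(candidates, world_model, state, goal_check):
    """Simulate each candidate action against the current world model and
    return the first one predicted to reach the goal; otherwise signal
    that more exploration is needed."""
    for action in candidates:
        predicted = world_model(state, action)   # internal simulation, no env step
        if goal_check(predicted):
            return action
    return None   # nothing survives verification; keep exploring

# With a learned additive model and the goal "reach state 5" from state 3:
choice = verified_act([1, 2, 3], lambda s, a: s + a, 3, lambda s: s == 5)
```

The appeal of this loop is that a wrong world model fails cheaply inside simulation rather than expensively against the real environment.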
Why ARC-AGI-3 Matters Beyond the Competition
The ARC Prize series has always been Chollet's counterpoint to benchmark saturation. Modern LLMs score near-human on MMLU, HellaSwag, ARC-Easy — but primarily by memorizing patterns from training data, not by reasoning about genuinely novel problems.
ARC-AGI-3 is specifically constructed to resist this. The interactive environments are novel by design. Success requires something closer to actual fluid intelligence: the ability to build accurate world models from sparse data, infer goals from context, and act effectively under uncertainty.
The $2M prize pool reflects the difficulty. For those of us working on inference-time reasoning, planning, and agentic AI, ARC-AGI-3 is the right stress test — it forces honest answers to the question of what "understanding" actually means in a computable sense.
My submission: Kaggle — ARC-AGI-3 Random Agent v0
Competition: ARC Prize 2026 on Kaggle
Updates as I improve the approach. Follow @Asg_Wolverine for progress.