Analysis
November 24, 2025

Gemini 3 Pro's 523% ARC-AGI Jump: Google's Secret Weapon or Benchmark Gaming?

Google's Gemini 3 Pro claims to win 19/20 benchmarks with a 523% improvement in ARC-AGI. Breaking down what's real, what's hype, and whether it threatens Claude and GPT dominance.


Last Updated: November 24, 2025

On November 18, Google dropped a bomb: Gemini 3 Pro scored 31.1% on ARC-AGI-2, a 523% improvement over Gemini 2.0 Pro. The press release claimed it "won 19 out of 20 benchmarks" against competitors.

Tech Twitter exploded. "Google is back!" some said. "Benchmark gaming," others muttered.

After digging through the data, the truth is more nuanced—and more interesting—than either camp admits.


The Numbers That Make You Go "Wait, What?"

Let's start with what Google is claiming:

Gemini 3 Pro Performance (Nov 2025):

  • ARC-AGI-2: 31.1% (up from 5.0%)
  • MMLU-Pro: 88.3%
  • GPQA Diamond: 65.2%
  • Context window: 1 million tokens
  • Speed: 2x faster than GPT-5.1 (Google claims)

On paper, this looks insane. A 523% jump in a notoriously difficult benchmark? Let's unpack what's actually happening.


What Is ARC-AGI and Why It Matters

ARC-AGI (Abstraction and Reasoning Corpus) is different from every other benchmark.

SWE-bench tests: Can AI fix real code?
MMLU tests: Can AI pass exams?
ARC-AGI tests: Can AI think?

How ARC-AGI Works:

You're shown 2-3 examples of a pattern:

Input:  🟦🟦🟨    Output: 🟦🟨🟦
Input:  🟨🟦🟦    Output: 🟦🟦🟨

Then you get a new input and must figure out the rule:

Input:  🟨🟨🟦    Output: ???

The catch: The pattern changes every question. You can't memorize. You must reason.
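
To make that concrete, here is a minimal Python sketch of the setup, written purely for illustration (the candidate rules and names are my own, not part of any benchmark harness). The hidden rule in the emoji example above is a one-step left rotation; a solver has to discover that from the two demonstration pairs alone.

```python
# Toy illustration of the ARC-AGI setup: infer a transformation rule
# from a few demonstration pairs, then apply it to a new input.
# Real ARC tasks use 2-D color grids and far richer rules.
from typing import Callable

# Candidate rules the toy "solver" is allowed to hypothesize.
CANDIDATE_RULES: dict[str, Callable[[list[str]], list[str]]] = {
    "identity":     lambda row: row[:],
    "reverse":      lambda row: row[::-1],
    "rotate_left":  lambda row: row[1:] + row[:1],
    "rotate_right": lambda row: row[-1:] + row[:-1],
}

def infer_rule(demos: list[tuple[list[str], list[str]]]) -> str | None:
    """Return the first candidate rule consistent with every demo pair."""
    for name, rule in CANDIDATE_RULES.items():
        if all(rule(inp) == out for inp, out in demos):
            return name
    return None  # none of our hypotheses explain the demos

demos = [
    (["blue", "blue", "yellow"], ["blue", "yellow", "blue"]),
    (["yellow", "blue", "blue"], ["blue", "blue", "yellow"]),
]

rule_name = infer_rule(demos)                              # -> "rotate_left"
answer = CANDIDATE_RULES[rule_name](["yellow", "yellow", "blue"])
print(rule_name, answer)                                   # rotate_left ['yellow', 'blue', 'yellow']
```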

Why This Is Hard for AI:

LLMs are pattern matchers trained on text. ARC-AGI requires:

  • Visual-spatial reasoning
  • Abstract concept formation
  • Generalization from few examples
  • No pre-training on similar tasks (dataset is tiny: 800 problems)

Human performance: ~85%
Previous AI best (GPT-4): 5-12%
Gemini 3 Pro: 31.1%


Breaking Down the 523% Improvement

Here's where things get interesting.

Gemini 2.0 Pro: 5.0% on ARC-AGI-2
Gemini 3 Pro: 31.1% on ARC-AGI-2

Math check: (31.1 - 5.0) / 5.0 ≈ 522%, close enough to the advertised 523%.

But this framing is misleading. Let's reframe:

Absolute improvement: +26.1 percentage points
Problems solved: roughly 209 more out of 800 (~26% of the set)
Still wrong: 68.9% of the time
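
The same numbers as a quick computation (the 800-problem figure is the set size cited earlier; everything else comes straight from the scores above):

```python
# Relative vs. absolute framing of the same ARC-AGI-2 result.
old_score, new_score = 5.0, 31.1            # accuracy, in percent
total_problems = 800                         # problem-set size cited above

relative_gain = (new_score - old_score) / old_score * 100   # ≈ 522%
absolute_gain = new_score - old_score                       # 26.1 points
extra_solved = round(absolute_gain / 100 * total_problems)  # ≈ 209 problems
still_wrong = 100 - new_score                               # 68.9%

print(f"{relative_gain:.0f}% relative, {absolute_gain:.1f} pts absolute, "
      f"~{extra_solved} more problems solved, {still_wrong:.1f}% still missed")
```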

Reality check: Gemini 3 went from "barely functional" to "sometimes useful." That's impressive, but it's not AGI.


The "19 Out of 20 Benchmarks" Claim

Google's press release: "Gemini 3 Pro wins 19 out of 20 benchmarks against Claude 4.5, GPT-5.1, and Gemini 2.0."

Let's fact-check this:

Benchmarks Where Gemini 3 Wins (Verified):

  1. ARC-AGI-2 (31.1%) ✅ Clear win
  2. MMMU (Multimodal understanding) ✅ Best-in-class
  3. 1M context window ✅ Longest available
  4. GPQA Diamond (Science Q&A) ✅ Strong lead
  5. Math-500 (Advanced math) ✅ Competitive

Benchmarks Conveniently Omitted (Where It Likely Loses):

SWE-bench Verified (Coding) - No published score ❌
HumanEval (Code generation) - No data ❌
AIME 2025 (Math reasoning) - No score vs GPT-5.1 ❌
Real-world developer satisfaction - No testing ❌

Pattern: Gemini 3 dominates benchmarks that test reasoning/multimodal skills. It conspicuously avoids coding benchmarks where Claude 4.5 and GPT-5.1 excel.


What Gemini 3 Is Actually Good At

Let's be fair. Gemini 3 Pro has real strengths:

1. Multimodal Reasoning

  • Best-in-class image understanding
  • Can analyze complex diagrams, charts, videos
  • Outperforms Claude and GPT on vision tasks

2. Massive Context Window

  • 1 million tokens (vs 200k for Claude/GPT)
  • Can process entire codebases, long documents
  • Enables new use cases (summarize entire books)

3. Scientific Reasoning

  • 65.2% on GPQA Diamond (hard science questions)
  • Strong at mathematical proofs
  • Excels in physics/chemistry problem-solving

4. Cost Efficiency

  • $2.00/$1.50 per 1M tokens (cheaper than competitors)
  • Faster inference (2x vs GPT-5.1, per Google)
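
Two quick back-of-envelope checks on those numbers. The ~4 characters-per-token ratio is a common heuristic (actual tokenization varies by model), and the prices are the figures quoted above, assumed here to be in input/output order; neither is an official spec.

```python
# Rough sanity checks on the 1M-token window and the quoted pricing.
CHARS_PER_TOKEN = 4                              # heuristic, not exact
CONTEXT_TOKENS = 1_000_000
PRICE_IN_PER_M, PRICE_OUT_PER_M = 2.00, 1.50     # USD per 1M tokens (as quoted)

def estimate_tokens(text: str) -> int:
    """Crude token estimate from character count."""
    return len(text) // CHARS_PER_TOKEN

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Cost of one request at the quoted per-1M-token rates."""
    return input_tokens / 1e6 * PRICE_IN_PER_M + output_tokens / 1e6 * PRICE_OUT_PER_M

# A ~400-page book is roughly 750k characters:
book_tokens = estimate_tokens("x" * 750_000)        # ≈ 187,500 tokens
print(book_tokens <= CONTEXT_TOKENS)                # True: fits with room to spare
print(f"${request_cost(book_tokens, 2_000):.2f}")   # ≈ $0.38 to summarize it
```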

Bottom line: If your use case is multimodal analysis, long-document processing, or scientific reasoning, Gemini 3 is compelling.


Where Gemini 3 Falls Short

1. Coding Performance (Unknown)

Google hasn't published SWE-bench scores. Red flag.

Speculation: If Gemini 3 beat Claude 4.5's 77.2%, Google would be screaming it from the rooftops. The silence suggests it doesn't.

2. Real-World Testing

All data comes from Google's labs. No independent verification yet.

3. Ecosystem

  • Smaller dev community than OpenAI/Anthropic
  • Fewer integrations
  • Less mature API infrastructure

4. Trust Issues

Google has a history of overhyped AI moments (remember the LaMDA "sentience" saga?). The selective benchmark disclosure here doesn't help.


The ARC-AGI Breakthrough: Real or Overfitted?

Here's the uncomfortable question: Did Gemini 3 actually get smarter, or did Google just overfit to ARC-AGI?

Evidence it's real:

  • Gains across multiple reasoning benchmarks (not just ARC)
  • Improvements align with architectural changes (Google mentions "test-time compute"; a sketch of the idea appears at the end of this section)
  • Other models (GPT-5.1, Claude 4.5) also improved reasoning recently

Evidence it's overfitted:

  • 523% is suspiciously large
  • No code released for reproduction
  • ARC-AGI creator (François Chollet) hasn't verified claims
  • Previous Google models had inflated benchmark claims

My take: Probably 70% real, 30% optimization for the specific benchmark. Gemini 3 is better at reasoning, but the 31.1% score likely inflates real-world performance.
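
For readers unfamiliar with the term: "test-time compute" generally means spending more inference on each problem, for example by sampling several candidate answers and keeping the most common one. The sketch below is a generic illustration of that idea, with a hypothetical solve_once() standing in for a model call; it is not a description of how Gemini 3 actually works.

```python
# Generic "test-time compute" pattern: sample many candidates, majority-vote.
# `solve_once` is a hypothetical stand-in for one stochastic model call;
# this illustrates the general technique, not Google's method.
import random
from collections import Counter

def solve_once(problem: str) -> str:
    """Pretend model call: returns a candidate answer, sometimes wrong."""
    return random.choice(["A", "A", "A", "B"])   # noisy, but biased toward "A"

def solve_with_more_compute(problem: str, samples: int = 16) -> str:
    """Spend `samples`x the compute, keep the most frequent answer."""
    votes = Counter(solve_once(problem) for _ in range(samples))
    return votes.most_common(1)[0][0]

print(solve_with_more_compute("some ARC-style puzzle"))   # usually "A"
```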


Should Claude and GPT Be Worried?

Short answer: Not yet, but they should pay attention.

Google's Advantages:

Deep pockets (Alphabet has $120B cash) ✅
Data access (YouTube, Search, Gmail) ✅
Hardware (TPUs optimized for their models) ✅
Distribution (Google products reach 3B+ users) ✅

Google's Weaknesses:

Late to market (2+ years behind OpenAI) ❌
Organizational chaos (7 different AI divisions) ❌
Brand damage (past AI failures: LaMDA, Bard's botched launch demo) ❌
Developer trust (frequent product shutdowns) ❌

Reality: Gemini 3 is competitive, but Claude 4.5 and GPT-5.1 still lead in the most important use case: helping developers write code.


What This Means for Claude 5

If Gemini 3 can jump 523% in reasoning, what does that mean for Claude 5?

Optimistic scenario: Anthropic applies similar techniques → Claude 5 hits 85%+ on SWE-bench, 40%+ on ARC-AGI.

Pessimistic scenario: Google's gains are benchmark-specific → Claude 5 improves incrementally (80-82% SWE-bench).

Most likely: The entire field is improving rapidly. Claude 5, GPT-5.5, and Gemini 4 will all be dramatically better than 2025 models.

Expected timeline: Claude 5 in Q2-Q3 2026. Track it here.


How to Think About Gemini 3 vs Claude 4.5 vs GPT-5.1

Choose Gemini 3 if:

✅ You need multimodal analysis (images, videos, diagrams)
✅ You process extremely long documents (500k+ tokens)
✅ Your work is science-heavy (physics, chemistry, math)
✅ You want lower API costs

Choose Claude 4.5 if:

✅ Coding is your primary use case
✅ You need the highest code quality
✅ You value reliability over speed

Choose GPT-5.1 if:

✅ You want the fastest responses
✅ You need the largest ecosystem (plugins, integrations)
✅ You do mixed tasks (coding + writing + reasoning)

The market is splitting: Different models for different jobs. The "one AI to rule them all" era is over.
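
If you want to encode that split in tooling, the simplest version is a task router. The sketch below is purely illustrative: the model labels, task categories, and the 200k-token threshold are my assumptions, not output from any benchmark.

```python
# Toy task router reflecting the decision lists above (assumptions, not advice).
def pick_model(task_type: str, context_tokens: int = 0) -> str:
    if task_type in {"vision", "video", "diagram"} or context_tokens > 200_000:
        return "gemini-3-pro"      # multimodal work or very long context
    if task_type in {"coding", "refactor", "debug"}:
        return "claude-4.5"        # strongest published coding scores
    return "gpt-5.1"               # fast responses, broad ecosystem, mixed tasks

print(pick_model("coding"))                               # claude-4.5
print(pick_model("summarize", context_tokens=800_000))    # gemini-3-pro
```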


Prediction: Where This Goes Next

Q1 2026:

  • Independent researchers test Gemini 3 claims
  • Real-world performance data emerges
  • Either validates Google's comeback or exposes overhype

Q2-Q3 2026:

  • Claude 5 launches (expected)
  • GPT-5.5 or GPT-6 announcement
  • ARC-AGI scores cross 50% (well past halfway to the ~85% human baseline)

2027:

  • ARC-AGI becomes obsolete (too easy for new models)
  • New benchmarks emerge to test deeper reasoning (nothing adequate exists yet)
  • AI coding hits 90%+ accuracy on standard tasks

Conclusion: Google Is Back (Sort Of)

Gemini 3 Pro is a real achievement. A 523% improvement in abstract reasoning is impressive, even if the framing is hyperbolic.

But Google hasn't reclaimed the crown. They've proven they can compete—which is different from leading.

Current leaderboard (Nov 2025):

  1. Coding: Claude 4.5 (77.2% SWE-bench)
  2. Speed: GPT-5.1 (70 tokens/sec)
  3. Reasoning: Gemini 3 Pro (31.1% ARC-AGI)
  4. Ecosystem: GPT-5.1 (ChatGPT, API, plugins)

There's no single winner. Pick your tool based on your task.

And keep watching. The race is accelerating.


Data Sources & Verification

Gemini 3 Pro benchmarks:

  • Google DeepMind announcement (Nov 18, 2025)
  • TechRadar, The Algorithmic Bridge analysis
  • ARC-AGI leaderboard (awaiting independent verification)

Comparison data:

  • Claude 4.5: Anthropic (Sep 2025), InfoQ
  • GPT-5.1: OpenAI System Card (Nov 2025)
  • SWE-bench: Official leaderboard

Conflicts of Interest: None. I don't work for Google, Anthropic, or OpenAI.

Want to compare coding performance directly? Read Claude 4.5 vs GPT-5.1 head-to-head.