Analysis
February 2, 2026

GPT-5.1 SWE-bench Score: 76.3% Verified Results & Full Analysis

GPT-5.1 achieves 76.3% on SWE-bench Verified. Compare with Claude 4.5 (77.2%), see AIME 2025 scores, and understand what these benchmarks mean.

OpenAI's GPT-5.1 has made waves in the AI coding benchmark space, achieving a 76.3% score on SWE-bench Verified — making it one of the top-performing models for real-world software engineering tasks.

But what does this score actually mean? And how does it compare to Claude 4.5's leading 77.2%? Let's break it down.

What is SWE-bench Verified?

SWE-bench (Software Engineering Benchmark) tests AI models on their ability to solve real GitHub issues from popular open-source projects. Unlike synthetic coding tests, SWE-bench uses actual bugs and feature requests that human developers solved.

SWE-bench Verified is the stricter version: a 500-instance subset in which each task has been manually screened by human annotators to ensure:

  • The issue description is clearly specified
  • The expected solution is unambiguous
  • The evaluation tests are fair and don't reject valid fixes

This makes SWE-bench Verified the gold standard for measuring AI coding ability.
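
To make the setup concrete, here is a rough sketch of what a single SWE-bench task instance looks like. The field names follow the public SWE-bench dataset schema, but the repository, issue text, and test names are illustrative placeholders rather than a real instance.

```python
# A minimal sketch of one SWE-bench task instance (illustrative values).
# Field names match the public SWE-bench dataset schema; the repo, issue
# text, and test names are placeholders, not a real benchmark instance.
task = {
    "instance_id": "example__repo-1234",   # unique id: <org>__<repo>-<issue number>
    "repo": "example/repo",                # GitHub repository the issue comes from
    "base_commit": "abc123",               # commit the model's patch is applied to
    "problem_statement": "Calling foo() with an empty list raises IndexError ...",
    "FAIL_TO_PASS": ["tests/test_foo.py::test_empty_list"],  # tests the fix must make pass
    "PASS_TO_PASS": ["tests/test_foo.py::test_basic"],       # tests that must keep passing
}

# A model "resolves" an instance if its generated patch makes every
# FAIL_TO_PASS test pass without breaking any PASS_TO_PASS test.
```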

GPT-5.1 Benchmark Results

| Benchmark | GPT-5.1 Score | Notes |
|---|---|---|
| SWE-bench Verified | 76.3% | 2nd place globally |
| SWE-bench Pro | ~72% | More complex tasks |
| AIME 2025 | 94.0% | Mathematical reasoning |
| Coding Tasks | Top tier | Strong at implementation |
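
Because SWE-bench Verified contains 500 task instances, a percentage score maps directly onto a count of resolved issues. The quick calculation below uses the scores quoted in this article.

```python
# Convert a SWE-bench Verified percentage into an approximate count of
# resolved issues, assuming the standard 500-instance Verified set.
VERIFIED_INSTANCES = 500

scores = {"GPT-5.1": 76.3, "Claude 4.5": 77.2}  # percentages quoted in this article

for model, pct in scores.items():
    resolved = round(VERIFIED_INSTANCES * pct / 100)
    print(f"{model}: ~{resolved} of {VERIFIED_INSTANCES} issues resolved ({pct}%)")

# Approximate output:
# GPT-5.1: ~382 of 500 issues resolved (76.3%)
# Claude 4.5: ~386 of 500 issues resolved (77.2%)
```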

Key Strengths of GPT-5.1

  1. Mathematical Reasoning: The 94% AIME 2025 score shows exceptional math capabilities
  2. Code Generation: Excels at writing new code from specifications
  3. Multi-step Problems: Handles complex, multi-file changes well
  4. API Design: Strong at creating clean, documented interfaces

Where GPT-5.1 Falls Short

  1. Bug Localization: Sometimes struggles to find the exact location of bugs
  2. Large Codebase Navigation: Less efficient in very large repositories
  3. Edge Cases: Occasionally misses subtle edge cases that Claude catches

GPT-5.1 vs Claude 4.5: Head-to-Head

| Metric | Claude 4.5 | GPT-5.1 | Winner |
|---|---|---|---|
| SWE-bench Verified | 77.2% | 76.3% | Claude |
| AIME 2025 | ~88% | 94.0% | GPT |
| OSWorld | 61.4% | ~55% | Claude |
| Error Rate (Replit) | 0% | ~2% | Claude |

The Verdict: Claude 4.5 leads in pure coding benchmarks, while GPT-5.1 excels at mathematical reasoning. For software engineering specifically, Claude maintains a slight edge.

What About Gemini 3?

Google's Gemini 3 Pro achieved 31.1% on ARC-AGI-2 — a different benchmark focused on general reasoning rather than coding. While impressive for AGI research, it's not directly comparable to SWE-bench scores.

For coding tasks specifically:

  • Claude 4.5: 77.2% (SWE-bench Verified)
  • GPT-5.1: 76.3% (SWE-bench Verified)
  • Gemini 3: Not directly comparable (different benchmark focus)

Should You Use GPT-5.1 for Coding?

Yes, if you:

  • Need strong mathematical reasoning alongside code
  • Work primarily on greenfield projects (new code)
  • Use OpenAI's ecosystem (Codex integration; see the API sketch after these lists)

Consider Claude 4.5 if you:

  • Work on large existing codebases
  • Need precise bug fixing
  • Require the absolute best SWE-bench performance
  • Value lower error rates
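
Whichever model you choose, wiring it into a coding workflow looks similar. Below is a minimal sketch of sending a bug-fix prompt to GPT-5.1 through the official openai Python SDK; the model identifier "gpt-5.1" is an assumption for illustration and should be checked against OpenAI's current model list.

```python
# Minimal sketch: asking GPT-5.1 to propose a patch for a reported issue.
# Assumes the openai Python SDK (v1+) is installed and OPENAI_API_KEY is set.
# The model id "gpt-5.1" is assumed for illustration; confirm the exact
# identifier available to your account before using it.
from openai import OpenAI

client = OpenAI()

issue = """Calling parse_config() with an empty file raises KeyError
instead of returning the documented default settings."""

response = client.chat.completions.create(
    model="gpt-5.1",  # assumed identifier, see note above
    messages=[
        {
            "role": "system",
            "content": "You are a senior software engineer. "
                       "Return a unified diff that fixes the issue.",
        },
        {"role": "user", "content": issue},
    ],
)

print(response.choices[0].message.content)  # proposed patch as a unified diff
```

In practice you would apply the returned diff in a sandbox and run the project's test suite before accepting it, which is essentially what the SWE-bench harness automates.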

Looking Ahead: Claude 5

With Claude 4.5 already leading benchmarks at 77.2%, speculation is growing about Claude 5 (expected Q2-Q3 2026). If Anthropic maintains their trajectory, we could see:

  • SWE-bench Verified scores approaching 85%+
  • Enhanced reasoning capabilities
  • Larger context windows for massive codebases
  • Improved real-world coding assistance

Conclusion

GPT-5.1's 76.3% SWE-bench Verified score is impressive: it's the second-best published result on the benchmark. However, Claude 4.5's 77.2% keeps Anthropic in the lead for pure coding benchmarks.

For developers choosing between models, the difference is marginal. Both are excellent coding assistants. Your choice should depend on:

  • Your existing tool ecosystem
  • Specific use case requirements
  • Pricing and availability

The real winner? Developers who now have multiple world-class AI coding assistants to choose from.


Sources:

  • OpenAI GPT-5.1 System Card (November 2025)
  • Anthropic Claude 4.5 Announcement (September 2025)
  • SWE-bench Official Leaderboard
  • Artificial Analysis AI Rankings (November 2025)
