What is SWE-bench?
The benchmark that separates real coding AI from chatbots
The 30-Second Explanation
SWE-bench (Software Engineering Benchmark) gives AI models real bugs from GitHub repositories like Django and scikit-learn. The AI must read the issue description, analyze the codebase, and generate a fix that passes the project's test suite. It's the closest thing we have to testing "can this AI actually code?"
How SWE-bench Actually Works
Step 1: Real GitHub Issues
Researchers collected 2,294 real bugs from 12 popular Python projects. These aren't toy problems—they're actual issues that human developers fixed in production codebases.
Step 2: AI Gets the Context
The AI receives: (1) the GitHub issue description, (2) the repository's codebase at that point in time, and (3) instructions to generate a patch that fixes the bug.
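Concretely, each task can be pictured as a record like the sketch below. The field names mirror the publicly released SWE-bench dataset; the values here are invented for illustration, not a real instance.

```python
# Sketch of a single SWE-bench task instance.
# Field names follow the public dataset release; the values are illustrative.
task = {
    "instance_id": "django__django-12345",  # hypothetical ID
    "repo": "django/django",
    "base_commit": "abc123",  # repo state when the issue was filed
    "problem_statement": "QuerySet.filter() raises TypeError when ...",
    # Tests the human fix made pass (must pass after the AI's patch):
    "FAIL_TO_PASS": ["tests/queries/test_filter.py::test_filter_typeerror"],
    # Tests that already passed (must not regress):
    "PASS_TO_PASS": ["tests/queries/test_filter.py::test_basic_filter"],
}

# The model sees the problem statement plus the checked-out repository,
# and must emit a patch (unified diff) against base_commit.
prompt_context = (task["problem_statement"], task["repo"], task["base_commit"])
```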
Step 3: Auto-Grading
The AI's patch is applied to the repository and run against the project's test suite. To count as solved, the tests that the human fix made pass must now pass, and every previously passing test must still pass. No partial credit.
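The grading rule can be sketched in a few lines. The real harness runs each project's test suite in an isolated container; this is only the all-or-nothing decision logic, with illustrative test names.

```python
def is_resolved(fail_to_pass: dict[str, bool], pass_to_pass: dict[str, bool]) -> bool:
    """All-or-nothing grading: every test the human fix unlocked must now
    pass, and no previously passing test may regress."""
    fixed = all(fail_to_pass.values())           # issue actually resolved?
    no_regressions = all(pass_to_pass.values())  # nothing broken?
    return fixed and no_regressions

# Example: the patch fixes the bug but breaks an existing test -> not solved.
results_f2p = {"test_filter_typeerror": True}
results_p2p = {"test_basic_filter": False}
print(is_resolved(results_f2p, results_p2p))  # False -- no partial credit
```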
Two Versions: Full vs Verified
SWE-bench Full
2,294 problems from real repositories
Issue: Some problems have ambiguous requirements or flaky tests. Scores can be misleading.
SWE-bench Verified ⭐
A 500-problem subset, screened by human annotators to keep only issues that are solvable and clearly specified
This is the gold standard. When people cite SWE-bench scores now, they usually mean Verified.
Current Leaderboard (Nov 2025)
| Rank | Model | SWE-bench Verified | Released |
|---|---|---|---|
| 🥇 1st | Claude Sonnet 4.5 | 77.2% | Sep 2025 |
| 🥈 2nd | GPT-5.1 | 76.3% | Nov 2025 |
| 3rd | Claude 3.5 Sonnet | 49.0% | Oct 2024 |
| 4th | GPT-4 Turbo | 38.0% | Apr 2024 |
Note: Claude Sonnet 4.5's 77.2% is the highest reported score on SWE-bench Verified to date, a +28.2 percentage point improvement over Claude 3.5 Sonnet's 49.0%.
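The improvement figure in the note is plain percentage-point arithmetic on the table's scores:

```python
# Percentage-point gap between the top score and the 2024-era Claude 3.5
# Sonnet score, using the figures from the table above.
claude_sonnet_45 = 77.2
claude_35_sonnet = 49.0
delta = round(claude_sonnet_45 - claude_35_sonnet, 1)
print(f"+{delta} percentage points")  # +28.2 percentage points
```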
Why SWE-bench Matters for Developers
✅ It Tests Real Skills
Unlike "write a function to reverse a string" benchmarks, SWE-bench requires understanding existing code, navigating complex repos, and making targeted fixes without breaking anything.
✅ It's Objective
Either the tests pass or they don't. No subjective human evaluation needed.
✅ It Predicts Real-World Performance
High SWE-bench scores tend to track how useful developers find AI coding assistants in practice. If an AI can fix real GitHub bugs, it can probably help with your production code.
Limitations to Know
Python-Only
All problems come from Python repositories. If you work in JavaScript, Rust, or Go, SWE-bench scores may not fully reflect performance in your language.
Doesn't Test Everything
SWE-bench focuses on bug fixes. It doesn't test greenfield development, architecture design, or code review skills.
Test Coverage Isn't Perfect
An AI could pass tests while introducing subtle bugs in edge cases not covered by the test suite.
The Bottom Line
SWE-bench is the best public benchmark we have for evaluating AI coding ability. It's not perfect, but a high SWE-bench Verified score (like Claude 4.5's 77.2%) is a strong signal that an AI can handle real-world coding tasks.
When choosing an AI coding assistant, SWE-bench scores should be one factor among many—but they're probably the most important single metric available today.
Sources & Further Reading
- SWE-bench paper and official leaderboard: www.swebench.com
- Claude Sonnet 4.5 announcement: Anthropic (September 2025)
- GPT-5.1 System Card: OpenAI (November 2025)
- Leaderboard data: compiled from the official model announcements above