What is SWE-bench?

The benchmark that separates real coding AI from chatbots

The 30-Second Explanation

SWE-bench (Software Engineering Benchmark) gives AI models real bugs from GitHub repositories like Django and scikit-learn. The AI must read the issue description, analyze the codebase, and generate a fix that passes the project's test suite. It's the closest thing we have to testing "can this AI actually code?"

How SWE-bench Actually Works

Step 1: Real GitHub Issues

Researchers collected 2,294 real bugs from 12 popular Python projects. These aren't toy problems—they're actual issues that human developers fixed in production codebases.

Step 2: AI Gets the Context

The AI receives: (1) the GitHub issue description, (2) the repository's codebase at that point in time, and (3) instructions to generate a patch that fixes the bug.
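Concretely, each task can be pictured as a small record. A minimal sketch, with field names modeled on the public SWE-bench dataset (the values here are shortened, illustrative examples, not a real instance):

```python
# Illustrative shape of one SWE-bench task instance. Field names follow
# the public SWE-bench dataset; the values are invented examples.
task = {
    "repo": "django/django",          # project the issue came from
    "base_commit": "abc123",          # codebase snapshot the AI sees (placeholder hash)
    "problem_statement": "QuerySet.only() crashes when ...",  # the GitHub issue text
    "FAIL_TO_PASS": ["tests/queries/test_only.py::test_only_crash"],  # tests the fix must make pass
    "PASS_TO_PASS": ["tests/queries/test_basic.py::test_filter"],     # tests that must keep passing
}

# The model's job: given repo + base_commit + problem_statement,
# produce a patch (a unified diff) that resolves the issue.
print(sorted(task.keys()))
```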

Step 3: Auto-Grading

The AI's patch is applied and the project's test suite runs automatically. The fix counts as solved only if the tests that the human fix made pass now pass, and the tests that passed before still pass. No partial credit.
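The all-or-nothing grading logic can be sketched as follows. This is a simplified model, not the real harness (which applies the patch and runs the project's test suite in an isolated per-repo environment); the function name and arguments are illustrative:

```python
def is_resolved(patch_applied, test_results, fail_to_pass, pass_to_pass):
    """Simplified sketch of SWE-bench grading (illustrative, not the real harness).

    patch_applied: did the model's patch apply cleanly to the repo?
    test_results:  test id -> True/False after running the patched code.
    fail_to_pass:  tests that failed before the human fix and must now pass.
    pass_to_pass:  tests that already passed and must still pass (no regressions).
    """
    if not patch_applied:
        return False  # an unpatchable diff counts as unsolved
    if not all(test_results.get(t, False) for t in fail_to_pass):
        return False  # the bug isn't actually fixed
    if not all(test_results.get(t, False) for t in pass_to_pass):
        return False  # the fix broke something else
    return True  # no partial credit: every test must pass

# Example: the fix works but introduces a regression -> not solved.
results = {"test_bug": True, "test_existing": False}
print(is_resolved(True, results, ["test_bug"], ["test_existing"]))  # False
```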

Two Versions: Full vs Verified

SWE-bench Full

2,294 problems from real repositories

Issue: Some problems have ambiguous requirements or flaky tests. Scores can be misleading.

SWE-bench Verified ⭐

500 hand-picked, clearly defined problems

This is the gold standard. When people cite SWE-bench scores now, they usually mean Verified.

Current Leaderboard (Nov 2025)

| Rank | Model | SWE-bench Verified | Released |
|------|-------|--------------------|----------|
| 🥇 1st | Claude Sonnet 4.5 | 77.2% | Sep 2025 |
| 🥈 2nd | GPT-5.1 | 76.3% | Nov 2025 |
| 3rd | Claude Sonnet 3.5 | 49.0% | Jun 2024 |
| 4th | GPT-4 Turbo | 38.0% | Apr 2024 |

Note: Claude Sonnet 4.5's 77.2% is the highest score among the models tracked here — a +28.2 percentage point improvement over Claude Sonnet 3.5 in just over a year.

Why SWE-bench Matters for Developers

✅ It Tests Real Skills

Unlike "write a function to reverse a string" benchmarks, SWE-bench requires understanding existing code, navigating complex repos, and making targeted fixes without breaking anything.

✅ It's Objective

Either the tests pass or they don't. No subjective human evaluation needed.

✅ It Predicts Real-World Performance

High SWE-bench scores tend to track how useful an AI coding assistant feels in practice. If an AI can fix real GitHub bugs, it can probably help with your production code.

Limitations to Know

⚠️

Python-Only

All tests are in Python. If you work in JavaScript, Rust, or Go, SWE-bench scores might not fully reflect performance in your language.

⚠️

Doesn't Test Everything

SWE-bench focuses on bug fixes. It doesn't test greenfield development, architecture design, or code review skills.

⚠️

Test Coverage Isn't Perfect

An AI could pass tests while introducing subtle bugs in edge cases not covered by the test suite.

The Bottom Line

SWE-bench is the best public benchmark we have for evaluating AI coding ability. It's not perfect, but a high SWE-bench Verified score (like Claude Sonnet 4.5's 77.2%) is a strong signal that an AI can handle real-world coding tasks.

When choosing an AI coding assistant, SWE-bench scores should be one factor among many—but they're probably the most important single metric available today.

Sources & Further Reading

  • SWE-bench paper and leaderboard: www.swebench.com
  • Claude Sonnet 4.5 announcement: Anthropic (September 2025)
  • GPT-5.1 System Card: OpenAI (November 2025)
  • Leaderboard data: verified from official model announcements