TL;DR
Claude Opus 4.7 leads SWE-bench Pro, CursorBench, and SWE-bench Verified with sharply reduced tool errors and stronger multi-agent capabilities for long autonomous workflows.
64.3% on SWE-bench Pro. That single number explains why Anthropic shipped Claude Opus 4.7 today, and it is the figure most practitioners will use to decide whether to migrate their pipelines.
The Next Web reports that Opus 4.7 finishes well ahead of GPT-5.4 at 57.7% and Gemini 3.1 Pro at 54.2% on SWE-bench Pro, which evaluates a model's ability to resolve real issues from open-source repositories. The jump from Opus 4.6 is equally striking: the previous generation scored 53.4% on the same tasks. On SWE-bench Verified, a curated subset, Opus 4.7 reaches 87.6%, against 80.6% for Gemini 3.1 Pro and 80.8% for its predecessor.
Anthropic is releasing this model at a charged commercial moment. The company runs at a $30 billion annualized revenue rate, is fielding investor interest at roughly an $800 billion valuation, and has entered early IPO talks. Opus 4.7 has to justify those numbers by becoming the model that engineering teams and enterprises choose to build on, not just benchmark against.
The benchmark picture
CursorBench, which measures autonomous coding performance inside the widely used AI code editor, shows Opus 4.7 at 70%, up from 58% on the prior version. That jump reflects conditions closer to actual deployment than abstract question-answering datasets: the model is operating inside tooling, against real codebases, under multi-step instruction sequences that developers issue daily.
The agentic results deserve particular attention. Anthropic reports a 14% improvement in multi-step agentic reasoning alongside a two-thirds reduction in tool errors relative to Opus 4.6. For systems running hours-long autonomous workflows, tool error rate is often more consequential than headline accuracy: one misfired API call can silently corrupt every downstream step in a long pipeline. For practitioners deploying agents in production, that reduction in failure rate is more actionable than most leaderboard scores.
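To make that failure mode concrete, here is a minimal sketch of the kind of defensive scaffolding agent teams wrap around tool calls: validate the result, retry a bounded number of times, and halt loudly rather than pass a corrupted value downstream. This is purely illustrative; the function names and validation logic are assumptions, not anything from Anthropic's stack.

```python
from typing import Any, Callable


class ToolCallError(RuntimeError):
    """Raised when a tool call cannot produce a usable result."""


def guarded_call(
    tool: Callable[..., Any],
    validate: Callable[[Any], bool],
    max_retries: int = 2,
    **kwargs: Any,
) -> Any:
    """Run a tool call, validate its output, and retry a bounded number of times."""
    last_error: Exception | None = None
    for attempt in range(max_retries + 1):
        try:
            result = tool(**kwargs)
            if validate(result):
                return result
            last_error = ToolCallError(f"validation failed on attempt {attempt + 1}")
        except Exception as exc:  # surface transport or tool failures instead of swallowing them
            last_error = exc
    # Halting loudly beats letting a corrupted result flow into later steps.
    raise ToolCallError(f"tool call failed after {max_retries + 1} attempts") from last_error
```

The less often that retry-or-abort path fires, the longer an unattended workflow can run before a human has to step in, which is why a two-thirds cut in tool errors matters more in practice than a few extra benchmark points.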
Opus 4.7 also brings multi-agent coordination for extended tasks and a 3x improvement in image resolution handling. Pricing sits at $5 per million input tokens and $25 per million output tokens, and the model is available through Claude plans and enterprise channels including Amazon Bedrock, Vertex AI, and Microsoft Foundry.
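For a rough sense of what those rates mean in practice, the arithmetic is simple. The workload figures below are invented for illustration; only the per-token prices come from Anthropic's published pricing.

```python
# Back-of-envelope cost estimate at Opus 4.7 list prices.
INPUT_PER_MTOK = 5.00    # USD per million input tokens
OUTPUT_PER_MTOK = 25.00  # USD per million output tokens


def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of a workload at list prices."""
    return (
        input_tokens / 1_000_000 * INPUT_PER_MTOK
        + output_tokens / 1_000_000 * OUTPUT_PER_MTOK
    )


# Hypothetical agent session: 2M input tokens, 400K output tokens.
# 2 x $5 + 0.4 x $25 = $20.00
print(f"${estimate_cost(2_000_000, 400_000):.2f}")
```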
What the numbers actually mean
SWE-bench Pro is the closest thing available to a systematic review of a model's software engineering capability, but the usual benchmark caveats apply. Models trained on large GitHub corpora may have encountered related code, so scores should be read as directional signals rather than ground truth. That said, consistent leads across SWE-bench Pro, SWE-bench Verified, and CursorBench are harder to attribute to benchmark exposure alone.
The broader competitive landscape is unusually dense right now. According to LLM Stats, this spring produced Meta's Muse Spark, Google's Gemma 4 family, and Anthropic's own Claude Mythos preview, all within a two-week window. CNBC reports that Meta is positioning Muse Spark on efficiency rather than top-tier capability, effectively ceding the raw performance race to Anthropic and Google for now. Opus 4.7's numbers suggest Anthropic is not ceding anything in that tier.
There is also a workflow angle for teams already embedded in Anthropic's tooling: 9to5Mac notes that Anthropic simultaneously shipped a redesigned Claude Code interface with scheduled routines that execute on cloud infrastructure without the developer's machine being online. A stronger base model becomes more useful when it can operate unattended, and the two releases are clearly paired by design.
Whether benchmark leadership converts to developer loyalty depends on where the error reductions actually show up in production codebases. The gap over GPT-5.4 is real on paper. Real-world deployments over the next few weeks will show whether it holds under the messier conditions of enterprise software.
FAQ
What is SWE-bench Pro?
SWE-bench Pro evaluates models on unmodified issues drawn from open-source repositories, measuring whether a model can produce a working code patch. It is harder than SWE-bench Verified and is currently the most demanding publicly available benchmark for software engineering capability.
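In rough terms, an evaluation of this kind boils down to applying the model's patch to the repository and running the project's tests. The sketch below illustrates that loop only; it is not the actual SWE-bench Pro harness, and the commands and paths are assumptions.

```python
import subprocess


def evaluate_patch(repo_dir: str, patch_file: str, test_cmd: list[str]) -> bool:
    """Apply a model-generated patch and report whether the test suite passes.

    Illustrative only; the real benchmark harness is more involved.
    """
    applied = subprocess.run(["git", "apply", patch_file], cwd=repo_dir)
    if applied.returncode != 0:
        return False  # the patch does not even apply cleanly
    tests = subprocess.run(test_cmd, cwd=repo_dir)
    return tests.returncode == 0


# Example with hypothetical paths:
# evaluate_patch("repo/", "model_fix.patch", ["pytest", "-q"])
```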
How does Claude Opus 4.7 compare to GPT-5.4 and Gemini 3.1 Pro?
On SWE-bench Pro, Opus 4.7 scores 64.3% against GPT-5.4's 57.7% and Gemini 3.1 Pro's 54.2%. On CursorBench it reaches 70%, up from 58% on its predecessor. The improvement generation-over-generation is larger than the margin over any single competitor.
How much does Claude Opus 4.7 cost?
Anthropic prices Opus 4.7 at $5 per million input tokens and $25 per million output tokens. It is available through standard Claude plans and enterprise platforms including Amazon Bedrock, Vertex AI, and Microsoft Foundry.
What is Claude Mythos and how is it different from Opus 4.7?
Claude Mythos is a separate model in limited preview that Anthropic has declined to release publicly due to its capabilities in security-relevant tasks. As reported by PBS NewsHour, roughly 40 companies are testing it to identify vulnerabilities rather than deploy it commercially. Opus 4.7 is Anthropic's generally available production flagship.
About the Author
Guilherme A.
Former dentist from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.