TL;DR
Anthropic's Claude Opus 4.7 leads a crowded April 2026 AI release week alongside Meta's Muse Spark, Google's Gemma 4, and Zhipu AI's open-source GLM-5.1.
Claude Opus 4.7 landed on April 16 with an 87.6% score on SWE-Bench Verified, displacing Opus 4.6 as Anthropic's most capable publicly accessible model. It is available via Claude.ai, the API, and Amazon Bedrock, with a stated focus on agentic coding, complex multi-step reasoning, and long-horizon autonomous workflows.
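For practitioners, access is direct. A minimal sketch of calling the model through Anthropic's Python SDK follows; the model identifier string is an assumption, since Anthropic's exact ID for Opus 4.7 isn't quoted here, so verify it against the current model list before running.

```python
# Minimal Messages API call via the official anthropic SDK.
# "claude-opus-4-7" is an assumed model ID; check Anthropic's docs.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-opus-4-7",  # assumed identifier
    max_tokens=1024,
    messages=[{"role": "user", "content": "Triage the failing tests in this repo."}],
)
print(response.content[0].text)
```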
The 87.6% figure carries weight. SWE-Bench Verified tests real-world GitHub issues rather than synthetic tasks, and scores at that level approach the territory where AI systems become plausible for unsupervised production software engineering. A harder companion metric, SWE-Bench Pro at 64.3%, is designed to reduce data contamination, a consideration that matters as scrutiny of benchmark validity tightens across the research community.
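For readers who want to see what "real-world GitHub issues" means concretely, the benchmark's instances are public. A rough sketch of inspecting them, assuming the princeton-nlp/SWE-bench_Verified dataset ID on Hugging Face is still current:

```python
# Each SWE-Bench Verified instance pairs a real GitHub issue with a pinned
# repository state; a model must emit a patch that makes the tests pass.
from datasets import load_dataset

ds = load_dataset("princeton-nlp/SWE-bench_Verified", split="test")
example = ds[0]
print(example["repo"])               # the source repository
print(example["problem_statement"])  # the issue text the model sees
```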
The release arrives under complicated circumstances. AOL documents a recent wave of user complaints about Claude's performance, tied to Anthropic quietly reducing the model's default compute effort to cut token spend. The company confirmed that the change had appeared in its changelog but was never proactively communicated, drawing sharp criticism from developers. For a lab whose brand rests explicitly on trustworthiness, that narrative carries real cost.
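One practical lesson from the episode: where an API exposes reasoning-effort controls, pin them explicitly rather than inheriting provider defaults. A hedged sketch, assuming the extended-thinking parameter Anthropic exposes for recent Claude models also applies to Opus 4.7:

```python
# Pinning the thinking budget explicitly means a silent change to the
# default effort level cannot alter behavior between runs.
# Model ID and parameter applicability to Opus 4.7 are assumptions.
import anthropic

client = anthropic.Anthropic()
response = client.messages.create(
    model="claude-opus-4-7",  # assumed identifier
    max_tokens=16000,
    thinking={"type": "enabled", "budget_tokens": 8000},  # explicit, not default
    messages=[{"role": "user", "content": "Refactor this module for clarity."}],
)
print(response.content[-1].text)  # final text block; thinking blocks precede it
```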
April's broader release calendar showed the field moving on multiple fronts simultaneously. On April 8, Meta unveiled Muse Spark, the first model from Meta Superintelligence Labs, the organization built around Alexandr Wang. Per CNBC, Wang's recruitment came with a $14.3 billion investment in Scale AI; Muse Spark is the first visible output of that bet nine months later. Meta positioned the model as fast and efficient rather than frontier-quality, a deliberate choice after a disappointing open-source launch the prior April prompted Mark Zuckerberg to restructure the AI organization entirely.
Muse Spark's GPQA score of 0.9 matches several competing models on that measure. Meta's stock climbed 6.5% on announcement day, though the move coincided with broader market gains tied to geopolitical news rather than AI sentiment in isolation.
Google shipped four Gemma 4 variants on April 2: a mixture-of-experts 26B-A4B, a dense 31B, and two efficiency-focused E2B and E4B versions spanning compute tiers from edge to data center. NVIDIA's developer platform already supports the Gemma family on Blackwell and Hopper hardware, shortening the gap between model release and production-ready inference for practitioners who deploy on NVIDIA infrastructure.
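If the naming follows Google's recent open-weight releases, the smaller variants should load with standard Hugging Face tooling. A sketch, where the checkpoint ID is a guess at the convention rather than a confirmed repo name:

```python
# Loading a hypothetical small Gemma 4 checkpoint with transformers.
# "google/gemma-4-e2b" is an assumed Hub ID; confirm the real names.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "google/gemma-4-e2b"  # hypothetical identifier
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

prompt = "Explain mixture-of-experts routing in two sentences."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```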
The open-source picture
Zhipu AI's GLM-5.1 landed on April 7 with a GPQA of 0.9, placing a Chinese lab's open-weight release numerically alongside proprietary frontier scores on that benchmark. Claude Mythos Preview arrived the same day but serves a different purpose: it is a cybersecurity-specialized model restricted to infrastructure partners capable of deploying it responsibly. As The Financial Express notes, Mythos is not a diluted version of Opus 4.7 but a specialized frontier system for offensive and defensive security workflows, withheld from general release pending additional safety controls.
Alibaba's Qwen3.6 Plus, released March 31, rounds out a dense stretch. Major labs are now shipping significant models on roughly monthly cadences, compressing the evaluation window available to practitioners before the next release cycle begins.
What the convergence signals
Multiple models now cluster at GPQA 0.9, which historically indicates either genuine capability saturation at that level or benchmark overfitting. The research community has watched this pattern before with MMLU and HumanEval: scores plateau not because models stop improving, but because the benchmark stops discriminating. Harder variants like SWE-Bench Pro exist precisely to push that ceiling further out.
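Some quick arithmetic shows why a cluster at 0.9 stops being informative. Assuming scores come from the 198-question GPQA Diamond split, as is common in frontier reporting, binomial sampling noise alone spans roughly four points at 95% confidence:

```python
# At n=198 questions and ~0.9 accuracy, two models several points apart
# can be statistically indistinguishable on this benchmark.
import math

n, p = 198, 0.90
se = math.sqrt(p * (1 - p) / n)          # binomial standard error, ~0.021
lo, hi = p - 1.96 * se, p + 1.96 * se    # ~0.858 to ~0.942
print(f"95% interval: {lo:.3f} to {hi:.3f}")
```

Any two models whose true abilities fall inside that band will trade places run to run, which is exactly the "stops discriminating" failure mode.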
On the governance front, OpenAI announced a Safety Fellowship running from September 2026 through February 2027, funding external researchers on robustness, privacy, and agent oversight. Campus Technology covers the initiative, which mirrors established programs at Anthropic and Google DeepMind. The timing, as the EU AI Act moves into enforcement phases, appears deliberate rather than coincidental.
April's release density puts more capable models in practitioners' hands than any prior year's equivalent period. Yet the Anthropic transparency episode is a concrete reminder that benchmark leadership and consistent deployment quality are separate problems. Which labs manage both simultaneously will define the competitive map heading into 2027.
Frequently asked questions
What is Claude Opus 4.7's SWE-Bench score?
Anthropic reports 87.6% on SWE-Bench Verified and 64.3% on the harder SWE-Bench Pro, both significant improvements over Opus 4.6 on software engineering tasks.
What is Meta Muse Spark?
Muse Spark is the first model from Meta Superintelligence Labs, developed under Alexandr Wang following Meta's $14.3 billion investment in Scale AI. It targets speed and efficiency rather than frontier-level performance.
How does Claude Mythos Preview differ from Opus 4.7?
Mythos is a cybersecurity-focused frontier model restricted to select infrastructure partners. Opus 4.7 is general-purpose, publicly accessible via Claude.ai, the API, and Amazon Bedrock.
Why are so many models scoring 0.9 on GPQA?
GPQA is a graduate-level science reasoning benchmark. Clustering at 0.9 likely signals that the benchmark is saturating at the top end and no longer reliably distinguishing between leading models.
About the Author
Guilherme A.
Former dentist from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.
Connect on LinkedIn