TL;DR
Claude Opus 4.7 joins Qwen3.6 and Meta Muse Spark at GPQA 0.9, signaling a frontier plateau as Anthropic pivots to product integration to maintain an edge.
On April 16, Anthropic released Claude Opus 4.7, posting a GPQA Diamond score of 0.9. The same score was logged that same day by Alibaba's open-weight Qwen3.6-35B-A3B, and eight days earlier by Meta's Muse Spark. Three labs, three different architectures, one number. That convergence is the real headline.
GPQA tests models on questions that confound credentialed scientists in biology, chemistry, and physics. Hitting 0.9 means answering doctoral-level problems correctly nine times out of ten. llm-stats.com tracks releases in near real time, and the April 2026 log reads like a scoreboard where everyone has the same score. Anthropic's own Claude Mythos Preview, released April 7, also sits at 0.9. What was a frontier milestone a year ago is now the expected baseline for competitive models.
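That "nine times out of ten" deserves a statistical footnote. A minimal sketch, assuming GPQA Diamond's published size of roughly 198 questions: on a set that small, the confidence band around a 0.9 score is wide enough that nearby scores are hard to tell apart.

```python
# How precise is a 0.9 score on a fixed benchmark?
# Assumes GPQA Diamond's published size of 198 questions.
import math

questions = 198   # GPQA Diamond question count (assumption stated above)
accuracy = 0.9    # reported score

# Standard error of a binomial proportion: sqrt(p * (1 - p) / n)
se = math.sqrt(accuracy * (1 - accuracy) / questions)
low, high = accuracy - 1.96 * se, accuracy + 1.96 * se

print(f"standard error: {se:.3f}")               # ~0.021
print(f"95% interval: {low:.3f} to {high:.3f}")  # ~0.858 to 0.942
```

By that arithmetic, two models separated by a point or two on GPQA Diamond overlap within a single standard error, which makes the April pile-up at exactly 0.9 less surprising and the plateau reading more credible.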
The market reaction
The announcement landed harder in design tooling than in any model benchmark forum. One day after Opus 4.7 shipped, Anthropic unveiled Claude Design, a text-to-visual generation tool powered by the new model. Gizmodo reported that Figma's stock dropped immediately on the news. The tool accepts plain-language prompts and uploaded codebases or design files, applies a team's existing color and typography system automatically, and lets users refine outputs through conversation, inline edits, or custom sliders. Finished projects export as PDFs, PowerPoints, or directly into Canva, with a handoff path to Claude Code for building designs into working software.
This is Anthropic running the playbook OpenAI established: wrap each model generation in product surfaces that practitioners already care about. Rolling out as a research preview to Claude Pro, Max, Team, and Enterprise subscribers, Claude Design targets the early-stage ideation workflows where Figma has historically dominated. Whether it displaces that workflow or simply forks it is still an open question, but Wall Street apparently did not wait for the answer.
Benchmark plateau, product differentiation
The simultaneous arrival of multiple models at GPQA 0.9 raises a question that every serious AI model review will need to address: when frontier models score identically on the hardest publicly available benchmarks, what separates them? The answer is increasingly latency, cost, integration depth, and context-window reliability, not a leaderboard number. Anthropic's own release cadence hints at this. Opus 4.7 is paired with a product launch; Mythos Preview, also at 0.9, appears positioned as a longer-horizon reasoning system for different use cases entirely.
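To make the trade-off concrete, here is an illustrative sketch of how a team might rank models once the headline benchmark ties. Every model name and figure below is a hypothetical placeholder, not a measured value; the structure of the comparison, not the numbers, is the point.

```python
# Illustrative only: when benchmark scores tie, selection becomes a
# weighted trade-off across operational axes. All figures are
# hypothetical placeholders, not measurements.
candidates = {
    "model_a": {"gpqa": 0.90, "p50_latency_s": 1.2, "usd_per_mtok": 15.0, "ctx_reliability": 0.95},
    "model_b": {"gpqa": 0.90, "p50_latency_s": 0.6, "usd_per_mtok": 2.0, "ctx_reliability": 0.88},
}

weights = {"gpqa": 0.2, "latency": 0.3, "cost": 0.3, "ctx": 0.2}

def score(m: dict) -> float:
    # Map latency and cost so that lower raw values yield higher scores.
    return (weights["gpqa"] * m["gpqa"]
            + weights["latency"] * (1 / (1 + m["p50_latency_s"]))
            + weights["cost"] * (1 / (1 + m["usd_per_mtok"]))
            + weights["ctx"] * m["ctx_reliability"])

for name, metrics in sorted(candidates.items(), key=lambda kv: -score(kv[1])):
    print(f"{name}: {score(metrics):.3f}")
```

With the GPQA axis tied, the ranking is decided entirely by the operational columns, which is the practical meaning of a benchmark plateau.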
Qwen3.6-35B-A3B reaching the same benchmark score while remaining open-weight changes the economics of that comparison. Open-weight models allow fine-tuning, self-hosting, and weight inspection that proprietary APIs do not permit. The EU AI Act, now in implementation across member states, already draws regulatory distinctions between open and closed deployments, distinctions that are quietly shaping enterprise procurement decisions as compliance teams factor model governance into vendor evaluations.
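For teams weighing that calculus, the mechanics of self-hosting are not the barrier they once were. A minimal sketch using Hugging Face transformers, noting that the repository id below is an assumption inferred from the model name reported here, not a confirmed hub listing:

```python
# Minimal self-hosting sketch with Hugging Face transformers.
# The repo id is an assumption based on this article's model name;
# verify the actual listing before use.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen3.6-35B-A3B"  # hypothetical repository id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",  # keep the checkpoint's native precision
    device_map="auto",   # shard across available GPUs
)

inputs = tokenizer("Summarize what GPQA Diamond measures.", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Weight inspection and fine-tuning follow the same path, which is exactly the flexibility a proprietary API cannot offer and the reason compliance teams treat the two deployment modes differently.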
For practitioners, the broader picture
Meta's Muse Spark, covered by CNBC as the first major model from Meta Superintelligence Labs under Alexandr Wang, is explicitly framed as a small, fast, efficient model rather than a flagship. Meta described it as a powerful foundation with the next generation already in development. That framing is telling: the company is treating GPQA 0.9 as a starting point, not a destination. For teams tracking competitive AI capabilities, the floor just moved.
None of the benchmark numbers address the reliability problem. As Gizmodo noted in its Claude Design coverage, large language models have historically struggled with consistent output quality in generative visual tasks. A 0.9 GPQA score describes scientific reasoning over a fixed question set, not generative fidelity across an open-ended design space. These are different capability axes, and conflating them is one of the more persistent errors in how model releases get reported.
The more interesting inflection point heading into late 2026 is not whether any single model holds the GPQA top spot, but whether bundling frontier model releases with product launches becomes the dominant go-to-market strategy. If it does, the competitive moat shifts from benchmark leadership to distribution and integration depth. Anthropic appears to have made that bet with Opus 4.7 and Claude Design. Whether open-weight alternatives can match it on product surface, rather than just raw performance, is the question worth watching.
Frequently asked questions
What is GPQA and why does a 0.9 score matter?
GPQA (Graduate-Level Google-Proof Q&A) tests models on PhD-level questions in biology, chemistry, and physics that stump most domain experts. A 0.9 score means 90% accuracy on that set. It was a frontier target as recently as late 2025; it is now shared by at least five models released in April 2026 alone.
Is Claude Opus 4.7 the most capable model currently available?
On GPQA, it is tied with Qwen3.6-35B-A3B, Meta Muse Spark, Claude Mythos Preview, and Zhipu AI's GLM-5.1, all of which posted the same score in April. Differentiation now hinges on latency, cost, context handling, and task-specific benchmarks rather than a single aggregate number.
What is Claude Design and who can use it?
Claude Design is Anthropic's text-to-visual tool powered by Opus 4.7, currently in research preview for Claude Pro, Max, Team, and Enterprise subscribers. It generates slide decks, prototypes, and marketing assets from text prompts, with export options for PDF, PowerPoint, and Canva.
Does Qwen3.6-35B-A3B matching Claude Opus 4.7 affect enterprise decisions?
For teams that can operationalize open-weight models, yes. Identical benchmark performance at open weights changes cost and compliance calculus, particularly under regulatory frameworks like the EU AI Act that treat open and closed model deployments differently.
About the Author
Guilherme A.
Former dentist (DDS) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.
Connect on LinkedIn