Five Models Hit GPQA 0.9 in April as Anthropic Ships Claude Design

TL;DR

Benchmark convergence at the April 2026 frontier coincides with Anthropic's new Claude Design tool for prototypes, pitch decks, and brand assets.

Four organizations shipped models scoring 0.9 on the GPQA benchmark within roughly three weeks, from the end of March to mid-April. That kind of convergence at the top of the leaderboard is worth attention: it suggests the gap between frontier labs is closing faster than many expected.

According to LLM Stats, which tracks releases across the industry, Claude Opus 4.7 (Anthropic, April 16), Muse Spark (Meta, April 8), GLM-5.1 (Zhipu AI, April 7), and Qwen3.6 Plus (Alibaba Cloud, March 31) all landed at GPQA 0.9. Claude Mythos Preview, a separate Anthropic research release on April 7, also hit that mark. That makes five distinct models from four organizations at one benchmark tier in the span of a few weeks.
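The tally above can be checked with a short sketch. It uses only the model names, organizations, and dates quoted in this article; the year 2026 for the March 31 release and the `summarize` helper are assumptions made for illustration, not anything published by LLM Stats.

```python
from collections import defaultdict
from datetime import date

# Releases reported at GPQA 0.9, as listed in the article.
# The year is assumed to be 2026 for every entry.
RELEASES = [
    ("Qwen3.6 Plus", "Alibaba Cloud", date(2026, 3, 31)),
    ("GLM-5.1", "Zhipu AI", date(2026, 4, 7)),
    ("Claude Mythos Preview", "Anthropic", date(2026, 4, 7)),
    ("Muse Spark", "Meta", date(2026, 4, 8)),
    ("Claude Opus 4.7", "Anthropic", date(2026, 4, 16)),
]


def summarize(releases):
    """Count models and organizations, and measure the release window in days."""
    by_org = defaultdict(list)
    for model, org, _released in releases:
        by_org[org].append(model)
    dates = [released for _model, _org, released in releases]
    return {
        "models": len(releases),
        "organizations": len(by_org),
        "span_days": (max(dates) - min(dates)).days,
        "by_org": dict(by_org),
    }


if __name__ == "__main__":
    summary = summarize(RELEASES)
    print(summary["models"], "models from", summary["organizations"], "organizations")
    print("release window:", summary["span_days"], "days")
```

Run as a script, this reports five models from four organizations, with 16 days between the first and last release.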

GPQA — Graduate-Level Google-Proof Q&A — is a dataset of expert-level science questions designed to resist simple lookup. A 0.9 score means the model answers correctly nine times out of ten on questions that stump most PhD holders outside their own specialty. It is a demanding bar, and it is no longer exclusive.

The open-source picture

Not every April release reached that ceiling. Google's Gemma 4 family, which landed on April 2 across four variants, shows meaningful spread: the 31B dense model scores 0.8, while the smallest mixture-of-experts variant, Gemma 4 E2B, reaches only 0.4. That spread is useful information for practitioners. Efficiency-focused deployments can pick a point on the cost-capability curve rather than defaulting to the largest model available.
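One way to picture "picking a point on the cost-capability curve" is a small selection routine: set a minimum acceptable score, then take the cheapest model that clears it. The sketch below is illustrative only. The GPQA scores are the two quoted above; the `relative_cost` values are hypothetical placeholders (not measured prices or parameter counts), and `cheapest_meeting` is a made-up helper name.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Candidate:
    name: str
    gpqa: float           # score as quoted in the article
    relative_cost: float  # HYPOTHETICAL cost proxy, not from the article


def cheapest_meeting(candidates: Iterable[Candidate], min_gpqa: float) -> Optional[Candidate]:
    """Return the lowest-cost candidate whose score meets the threshold, or None."""
    eligible = [c for c in candidates if c.gpqa >= min_gpqa]
    return min(eligible, key=lambda c: c.relative_cost, default=None)


# Only the two Gemma 4 variants with scores quoted in the article are listed.
# The cost figures are placeholders chosen purely to make the example run.
GEMMA_4 = [
    Candidate("Gemma 4 31B", gpqa=0.8, relative_cost=31.0),
    Candidate("Gemma 4 E2B", gpqa=0.4, relative_cost=2.0),
]

if __name__ == "__main__":
    for threshold in (0.3, 0.7, 0.9):
        choice = cheapest_meeting(GEMMA_4, threshold)
        print(threshold, "->", choice.name if choice else "no model qualifies")
```

With these placeholder costs, a 0.3 threshold selects the E2B variant, a 0.7 threshold selects the 31B model, and a 0.9 threshold returns nothing from this family. Real deployments would swap in measured latency or price for the cost proxy.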

GLM-5.1 from Zhipu AI is notable for a different reason: it is an open-weight release and hits 0.9. Open-weight models have been closing on proprietary alternatives across multiple benchmark cycles, but matching frontier scores on GPQA specifically, a test of scientific reasoning rather than language fluency alone, marks a meaningful step for the open-weight AI ecosystem.

The Claude Design announcement

Anthropic did not stop at a model update. On April 17, the company launched Claude Design, a research preview powered by Opus 4.7. As 9to5Mac reported, Claude Design joins Claude Cowork and Claude Code in Anthropic's Mac application suite, positioning it as a third pillar of a broader productivity platform.

The workflow starts with brand extraction. During onboarding, the tool reads a team's codebase and design files to construct a design system — colors, typography, components — that then applies automatically across projects. Users can prompt in natural language, upload images, DOCX, PPTX, or XLSX files, or use a web capture tool to pull visual elements directly from a live site. Output formats include PDF, PPTX, Canva export, and standalone HTML, with handoff to Claude Code built in from launch.

MacRumors notes that Anthropic positions Opus 4.7 specifically as its most capable vision model, with improved resolution handling and what the company describes as better aesthetic judgment on professional tasks. The distinction from image generators matters for enterprise buyers: Claude Design produces structured layouts from brand assets, not synthetic imagery. The target users Anthropic names — founders, product managers, and marketers without a design background — signal a move toward the creative productivity space that tools like Canva already occupy.

Market reaction and what it signals

Gizmodo reported that Figma's stock dropped immediately after the Claude Design announcement, with Wall Street reading the product as a direct threat to design software incumbents. Whether that reaction holds is a separate question from whether the underlying model delivers at production scale. Research previews can impress in demos and still underperform under sustained enterprise load.

Anthropic is careful to frame Claude Design as complementary to existing tools. Canva's CEO appears in the launch press release describing seamless handoff between the two products — a framing that contains real strategic information. A product that integrates with existing stacks reaches adoption faster than one that demands workflow replacement. The Claude Code handoff, already present at launch, further suggests Anthropic is prioritizing developer-to-designer pipelines over direct competition with dedicated design tools.

Taken together, the April 2026 releases point to commoditization at the benchmark ceiling and product differentiation below it. When five models hit the same GPQA score within a few weeks of one another, raw capability numbers stop functioning as competitive moats. What differentiates Anthropic is not the 0.9 score, which Meta and Alibaba have too, but the product surface on top: Design, Cowork, and Code as an integrated suite targeting different stages of the build cycle.

The question for the next release cycle is whether open-weight models like GLM-5.1 will develop comparable application layers, or whether the pattern from earlier LLM generations repeats: capability parity on benchmarks, but proprietary products capturing most of the deployment value.

Frequently asked questions

What is GPQA and why does a 0.9 score matter?
GPQA (Graduate-Level Google-Proof Q&A) tests expert scientific reasoning on questions designed to resist internet lookup. A 0.9 score means the model answers correctly on 90 percent of questions that stump most PhD holders outside their specialty, making it one of the more rigorous capability benchmarks currently in wide use.

Is Claude Design a replacement for Figma?
Not directly, at least at launch. Claude Design produces structured layouts from prompts and brand assets, with exports to Canva, PDF, and PPTX. It integrates with Claude Code but does not replicate Figma's collaborative vector editing environment. Anthropic explicitly frames it as a complement to existing tools, including Canva.

Which April 2026 models are open source?
GLM-5.1 from Zhipu AI and Google's Gemma 4 family are open-weight releases. GLM-5.1 hits GPQA 0.9, while the Gemma 4 variants range from 0.4 to 0.8 depending on model size and architecture.

Who can access Claude Design right now?
Claude Design is available as a research preview for paid subscribers on Pro, Max, Team, and Enterprise tiers. Enterprise access is off by default and must be enabled by administrators. The rollout is gradual and was ongoing as of April 17.

About the Author

Guilherme A.

Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.
