Six AI Models Push Cost and Throughput Benchmarks in 2025

TL;DR

Six recently released AI models, including Nvidia Nemotron 3 and GLM-5.2, raise the competitive bar on pricing, context length, and agentic throughput for production ML teams.

While GPT-5.2 and Gemini 3 Pro absorbed most of the attention last quarter, at least six other models landed in the same release window with capabilities practitioners should not overlook. Most are open-weight models, and at least one is quietly rewriting throughput economics for agentic pipelines.

Nvidia's Nemotron 3 family stands out as the most architecturally significant of the batch. The lineup targets multi-agent deployments specifically: setups where specialized agents exchange context, invoke tools, and coordinate across long task horizons. Its smallest variant, Nemotron 3 Nano, carries roughly 30 billion parameters but applies a mixture-of-experts design that activates only about 3 billion parameters per token. Per Humanity Redefined, Nvidia reports up to 4x throughput improvement over the prior Nano generation alongside roughly 60% fewer reasoning tokens per output, a combination that can meaningfully reduce inference costs at scale.
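To make those figures concrete, here is a back-of-the-envelope sketch in Python. The 30B and 3B parameter counts and the 60% reduction are the approximate figures reported above; the baseline of 2,000 reasoning tokens per request and the function names are hypothetical, chosen only for illustration.

```python
def active_fraction(active_params_b: float, total_params_b: float) -> float:
    """Share of parameters used per token in a mixture-of-experts model."""
    return active_params_b / total_params_b


def reduced_tokens(baseline_tokens: int, reduction: float) -> int:
    """Reasoning tokens left after applying a fractional reduction."""
    return round(baseline_tokens * (1.0 - reduction))


# Approximate figures reported for Nemotron 3 Nano.
nano_fraction = active_fraction(3, 30)

# Hypothetical baseline: 2,000 reasoning tokens per request on the prior generation.
nano_tokens = reduced_tokens(2_000, 0.60)

print(f"active share per token: {nano_fraction:.0%}")
print(f"reasoning tokens per request: {nano_tokens}")
```

At a fixed per-token output price, spend on reasoning tokens scales with the token count, so a 60% reduction would cut that part of the bill by the same proportion. Both reported numbers are vendor figures, and the throughput gain is stated as "up to" 4x, so they are best treated as upper bounds until measured on your own workload.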

Nemotron 3 Super, at roughly 100 billion parameters, targets the reasoning-heavy orchestration role in multi-agent pipelines. At the top, Ultra reaches 550 billion total parameters with 55 billion active per token. Price Per Token lists Ultra via Nebius at $1.00 per million input tokens and $3.00 per million output tokens with a 1 million token context window, pricing that is competitive with models of considerably smaller total parameter count.

Beyond Nemotron

Z.ai's GLM-5.2 attracted disproportionate attention in the frontend engineering community. Price Per Token highlighted Latent Space's framing of it as the top frontend coding model globally, though that claim awaits rigorous third-party validation. MiniMax M3 also landed this cycle, available on Together and Morph at $0.30 to $0.60 input and $1.20 to $2.40 output per million tokens, with a 1 million token context window suited for long-document workflows.

Stepfun's step-3.7-flash enters at $0.20 input and $1.15 output per million tokens with a 256K context window via DeepInfra and Novita. Qwen3.7 Plus rounds out the open tier at $0.32 in and $1.28 out with a 1 million token context through Together. None command household recognition, but each occupies a specific position in the cost-performance matrix that applied scientists actually optimize against.
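One way to read these price points is to apply them to a fixed workload. The sketch below is an illustration only: the per-million-token prices are the ones quoted in this article (with the MiniMax M3 range represented by its lower bound), while the daily volume of 50 million input tokens and 10 million output tokens is a hypothetical assumption, as are the class and function names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Price:
    """USD per million tokens, as quoted in this article."""
    input_per_m: float
    output_per_m: float


# Prices quoted in the article; the MiniMax M3 range uses its lower bound.
PRICES = {
    "Nemotron 3 Ultra (Nebius)": Price(1.00, 3.00),
    "MiniMax M3 (low end)": Price(0.30, 1.20),
    "step-3.7-flash": Price(0.20, 1.15),
    "Qwen3.7 Plus": Price(0.32, 1.28),
}


def workload_cost(price: Price, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD for a given number of input and output tokens."""
    total = input_tokens * price.input_per_m + output_tokens * price.output_per_m
    return total / 1_000_000


# Hypothetical daily workload: 50M input tokens, 10M output tokens.
daily = {name: workload_cost(p, 50_000_000, 10_000_000) for name, p in PRICES.items()}

# Print cheapest first.
for name, cost in sorted(daily.items(), key=lambda kv: kv[1]):
    print(f"{name:28s} ${cost:,.2f}/day")
```

Price is only one axis. step-3.7-flash has a 256K context window while the others listed here offer 1 million tokens, so a cost ranking like this one only applies to workloads that fit every candidate's context limit.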

Google shipped two image-generation variants in quick succession, both available via Google AI Studio. Mapify's model guide positions Gemini 3.1 Flash Image at $0.50/$3.00 per million tokens for rapid creative iteration, while Gemini 3 Pro Image at $2.00/$12.00 targets higher-fidelity multimodal output. The pairing follows a familiar speed-quality tradeoff Google has pursued across the entire Gemini 3 lineup.

Why portfolio diversity matters now

Recent reliability data adds urgency to the case for knowing your alternatives. International Business Times reported that Anthropic's Claude saw error rates reach roughly 10 percent across Opus, Sonnet, and Haiku simultaneously during a June 16 disruption. Between June 8 and June 17, Anthropic logged multiple incidents, some occurring within hours of each other, with effects spanning the chatbot, API, and developer tooling. Anthropic has since restored service, but the episode surfaced a risk that AI platform buyers routinely underweight: single-provider dependency for production workloads.

A 10 percent error rate is recoverable; a workflow with no tested fallback is not. Teams that have already evaluated alternatives from Nvidia, Qwen, or MiniMax can reroute quickly, while those that haven't are left watching status pages.
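A minimal sketch of what a fallback path can look like, assuming each provider is wrapped behind a plain Python callable. Everything here is hypothetical: `complete_with_fallback`, the `Provider` alias, and the two stand-in providers are illustrative names, not any vendor's SDK, and a real wrapper would add timeouts, retries, and logging.

```python
from typing import Callable, Sequence

# A provider is any callable that takes a prompt and returns text.
# Real clients would wrap each vendor's SDK; those wrappers are not shown here.
Provider = Callable[[str], str]


class AllProvidersFailed(RuntimeError):
    """Raised when every provider in the chain has errored."""


def complete_with_fallback(
    prompt: str, providers: Sequence[tuple[str, Provider]]
) -> tuple[str, str]:
    """Try providers in order; return (provider_name, response) from the first success."""
    errors: list[str] = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception as exc:  # broad on purpose: any failure triggers fallback
            errors.append(f"{name}: {exc}")
    raise AllProvidersFailed("; ".join(errors))


# Stand-in providers for demonstration only.
def flaky_primary(prompt: str) -> str:
    raise TimeoutError("simulated outage")


def healthy_secondary(prompt: str) -> str:
    return f"echo: {prompt}"


used, reply = complete_with_fallback(
    "summarize this ticket",
    [("primary", flaky_primary), ("secondary", healthy_secondary)],
)
print(used, reply)
```

The routing logic is the easy part. The fallback only counts as tested once the secondary model has been evaluated on the same workload, since prompts tuned for one model do not necessarily transfer to another.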

Across all six releases, a pattern is consolidating. Mixture-of-experts architectures, 1 million token context windows, and sub-dollar input pricing are becoming table stakes rather than differentiators. What remains genuinely uncertain is how well these models hold up beyond standard AI benchmark suites. Complex multi-step reasoning, rare-domain knowledge, and long-horizon planning all remain areas where evaluation scores and real practitioner experience diverge in ways the research community has not resolved.

The throughput and cost claims are verifiable. The reasoning claims have to be stress-tested against your own production workloads, not read off a spec sheet.

FAQ

What is Nvidia Nemotron 3?
Nemotron 3 is Nvidia's open model family built for agentic and multi-agent systems. Its three variants span roughly 30B to 550B total parameters using mixture-of-experts designs. For the Nano variant, Nvidia reports up to 4x the throughput of the prior Nano generation along with roughly 60% fewer reasoning tokens per output.

What is GLM-5.2 and who makes it?
GLM-5.2 is a large language model from Z.ai, available on OpenRouter. It has gained practitioner attention in frontend coding tasks, though independent cross-model benchmark comparisons remain limited as of mid-2026.

What caused the Anthropic Claude outages in June 2026?
Anthropic attributed the disruptions to elevated error rates and capacity issues. Between June 8 and June 17, multiple incidents affected Opus, Sonnet, and Haiku simultaneously, with some occurring within hours of each other before service was restored.

How does MiniMax M3 compare to other long-context models?
MiniMax M3 offers a 1 million token context window at $0.30 to $0.60 per million input tokens, positioning it as a cost-competitive option for long-document processing alongside similarly priced alternatives from Qwen and Stepfun.

About the Author

Guilherme A.

Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.
