Mercury 2: Inception Labs' Diffusion LLM at 1000 Tokens/Second

Inception Labs just released Mercury 2, hitting 1,009 tokens per second on NVIDIA Blackwell GPUs. That's not just fast. It's a fundamental departure from how every major language model works today.

Rather than predicting tokens one at a time like GPT-4 or Claude, Mercury 2 refines multiple tokens simultaneously through iterative denoising, a technique borrowed from diffusion models. The result: according to Inception Labs' official announcement, five times faster reasoning than leading speed-optimized competitors like GPT 5.2 Mini and Claude 4.5 Haiku, while maintaining comparable output quality.

Speed has always forced a choice in AI: deploy a fast but dumb model, or accept latency for intelligence. Mercury 2 breaks that tradeoff. The model ranks second on Copilot Arena's code quality benchmarks despite its extraordinary throughput.

With a 128K context window and native tool use support, it handles complex reasoning tasks that autoregressive models tackle at a fraction of the latency. This matters because real-time applications like voice interfaces, code completion, and agentic workflows have been forced to choose smaller, less capable models. Mercury 2 eliminates that compromise.

The architecture itself is what makes this possible. Instead of generating one token per pass through the model, Mercury 2's diffusion-based approach treats decoding as a refinement problem. It starts with a draft output and iteratively denoises it over a small number of steps, refining multiple positions simultaneously.

This parallelization is why GPUs with massive compute budgets like Blackwell can push throughput so high. The arXiv paper details how this iterative refinement achieves reasoning-grade quality without sequential prediction bottlenecks.

Pricing is aggressive. Inception Labs charges $0.25 per million input tokens and $0.75 per million output tokens, making Mercury 2 dramatically cheaper than competing reasoning models while delivering superior speed.

The model is available now, with OpenAI API compatibility for seamless production deployment. Support for schema-aligned JSON outputs and tool calling makes it practical for everything from RAG systems to agentic workflows.

What's genuinely significant here is the architectural shift this represents. For years, the LLM industry has optimized around the autoregressive transformer, treating token-by-token prediction as foundational. Mercury 2 proves that diffusion-based parallel decoding isn't just a theoretical alternative — it's production-ready and superior in practice for latency-sensitive applications. The real question isn't whether other labs will copy this approach, but how quickly the autoregressive paradigm will give way to diffusion-based architectures across the industry.