AIResearchAIResearch
Machine Learning

NVIDIA Claims 15x Throughput Gain for Open-Weight Models on Blackwell

NVIDIA's TensorRT-LLM, NIM, and FP4 quantization cut open-weight inference costs as NVIDIA Nemotron 3 Ultra ships and OpenAI expands frontier model access via AWS Bedrock.

3 min read
NVIDIA Claims 15x Throughput Gain for Open-Weight Models on Blackwell

TL;DR

NVIDIA's TensorRT-LLM, NIM, and FP4 quantization cut open-weight inference costs as NVIDIA Nemotron 3 Ultra ships and OpenAI expands frontier model access via AWS Bedrock.

NVIDIA's benchmark numbers for its Blackwell generation are striking: DeepSeek R1 running on a GB200 NVL72 cluster delivers 15x the inference throughput of the same model on Hopper H200 hardware. For teams evaluating where to run open-weight reasoning models in production, that gap is now too large to treat as a footnote in a hardware procurement debate.

The figure comes from NVIDIA's developer platform, which has expanded its catalog of community-built models tuned for NVIDIA infrastructure. The platform now covers DeepSeek's mixture-of-experts family, Google's Gemma series, and NVIDIA's own Nemotron line, each optimized through a combination of TensorRT-LLM compilation, NIM containerized microservices, and the open-source NeMo customization framework. Rapid API prototyping through NIM and full fine-tuning through NeMo sit on the same toolchain, which reduces the gap between experimentation and production.

For practitioners targeting DeepSeek at scale, NVIDIA has made FP4 quantization of DeepSeek R1 available through its TensorRT Model Optimizer, with experimental vLLM deployment now included. FP4 compresses model weights to four bits, cutting memory footprint substantially compared to FP16 baselines. Running a 671-billion-parameter MoE model on a finite cluster becomes a different proposition when each weight occupies a quarter of the space.

The quantization tradeoff

FP4 is not free. Aggressive four-bit quantization degrades performance on tasks that depend on numerical precision, and NVIDIA's own documentation labels the vLLM path as experimental. Any team using this path should run accuracy benchmarks against their specific workload before treating the throughput numbers as a final answer. The headline gain is a ceiling, not a floor.

NVIDIA has co-optimized Google's Gemma family with DeepMind for both Blackwell and Hopper hardware, extending performance benefits down to workstation-class deployments. Gemma's smaller parameter counts make it viable for local development loops, which matters for teams that cannot afford Blackwell access for iteration but want Blackwell behavior in production. The same TensorRT-LLM path applies across the hardware tiers.

Separately, Latent Space's AI newsletter flagged the release of NVIDIA Cosmos 3, Nemotron 3 Ultra, and the RTX Spark platform in the same reporting window. Nemotron 3 Ultra extends NVIDIA's strategy of shipping its own fine-tuned open-weight models alongside the inference toolchain, positioning the company as both hardware vendor and active participant in the open artificial intelligence model ecosystem rather than a passive accelerator of others' work.

Enterprise context

On the proprietary side, CIOL reported that OpenAI made GPT-5.5 and its Codex coding agent generally available on Amazon Web Services through Amazon Bedrock. The move puts frontier model access inside the compliance frameworks, billing systems, and security perimeters enterprises already operate. For engineering leaders comparing open-weight self-hosting against managed API access, that governance gap was historically a decisive factor in favor of proprietary deployments.

The two stories converge on the same structural shift. NVIDIA's optimization stack is making open-weight model deployment meaningfully cheaper per token at scale. Cloud providers are making proprietary model access easier to govern and audit. The artificial intelligence infrastructure layer is being compressed from both ends, and the decision calculus for practitioners is changing as a result.

For ML engineers and applied scientists, the relevant question is no longer which model family to pick but which cost-performance curve fits the specific job. A 15x throughput multiplier changes open-weight economics decisively, but only for teams with Blackwell access, which remains constrained for most organizations. Teams on Hopper hardware can still use TensorRT-LLM and NIM to extract meaningful efficiency gains over naive HuggingFace inference setups. Quantization quality and deployment tooling have joined parameter count and benchmark scores as first-class evaluation criteria, as llm-stats.com's model tracker illustrates with its growing coverage of inference-focused release metrics.

As open-weight models close the quality gap with proprietary alternatives on a widening share of tasks, the forward-looking question is whether inference cost advantages will eventually outweigh the governance convenience of managed APIs for the majority of enterprise workloads, or whether the compliance overhead of self-hosted deployment will keep the two paths roughly co-equal regardless of raw performance.

---

FAQ

What is TensorRT-LLM and how does it accelerate inference?
TensorRT-LLM is NVIDIA's open-source library that compiles transformer models into GPU-optimized execution graphs. It applies kernel fusion, weight quantization, and hardware-specific scheduling to reduce latency and increase throughput compared to standard PyTorch or HuggingFace Transformers inference pipelines.

What does FP4 quantization mean for model accuracy?
FP4 represents weights in four-bit floating point rather than the standard sixteen-bit format. Memory footprint drops and throughput rises, but performance on precision-sensitive tasks can degrade. NVIDIA's FP4 path for DeepSeek R1 via vLLM is currently marked experimental and requires workload-specific validation before production use.

How does NVIDIA NIM differ from deploying a model with standard containers?
NIM packages optimized models as containerized APIs with OpenAI-compatible endpoints, pre-configured TensorRT-LLM runtimes, and managed dependency stacks. Teams skip manual compilation and runtime configuration, trading some flexibility for a significantly faster path from model download to serving endpoint.

Is the 15x gain applicable to all hardware and context lengths?
No. The 15x figure compares Blackwell GB200 NVL72 against Hopper H200 at specific DeepSeek R1 context lengths of 8K and 1K tokens. Real-world throughput depends on batch size, sequence length distribution, and workload mix. Treat the headline number as a best-case upper bound under well-optimized production conditions.

About the Author

Guilherme A.

Guilherme A.

Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.

Connect on LinkedIn