TL;DR
How NVIDIA's TensorRT FP4 quantization pipeline for DeepSeek R1 works, what the 15x throughput claim actually means, and what practitioners need to validate before deploying.
TensorRT Model Optimizer now ships an experimental FP4 quantization workflow for DeepSeek R1, and NVIDIA is pairing that release with a striking throughput claim: the same model workload on Blackwell GB200 NVL72 hardware delivers 15 times the performance of an H200-based Hopper system. For teams pricing out inference infrastructure, that figure changes the TCO conversation significantly.
DeepSeek R1 is a sensible test case for aggressive quantization. The model's mixture-of-experts architecture keeps the majority of parameters inactive on any given forward pass, making memory bandwidth, rather than raw compute, the primary bottleneck during autoregressive generation. FP4 cuts weight storage by roughly a factor of eight relative to FP32 and a factor of two relative to FP8, easing that constraint before any hardware improvements are counted.
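A quick back-of-the-envelope calculation makes the scale of that relief concrete. The parameter count below is DeepSeek R1's published total; the byte-per-parameter figures ignore scale factors, the KV cache, and activations, so treat the results as rough lower bounds on weight memory alone.

```python
# Back-of-the-envelope weight footprint for DeepSeek R1 (671B total parameters).
# Ignores quantization scale factors, the KV cache, and activations, so these
# are rough lower bounds on weight memory alone.
TOTAL_PARAMS = 671e9
BYTES_PER_PARAM = {"FP32": 4.0, "FP16/BF16": 2.0, "FP8": 1.0, "FP4": 0.5}

for fmt, nbytes in BYTES_PER_PARAM.items():
    gib = TOTAL_PARAMS * nbytes / 2**30
    print(f"{fmt:>9}: ~{gib:,.0f} GiB of weights")
```

At FP4 the full weight set fits in roughly half the memory FP8 needs, which is exactly the bandwidth pressure the MoE decoding path is most sensitive to.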
How the pipeline works
The optimization path runs through TensorRT Model Optimizer, which takes the full-precision model and produces an FP4 artifact that TensorRT-LLM can serve directly. An experimental integration also extends the workflow to vLLM, broadening the path for teams that depend on its batching and scheduling features in production. NVIDIA's developer portal has also published a separate evaluation of DeepSeek V3.2-Exp with fine-grained sparse attention in vLLM, suggesting the framework is becoming a fully supported second path alongside the proprietary NIM stack.
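A minimal sketch of what the quantization step looks like, assuming the API follows the patterns in the Model Optimizer documentation. The config name, calibration loop, and export helper below are assumptions rather than a verified recipe for R1, and a real run of a model this size needs multi-GPU sharding that is omitted here.

```python
# Hedged sketch: post-training FP4 quantization with TensorRT Model Optimizer.
# Config and helper names follow nvidia-modelopt's documented patterns but are
# assumptions here; check the modelopt release you install for the exact API.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
import modelopt.torch.quantization as mtq
from modelopt.torch.export import export_tensorrt_llm_checkpoint  # assumed helper

model_id = "deepseek-ai/DeepSeek-R1"  # illustrative; a real run needs multi-GPU sharding
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(model_id)

def calibrate(m):
    # A small set of representative prompts drives activation-range calibration.
    prompts = ["Explain KV caching in one paragraph.",
               "Summarize how mixture-of-experts routing works."]
    with torch.no_grad():
        for prompt in prompts:
            inputs = tokenizer(prompt, return_tensors="pt").to(m.device)
            m(**inputs)

# NVFP4_DEFAULT_CFG targets Blackwell's 4-bit floating-point tensor cores.
model = mtq.quantize(model, mtq.NVFP4_DEFAULT_CFG, forward_loop=calibrate)

# Export a checkpoint layout that TensorRT-LLM (and, experimentally, vLLM) can load.
# The decoder_type value is an assumption for illustration.
export_tensorrt_llm_checkpoint(model, decoder_type="deepseek", export_dir="./r1-fp4")
```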
TensorRT-LLM remains the primary inference engine for data center deployments. The optimization chain from Model Optimizer to TensorRT-LLM to NIM is designed to let teams move from prototype to production without leaving NVIDIA's tooling. For operators already on Blackwell hardware, the 15x throughput figure bundles NVLink bandwidth improvements with quantization gains, so it should not be read as a purely software effect.
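On the serving side, a sketch of loading the resulting checkpoint through TensorRT-LLM's Python LLM API; the checkpoint path and parallelism setting are placeholders, not a validated configuration.

```python
# Hedged sketch: serving the FP4 checkpoint through TensorRT-LLM's Python LLM API.
# The path and tensor-parallel size are placeholders for a real deployment.
from tensorrt_llm import LLM, SamplingParams

llm = LLM(model="./r1-fp4", tensor_parallel_size=8)
params = SamplingParams(max_tokens=256, temperature=0.6)

outputs = llm.generate(["Walk through the proof that sqrt(2) is irrational."], params)
for out in outputs:
    print(out.outputs[0].text)
```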
The quality caveat
Practitioners should treat the throughput claim with some skepticism. FP4 introduces greater precision loss than FP8 or BF16, and reasoning-heavy models are particularly exposed: numerical artifacts from low-bit quantization can surface in long-context chains of thought, which is exactly where R1 is designed to excel. NVIDIA has not published per-task quality degradation curves alongside the performance numbers, limiting independent verification of the tradeoff.
The vLLM integration carries a separate caution. Experimental status means failure modes under real production query distributions are not yet characterized. Teams that chose vLLM specifically to avoid single-vendor dependency should validate the integration against their own workloads before relying on it for live traffic.
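A practical way to run that validation is a side-by-side comparison against whatever higher-precision deployment is currently trusted. The sketch below assumes both builds are exposed through OpenAI-compatible endpoints, which vLLM and TensorRT-LLM can both provide; the URLs, served model name, sample file, and scoring function are placeholders for the workload's own metric.

```python
# Hedged sketch: A/B-compare an FP4 deployment against a higher-precision baseline
# on a sample of real production prompts. Endpoint URLs, the served model name,
# and the quality metric are placeholders.
import json
from openai import OpenAI

BASELINE = OpenAI(base_url="http://baseline-host:8000/v1", api_key="unused")
CANDIDATE = OpenAI(base_url="http://fp4-host:8000/v1", api_key="unused")

def answer(client, prompt):
    resp = client.chat.completions.create(
        model="deepseek-r1",  # served model name is deployment-specific
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,
        max_tokens=1024,
    )
    return resp.choices[0].message.content

def score(reference, candidate):
    # Substitute a task-appropriate metric: exact match, pass@1, a judge model, etc.
    return float(reference.strip() == candidate.strip())

with open("production_sample.jsonl") as f:
    prompts = [json.loads(line)["prompt"] for line in f]

scores = [score(answer(BASELINE, p), answer(CANDIDATE, p)) for p in prompts]
print(f"Agreement with baseline on {len(scores)} prompts: {sum(scores)/len(scores):.1%}")
```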
Speed of the release cycle
The pace of open-weight model releases gives context for why this tooling update matters now. llm-stats.com shows DeepSeek V4-Flash and V4-Pro shipping in late April 2026, followed by further variants within weeks. Each new architecture requires updated quantization recipes, and optimization tooling has consistently lagged behind the model release cadence. Shipping FP4 support for R1 while the V4 series is already in deployment reflects an attempt to close that gap before it widens further.
At pricepertoken.com, inference pricing for DeepSeek variants has been falling steadily alongside the release cadence. If FP4 deployment delivers even half the claimed throughput improvement under real query distributions, operators running self-hosted clusters will see meaningful reductions in cost per token, which feeds directly into competitive pricing pressure across the market.
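The cost arithmetic is simple enough to sanity-check directly. The dollar and throughput figures below are illustrative assumptions, not numbers from NVIDIA's announcement, and the comparison ignores the higher hourly price of a Blackwell node relative to Hopper.

```python
# Hypothetical cost-per-token calculation. The hourly cost and tokens/sec are
# illustrative placeholders, not measured values.
gpu_hour_cost = 40.0   # $/hour for the serving node (assumption)
baseline_tps = 5_000   # aggregate output tokens/sec on the baseline (assumption)

def cost_per_million_tokens(tps, hourly_cost):
    tokens_per_hour = tps * 3600
    return hourly_cost / tokens_per_hour * 1e6

for speedup in (1.0, 7.5, 15.0):  # baseline, "half the claim", and the full claim
    c = cost_per_million_tokens(baseline_tps * speedup, gpu_hour_cost)
    print(f"{speedup:4.1f}x throughput -> ${c:.3f} per million output tokens")
```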
The artificial intelligence infrastructure stack has followed a consistent generational pattern: FP16 became standard on Volta, FP8 matured on Hopper, and FP4 appears positioned as the Blackwell-era default for large-scale serving. Teams building new clusters now should plan for FP4-capable workflows from the start rather than retrofitting later. Whether the vLLM integration reaches production stability quickly, and whether NVIDIA publishes the quality benchmarks practitioners need to justify the format switch, will determine how broadly this workflow spreads beyond operators already committed to the full NVIDIA stack.
FAQ
What is FP4 quantization and how does it differ from FP8?
FP4 uses four-bit floating-point representations for model weights instead of eight-bit. The halved storage requirement reduces memory bandwidth pressure during inference but introduces greater precision loss, which can affect output quality on tasks requiring sustained numerical accuracy such as long-context reasoning.
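For intuition about where that precision loss comes from: the E2M1 layout commonly used for FP4 weights (one sign bit, two exponent bits, one mantissa bit) encodes only sixteen code points, so every weight in a block must round to one of eight magnitudes after the block's shared scale factor is applied. A quick enumeration under that assumed layout:

```python
# Enumerate the positive magnitudes representable in E2M1 (bias 1).
# Every FP4 weight rounds to +/- one of these values, scaled per block.
magnitudes = []
for exp_bits in range(4):          # 2-bit exponent field
    for mantissa_bit in range(2):  # 1-bit mantissa field
        if exp_bits == 0:          # subnormal range
            value = mantissa_bit * 0.5
        else:
            value = (1 + mantissa_bit * 0.5) * 2 ** (exp_bits - 1)
        magnitudes.append(value)
print(sorted(set(magnitudes)))     # [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
```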
Does TensorRT FP4 quantization work with vLLM?
NVIDIA has shipped an experimental integration, meaning the workflow is functional but not yet validated for production traffic. Teams relying on vLLM's scheduling features should benchmark carefully before committing the pipeline to live inference workloads.
How reliable is the 15x throughput claim over H200?
The figure covers DeepSeek R1 at specific context lengths under benchmark conditions that favor high hardware utilization. Real-world gains for typical query distributions are likely lower, and the number should be treated as an upper bound pending independent replication.
Which DeepSeek models support FP4 via TensorRT?
NVIDIA has detailed the workflow for DeepSeek R1 specifically. Support for the V4 model family has not been confirmed in the same depth, though TensorRT-LLM is designed to extend optimization tooling across model generations as they are released.
About the Author
Guilherme A.
Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.
Connect on LinkedIn