DeepSeek open-sources V4 LLMs with 1.6T-parameter flagship


TL;DR

DeepSeek's V4 series introduces hybrid attention compression and sparse MoE activation, cutting inference memory costs for open-weight LLM deployments.

DeepSeek released two open-source large language models on April 24, pushing its V4 series into the growing stack of competitive open-weight options. The flagship V4-Pro holds 1.6 trillion parameters and activates 49 billion during inference. V4-Flash, the smaller variant, carries 284 billion total parameters with 13 billion active at runtime.

Both models run on a mixture-of-experts architecture, where a learned routing mechanism directs each token through a subset of specialized subnetworks rather than the full model. This is not a new pattern for DeepSeek, but the scale of sparsity is notable: V4-Pro activates roughly 3% of its total parameters per inference pass, which is aggressive even by MoE standards, as SiliconAngle reports.
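The sparsity figures are easy to check against the published parameter counts; a quick back-of-envelope sketch:

```python
def activation_ratio(active_params: float, total_params: float) -> float:
    """Fraction of total parameters a sparse MoE model activates per token."""
    return active_params / total_params

# Parameter counts as reported for the two V4 variants.
v4_pro = activation_ratio(49e9, 1.6e12)
v4_flash = activation_ratio(13e9, 284e9)

print(f"V4-Pro:   {v4_pro:.1%} of parameters active")   # ~3.1%
print(f"V4-Flash: {v4_flash:.1%} of parameters active") # ~4.6%
```

Note that V4-Flash is slightly less sparse in relative terms, even though its absolute active-parameter count is far smaller.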

The attention mechanism

V4's most consequential technical claim centers on key-value cache management. Standard transformer attention encodes each prompt into a KV cache that must be held in GPU memory throughout inference, growing with context length and batch size. DeepSeek's hybrid attention design applies two distinct compression schemes to this cache, cutting its memory footprint by 90% relative to prior-generation models, SiliconAngle notes. For teams running long-context workloads or serving many concurrent users, that compression ratio could translate directly into lower per-request hardware costs.
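To see why that matters, here is the standard uncompressed KV cache size formula with the 90% figure applied. The layer and head dimensions below are purely illustrative, since DeepSeek has not published V4's architecture:

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, batch: int, dtype_bytes: int = 2) -> int:
    """Memory for a standard (uncompressed) KV cache: one key and one
    value vector per layer, per attention head, per cached token."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * dtype_bytes

# Illustrative dimensions only -- not DeepSeek's actual architecture.
baseline = kv_cache_bytes(n_layers=64, n_kv_heads=64, head_dim=128,
                          seq_len=128_000, batch=8)  # fp16: 2 bytes/element
compressed = baseline * 0.10  # the reported 90% reduction

print(f"baseline:   {baseline / 2**30:,.0f} GiB")    # 2,000 GiB
print(f"compressed: {compressed / 2**30:,.0f} GiB")  # 200 GiB
```

Even under these made-up dimensions, the difference between roughly 2 TiB and 200 GiB of cache is the difference between a multi-node deployment and a single server for the same long-context batch.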

Training-side changes were also announced, though details remain sparse. DeepSeek has documented inter-layer data movement optimization in previous releases, and V4 continues in that direction. The lab has not disclosed training compute or dataset composition, which limits independent analysis of whether the efficiency gains represent genuine algorithmic progress or a favorable data regime.

A crowded release window

The V4 launch hit during one of the busiest model weeks on record. According to llm-stats.com, April 23 alone saw four major releases, including two GPT-5.5 variants from OpenAI. DeepSeek itself shipped four models in the same window, with V4-Pro-Max and V4-Flash-Max variants joining the base versions, mirroring the tiered product structures that OpenAI and Anthropic have built for their own families.

On the deployment side, both core V4 models appeared on OpenRouter within hours of the announcement, as tracked by Price Per Token. Fast API availability has become standard practice for open-weight releases, but it still signals how quickly open-source models have closed the gap with proprietary APIs on deployment convenience, not just benchmark numbers.
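OpenRouter serves an OpenAI-compatible chat completions endpoint, so calling V4 looks like calling any other hosted model. The model slug below is a hypothetical placeholder; check openrouter.ai for the actual identifier:

```python
import json

# OpenRouter's OpenAI-compatible chat completions endpoint.
OPENROUTER_URL = "https://openrouter.ai/api/v1/chat/completions"

payload = {
    "model": "deepseek/deepseek-v4-pro",  # hypothetical slug, verify before use
    "messages": [
        {"role": "user", "content": "Explain KV cache compression in one paragraph."}
    ],
}

# Any HTTP client works; authentication is a standard Bearer token header.
body = json.dumps(payload)
print(body)
```

This interchangeability is exactly the deployment-convenience gap the article describes closing: swapping an open-weight model in behind an existing OpenAI-style client is a one-line change to the model field.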

What the benchmarks don't yet say

Independent evaluation results for V4 are not available. The GPQA scores appearing in model-tracking databases are listed identically as 0.9 across nearly every recent release, suggesting placeholder data rather than real evaluation. Practitioners should wait for third-party testing on coding, math, and long-context retrieval before drawing comparisons to GPT-5.5 or Claude Opus 4.7, both released in the same window.

The broader significance of V4 lies in its efficiency ratios rather than raw parameter counts. If a 3% activation rate delivers output quality matching larger dense models, the economics of inference shift for anyone running deployments at scale. Hardware investment forecasts built around dense transformer scaling look different when the marginal cost of a capable open-weight model keeps falling.

Regulatory complexity also enters the picture. The European Union's Artificial Intelligence Act places disclosure and risk-assessment obligations on open-weight models above certain capability thresholds. A model with 1.6 trillion parameters, even with sparse activation, raises genuine questions about which classification criteria apply, and V4 will likely be an early test case for how regulators interpret those boundaries.

Whether DeepSeek's release cadence is sustainable, given its consistent opacity about training costs, may prove a more important question than any single benchmark comparison.

Frequently asked questions

What is the difference between DeepSeek V4-Pro and V4-Flash?
V4-Pro is the flagship with 1.6 trillion total parameters and 49 billion active during inference. V4-Flash is a smaller, lower-cost model with 284 billion total parameters and 13 billion active, trading some output quality for reduced hardware requirements.

What is mixture-of-experts architecture in LLMs?
MoE models contain multiple specialized subnetworks. Instead of running a prompt through the full model, a routing mechanism selects a small subset of subnetworks for each token, keeping active compute far lower than total parameter count suggests.
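A minimal sketch of that routing step, using a generic top-k softmax router (not DeepSeek's undisclosed design):

```python
import math

def top_k_route(router_logits: list[float], k: int = 2) -> list[tuple[int, float]]:
    """Select the k highest-scoring experts for one token and softmax-
    normalize their weights, as a typical MoE routing layer does."""
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:k]
    exps = [math.exp(router_logits[i]) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]

# One token's router scores over 8 experts: only experts 2 and 5 would run.
scores = [0.1, -1.3, 2.0, 0.4, -0.2, 1.1, 0.0, -0.5]
print(top_k_route(scores))  # [(2, ~0.71), (5, ~0.29)]
```

With 8 experts and k=2, only a quarter of the expert weights are touched per token; production MoE models push the same idea to hundreds of experts and much lower activation fractions.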

How does DeepSeek V4 reduce KV cache memory usage?
V4 uses a hybrid attention mechanism that applies two distinct compression strategies to the key-value cache, cutting its inference-time memory footprint by 90% compared to DeepSeek's previous generation of models.

Where can I run DeepSeek V4?
Both V4-Pro and V4-Flash are available through OpenRouter. As open-weight models they can also be self-hosted, though V4-Pro's 1.6 trillion total parameters means hardware requirements will be substantial even with sparse activation.

About the Author

Guilherme A.

Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.
