AI Models Run Twice as Fast With Minimal Accuracy Loss

TL;DR

ServiceNow researchers doubled AI inference throughput by replacing attention layers with state space model layers, keeping reasoning performance largely intact and easing a major enterprise bottleneck.

Large language models power everything from chatbots to coding assistants, but their limited speed has become a major roadblock for real-world applications. ServiceNow researchers have developed a method that makes these models about twice as fast while preserving most of their complex reasoning abilities, which could lower the cost of deploying them at scale.

The key finding is that replacing specific components in transformer-based language models with more efficient state space models (SSMs) can substantially increase processing speed with minimal performance loss. The Apriel-H1-15B-Thinker model achieved over 2× higher throughput when deployed in production environments while maintaining strong performance on reasoning-intensive tasks such as mathematics, coding, and scientific problem-solving.

The researchers used a staged distillation approach, progressively replacing multi-head attention layers with Mamba SSM layers in the Apriel-Nemotron-15B-Thinker model, which served as the teacher. They used two methods to decide which layers to replace: leave-one-out analysis, which evaluates each layer's importance by temporarily removing it, and MMR (MIL-Mamba-Replacement), which measures how replacing a specific layer affects distillation performance. The team replaced layers in order of increasing importance, starting with the least critical ones.
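The layer-selection step can be sketched in a few lines of Python. This is a minimal illustration of leave-one-out ranking under stated assumptions, not the researchers' code: the function names, the `evaluate` callback, the stage sizes, and the toy per-layer contributions are all hypothetical, and a real run would score an actual model on a benchmark instead of the toy function used here.

```python
from typing import Callable, List, Sequence


def leave_one_out_importance(
    num_layers: int, evaluate: Callable[[Sequence[int]], float]
) -> List[float]:
    """Score each layer by how much the evaluation drops when it is skipped."""
    baseline = evaluate([])  # score with no layers skipped
    return [baseline - evaluate([i]) for i in range(num_layers)]


def replacement_order(importance: Sequence[float]) -> List[int]:
    """Layer indices sorted from least to most important."""
    return sorted(range(len(importance)), key=lambda i: importance[i])


def staged_plan(order: Sequence[int], stage_sizes: Sequence[int]) -> List[List[int]]:
    """For each stage, the cumulative set of layers swapped for SSM layers."""
    return [sorted(order[:size]) for size in stage_sizes]


# Hypothetical stand-in for a benchmark: each layer contributes a fixed
# amount to the score, and skipping a layer removes its contribution.
toy_contribution = [0.05, 0.01, 0.20, 0.02]


def toy_evaluate(skipped: Sequence[int]) -> float:
    return 1.0 - sum(toy_contribution[i] for i in skipped)


importance = leave_one_out_importance(len(toy_contribution), toy_evaluate)
order = replacement_order(importance)
plan = staged_plan(order, stage_sizes=[1, 3])  # stage sizes are assumptions

print("replacement order:", order)
print("layers replaced per stage:", plan)
```

In this sketch the least important layers are replaced first and each stage builds on the previous one, mirroring the progressive replacement described above. The distillation that would run between stages is omitted.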

Results show a clear trade-off between efficiency and performance. As shown in Figure 4, models with more SSM layers achieved significantly higher throughput: the Apriel-H1-40 model reached 3.4× the throughput of the original transformer baseline. The Apriel-H1-30/50-15B-Thinker-SFT variant, fine-tuned on high-quality reasoning traces, nearly matched the teacher model's average performance while doubling throughput. Figure 3 illustrates that this model maintained strong scores across mathematics benchmarks (MATH-500, GSM8k), coding tasks (MBPP), and scientific reasoning (GPQA), with only minimal degradation.

This result matters because current transformer-based models struggle to scale under high request loads and in long-context reasoning scenarios. The memory and computational constraints of traditional transformers limit practical adoption in multi-user environments and agentic applications where maintaining high throughput is crucial. The new approach enables more cost-effective serving for concurrent users while reducing latency for business-critical interactive applications.
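One source of that memory pressure is the attention key-value cache, which grows linearly with context length and with the number of concurrent requests. The Python sketch below applies the standard cache-size formula to a hypothetical configuration. The layer counts, head counts, and dimensions are illustrative assumptions, not the actual Apriel settings, and the fixed-size state kept by each SSM layer is left out.

```python
def kv_cache_bytes(
    num_attn_layers: int,
    num_kv_heads: int,
    head_dim: int,
    seq_len: int,
    batch_size: int,
    bytes_per_value: int = 2,  # 2 bytes assumes 16-bit cache entries
) -> int:
    """Memory needed to cache keys and values for every attention layer."""
    keys_and_values = 2  # one tensor for keys, one for values
    return (
        keys_and_values
        * num_attn_layers
        * num_kv_heads
        * head_dim
        * seq_len
        * batch_size
        * bytes_per_value
    )


# Hypothetical configuration: 50 attention layers versus a hybrid that
# keeps 20 of them. All numbers below are assumptions for illustration.
full = kv_cache_bytes(50, num_kv_heads=8, head_dim=128, seq_len=1000, batch_size=1)
hybrid = kv_cache_bytes(20, num_kv_heads=8, head_dim=128, seq_len=1000, batch_size=1)

print(f"all-attention cache: {full / 1e6:.1f} MB per sequence")
print(f"hybrid cache:        {hybrid / 1e6:.1f} MB per sequence")
```

Because the cache scales with the number of attention layers, every layer swapped for an SSM layer removes its share of this cost. That is one reason a hybrid model can serve more concurrent or longer requests on the same hardware.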

The method does have limitations. The researchers note that effective knowledge transfer requires substantially more tokens than typical base-model distillation, which suggests that transferring multi-step reasoning behaviors demands more training data. The hybrid approach provides a stable pathway toward efficiency improvements, but it is not the only option: from-scratch pretraining of SSM-Transformer models, as exemplified by Nemotron-Nano-9B-v2, can push the performance frontier further. That route requires orders of magnitude more compute and data, however, which makes it riskier than the established teacher-student distillation approach.

About the Author

Guilherme A.

Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.
