AIResearch

AI Models Get Twice as Fast Without Losing Smarts

ServiceNow researchers replace key components in language models to double processing speed while maintaining reasoning capabilities, addressing critical bottlenecks in enterprise AI deployment.

AI Research
November 05, 2025

Large language models power everything from chatbots to coding assistants, but slow inference has become a major roadblock for real-world applications. ServiceNow researchers have developed a method that roughly doubles the throughput of these models while preserving their complex reasoning abilities, which could change how businesses deploy artificial intelligence.

The key finding is that replacing selected components in transformer-based language models with more efficient state space models (SSMs) can substantially increase processing speed with minimal performance loss. The Apriel-H1-15B-Thinker model achieved over 2× higher throughput when deployed in production environments while maintaining strong performance on reasoning-intensive tasks such as mathematics, coding, and scientific problem-solving.

Researchers used a staged distillation approach, progressively replacing multi-head attention layers in the Apriel-Nemotron-15B-Thinker model with Mamba SSM layers. They employed two methods to determine which layers to replace: leave-one-out analysis, which evaluates each layer's importance by temporarily removing it, and MIL-Mamba-Replacement (MMR), which measures how replacing specific layers affects distillation performance. The team replaced layers in order of increasing importance, starting with the least critical components.
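The leave-one-out ranking described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the researchers' code: `toy_evaluate`, the per-layer contribution numbers, and the stage sizes are hypothetical stand-ins for a real benchmark run and a real distillation schedule. It covers only the ranking and the staged ordering, not MMR or the distillation itself.

```python
from typing import Callable, List, Sequence


def leave_one_out_importance(
    num_layers: int,
    evaluate: Callable[[Sequence[int]], float],
) -> List[float]:
    """Importance of layer i = drop in score when layer i alone is skipped."""
    all_layers = list(range(num_layers))
    baseline = evaluate(all_layers)
    importance = []
    for i in all_layers:
        kept = [j for j in all_layers if j != i]
        importance.append(baseline - evaluate(kept))
    return importance


def replacement_order(importance: Sequence[float]) -> List[int]:
    """Layer indices sorted from least to most important."""
    return sorted(range(len(importance)), key=lambda i: importance[i])


def staged_plan(order: Sequence[int], stage_sizes: Sequence[int]) -> List[List[int]]:
    """Cumulative set of replaced layers after each stage."""
    plan, replaced = [], 0
    for size in stage_sizes:
        replaced += size
        plan.append(sorted(order[:replaced]))
    return plan


# Hypothetical per-layer contributions (in benchmark points) for a 5-layer toy model.
TOY_CONTRIBUTION = [30, 5, 20, 1, 10]


def toy_evaluate(active_layers: Sequence[int]) -> float:
    """Stand-in for a benchmark: the score is the sum of active layers' contributions."""
    return float(sum(TOY_CONTRIBUTION[i] for i in active_layers))


importance = leave_one_out_importance(len(TOY_CONTRIBUTION), toy_evaluate)
order = replacement_order(importance)
plan = staged_plan(order, stage_sizes=[2, 1])  # hypothetical schedule: 2 layers, then 1 more
print(order)
print(plan)
```

In a real setting each `evaluate` call would be a full benchmark pass over the model with one layer bypassed, so the ranking costs one evaluation per layer plus the baseline.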

Results show a clear trade-off between efficiency and performance. As shown in Figure 4 of the paper, models with more SSM layers achieved significantly higher throughput: the Apriel-H1-40 model reached 3.4× the speed of the original transformer baseline. The Apriel-H1-30/50-15B-Thinker-SFT variant, fine-tuned on high-quality reasoning traces, nearly matched the teacher model's average performance while doubling throughput. Figure 3 of the paper shows that this model maintained strong scores across mathematics benchmarks (MATH-500, GSM8k), coding tasks (MBPP), and scientific reasoning (GPQA) with only minimal degradation.

This matters because current AI models struggle to scale under high request loads and in long-context reasoning scenarios. The memory and computational constraints of traditional transformers limit practical adoption in multi-user environments and agentic applications, where maintaining high throughput is crucial. The new approach enables more cost-effective serving for concurrent users while reducing latency for business-critical interactive applications.
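One well-known source of this memory pressure is the key-value (KV) cache that each attention layer keeps for every token in the context, whereas an SSM layer carries a fixed-size state regardless of context length. The article does not spell this out, so the back-of-envelope sketch below is offered as an assumption about the mechanism, and every dimension in it (head count, head size, layer counts, batch size) is hypothetical rather than taken from the Apriel models.

```python
def kv_cache_bytes(attn_layers: int, kv_heads: int, head_dim: int,
                   seq_len: int, batch: int, bytes_per_elem: int = 2) -> int:
    """Bytes needed to cache keys and values for every token seen so far."""
    # Factor 2: one tensor for keys and one for values in each attention layer.
    return 2 * attn_layers * kv_heads * head_dim * seq_len * batch * bytes_per_elem


# Hypothetical dimensions, chosen only for illustration.
KV_HEADS, HEAD_DIM = 8, 128

for attn_layers in (50, 20):  # all-attention model vs. a hybrid keeping 20 attention layers
    for seq_len in (4_096, 32_768):
        gib = kv_cache_bytes(attn_layers, KV_HEADS, HEAD_DIM, seq_len, batch=16) / 2**30
        print(f"{attn_layers} attention layers, {seq_len:>6} tokens, batch 16: {gib:.1f} GiB")
```

Because the cache grows linearly with both context length and the number of attention layers, each attention layer swapped for an SSM layer removes its share of that growth, which leaves more memory for serving concurrent requests.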

The method does have limitations. The researchers note that effective knowledge transfer requires substantially more tokens than typical base-model distillation, suggesting that transferring multi-step reasoning behaviors demands greater exposure. Additionally, while the hybrid approach provides a stable pathway toward efficiency improvements, from-scratch pretraining of SSM-Transformer models (as exemplified by Nemotron-Nano-9B-v2) can push performance frontiers further. That route requires orders of magnitude more compute and data, however, making it riskier than the established teacher-student distillation approach.

Original Source: the complete research paper is available on arXiv.

About the Author

Guilherme A.

Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.