AI Queries Slash Response Times by 20x

A new method transforms AI server efficiency, cutting delays in complex tasks like data analysis and chatbots without compromising accuracy.

AI Research
November 05, 2025

Artificial intelligence systems increasingly handle complex tasks beyond simple chat, such as data retrieval and multi-step reasoning, but the servers behind them struggle to keep up, leaving users with frustrating delays. Researchers have introduced an approach called span queries that changes how these systems process their inputs, achieving up to 20 times faster response times in non-chat applications. The work targets a critical bottleneck in AI deployment and makes real-time interaction more feasible for everything from customer service to scientific analysis.

The key finding is that span queries let AI servers optimize diverse workloads by structuring inputs as expression trees, annotated with constraints on whether the order of the data matters. In chat, input order is fixed, but in retrieval-augmented generation (RAG) or nested generation tasks, inputs can often be reordered without affecting the outcome. The system exploits this commutativity, where A followed by B yields the same result as B followed by A, to reorganize queries automatically and reduce redundant computation. In benchmarks, this cut time-to-first-token (TTFT) by 10-20x, far exceeding the 3-4x gains from prior methods like CacheBlend.
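To make the commutativity idea concrete, here is a minimal sketch of an expression tree with ordered and order-free nodes. The names `Text`, `Seq`, `Bag`, and `canonical` are illustrative assumptions, not the paper's API, and sorting children is only one simple way to pick a canonical order; the paper's optimizer may work differently.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class Text:
    """A leaf holding raw input text."""
    content: str


@dataclass(frozen=True)
class Seq:
    """Children whose order matters (for example, chat turns)."""
    children: Tuple["Node", ...]


@dataclass(frozen=True)
class Bag:
    """Children that commute (for example, retrieved RAG documents)."""
    children: Tuple["Node", ...]


Node = Union[Text, Seq, Bag]


def canonical(node: Node) -> Node:
    """Rewrite a tree so that commutative children appear in one fixed order.

    Hypothetical helper: two queries that differ only in the order of
    commutative inputs map to the same tree, so a server could reuse the
    same cached work for both.
    """
    if isinstance(node, Text):
        return node
    kids = tuple(canonical(child) for child in node.children)
    if isinstance(node, Bag):
        # Any deterministic order works; sorting by repr is a simple choice.
        kids = tuple(sorted(kids, key=repr))
    return type(node)(kids)


docs_ab = Bag((Text("doc A"), Text("doc B")))
docs_ba = Bag((Text("doc B"), Text("doc A")))
query_1 = Seq((Text("system prompt"), docs_ab, Text("question")))
query_2 = Seq((Text("system prompt"), docs_ba, Text("question")))
```

Under these assumptions, `query_1` and `query_2` are different trees as written, but `canonical(query_1) == canonical(query_2)`, while the order of children inside a `Seq` is left untouched.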

On the method side, the team designed span queries as a declarative intermediate representation that captures the structure of AI tasks such as RAG and agentic workflows. They modified the popular vLLM inference server with only 492 changed lines of code, concentrated in components like the scheduler and GPU runner layers. The approach includes an optimizer that rewrites query trees to exploit commutativity, a tokenizer that encodes queries into sequences with special tokens marking boundaries, and algorithms like CIDRA for efficiently repositioning cached data. Together, these minimize the prefill load on the GPU and avoid the 'dual output paradox,' where servers cache different data than what clients receive.
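The tokenizer step can be pictured with a small sketch that flattens a query tree into one token sequence and wraps each order-free input in boundary markers. Everything here is an assumption made for illustration: the marker strings `<|span|>` and `<|/span|>`, the `("seq", [...])` / `("bag", [...])` tree encoding, and the whitespace split that stands in for a real tokenizer. The paper's actual special tokens are not given in this article.

```python
from typing import List, Tuple, Union

# A tree is either raw text or a (kind, children) pair, kind in {"seq", "bag"}.
Tree = Union[str, Tuple[str, list]]

SPAN_OPEN = "<|span|>"    # hypothetical boundary token
SPAN_CLOSE = "<|/span|>"  # hypothetical boundary token


def encode(tree: Tree) -> List[str]:
    """Flatten a query tree into tokens, marking each commutative child."""
    if isinstance(tree, str):
        # Whitespace split is a stand-in for a real model tokenizer.
        return tree.split()
    kind, children = tree
    tokens: List[str] = []
    for child in children:
        child_tokens = encode(child)
        if kind == "bag":
            # Boundaries tell the server where an independent span starts and ends.
            tokens += [SPAN_OPEN, *child_tokens, SPAN_CLOSE]
        else:
            tokens += child_tokens
    return tokens


rag_query = ("seq", ["system prompt", ("bag", ["doc one", "doc two"]), "question"])
encoded = encode(rag_query)
```

In this sketch, each retrieved document ends up between its own pair of markers, which is the kind of boundary information a server would need in order to cache and reposition spans independently.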

The paper reports substantial performance improvements. In RAG tests, span queries reduced TTFT growth from quadratic to near-linear, even on cache misses, and achieved up to 20x speedups on cache hits. For nested generation, such as judge-generator workflows, speedups ranged from 7% to 13x depending on parameters like temperature and fan-out. Figure 13b shows that in RAG with 32 documents, TTFT dropped sharply with span queries, while stock vLLM slowed dramatically. Attention locality also improved: in 'lost-in-the-middle' tests, where models typically struggle with information buried in the middle of long sequences, a 2-billion-parameter model using optimized queries was more accurate than an 8-billion-parameter model.
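One way to see why growth can shift from quadratic to near-linear is a toy cost model, offered here as an assumption rather than the paper's own analysis. It counts only causal attention pairs during prefill and ignores everything else, including the final question that must still attend across the documents. The function names are hypothetical.

```python
def causal_pairs(tokens: int) -> int:
    """Number of (query, key) pairs under causal attention over `tokens` tokens."""
    return tokens * (tokens + 1) // 2


def joint_cost(num_docs: int, doc_len: int) -> int:
    """Toy cost when all documents are prefilled as one long sequence."""
    return causal_pairs(num_docs * doc_len)


def independent_cost(num_docs: int, doc_len: int) -> int:
    """Toy cost when each document is prefilled on its own."""
    return num_docs * causal_pairs(doc_len)


# Doubling the number of documents in this model:
joint_growth = joint_cost(64, 1000) / joint_cost(32, 1000)
independent_growth = independent_cost(64, 1000) / independent_cost(32, 1000)
```

In this simplified model, doubling the document count roughly quadruples the joint cost but only doubles the independent cost. Real TTFT also depends on non-attention compute, memory traffic, and scheduling, so the measured figures in the paper should not be read off this sketch.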

This matters because it makes AI more practical in everyday applications, from faster chatbots to more efficient data analysis tools, without requiring expensive hardware upgrades. Because it generalizes across chat, RAG, and agentic tasks, the method supports emerging uses like real-time decision-making in healthcare or finance, where speed and accuracy are crucial. It also speaks to environmental concerns by reducing computational waste, in line with broader goals of sustainable AI development.

Limitations noted in the paper include the need for fine-tuning when queries use novel tokens, which could affect model accuracy, and the overhead of the repositioning algorithms in highly concurrent scenarios. The approach does not fully resolve issues with partially filled cache blocks, and stability can decrease at the 99th percentile of fan-out in nested generation. Future work could explore integrating gather/scatter operations for further optimizations; the current implementation relies on the existing vLLM framework with minimal changes.

Original Source

The complete research paper is available on arXiv.

About the Author

Guilherme A.

Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.
