Large language models built with Mixture-of-Experts (MoE) architectures have become a cornerstone of modern AI, powering everything from chatbots to image generators. These models work by routing different parts of an input, like words in a sentence, to specialized sub-networks called 'experts.' However, this routing creates a critical bottleneck during deployment: a few popular experts become overloaded with requests, while others sit idle, slowing down the entire system. This load imbalance forces expensive hardware to wait, driving up operational costs and limiting how quickly these powerful models can respond. A new framework called CRAFT tackles this problem head-on by making expert replication, a common technique to spread the load, far more efficient, delivering significant speedups without the heavy memory toll of existing approaches.
Researchers discovered that the standard approach to load balancing, known as the Expert Parallelism Load Balancer (EPLB), often allocates replicas wastefully. EPLB creates one replica for every expert layer on each graphics processing unit (GPU), which consumes substantial memory. The study found that this uniform replication leads to diminishing returns; many of these replicas provide little to no improvement in balancing the workload. In fact, the data shows that as the number of replicas increases, the gains in load 'balancedness'—a metric where higher values mean more even distribution—rapidly plateau. For example, in tests with the 1-trillion-parameter Kimi-K2 model on 64 GPUs, doubling replicas beyond 16 offered negligible balancedness gains but doubled memory usage. This inefficiency is costly because GPU memory is also needed to store the model's 'KV cache,' which is essential for processing long conversations or documents efficiently.
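The diminishing returns described above can be illustrated with a small thought experiment. The sketch below assumes a simple definition of balancedness (mean load divided by max load, so 1.0 is perfectly even) and a greedy policy that always gives the next replica to the currently hottest expert; the paper's actual metric and replication mechanics may differ, and the numbers are purely illustrative.

```python
# Toy illustration of diminishing returns from replication.
# Assumption: "balancedness" = mean load / max load (1.0 = perfectly even);
# this is a stand-in for the paper's metric, not its exact definition.

def balancedness(loads):
    return sum(loads) / (len(loads) * max(loads))

def replicate_hottest(loads, num_replicas):
    """Greedily give each new replica to the currently hottest expert,
    splitting that expert's load evenly across its copies. Returns one
    load value per physical copy."""
    copies = [1] * len(loads)
    for _ in range(num_replicas):
        hot = max(range(len(loads)), key=lambda i: loads[i] / copies[i])
        copies[hot] += 1
    per_copy = []
    for i, load in enumerate(loads):
        per_copy.extend([load / copies[i]] * copies[i])
    return per_copy

# A skewed layer: one expert receives far more than the average load.
skewed = [270] + [10] * 31
for r in (0, 4, 8, 16, 32):
    print(r, round(balancedness(replicate_hottest(skewed, r)), 3))
```

Running this shows balancedness rising quickly for the first few replicas and then flattening: each additional replica buys less and less, which is exactly the effect that makes uniform per-layer replication wasteful.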
The key innovation of CRAFT is its fine-grained, benefit-driven approach. Instead of applying replication uniformly, it analyzes the load distribution for each expert layer offline and estimates how much each layer would benefit from having additional copies. The framework categorizes layers into two types: 'high-skew' layers, where a few experts dominate the token load, and 'low-skew' layers, where the load is already relatively balanced. For instance, in the DeepSeek-R1 model, layer 51 was a high-skew layer where the hottest expert received over 27 times the average load, making it a prime candidate for replication. In contrast, layer 20 had a balanced distribution and gained little from extra copies. CRAFT uses a cost model to estimate this per-layer 'replication benefit' and then employs dynamic programming to allocate a limited budget of replicas to the layers that will benefit the most, maximizing overall load balance under strict memory constraints.
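The budget-allocation step described above can be sketched as a small dynamic program. In the sketch below, the per-layer benefit table, the function name, and the benefit values are all hypothetical illustrations; the paper's actual cost model for estimating replication benefit is not reproduced here.

```python
# Sketch of benefit-driven replica allocation via dynamic programming.
# Assumption: benefit[l][k] is a precomputed (here: made-up) estimate of
# the balancedness gain from giving layer l exactly k extra replicas.

def allocate_replicas(benefit, budget):
    """Return (total benefit, per-layer replica counts) maximizing
    total benefit subject to sum(counts) <= budget."""
    # best[b] = (benefit, allocation) over layers seen so far, using
    # at most b replicas.
    best = [(0.0, [])] * (budget + 1)
    for layer in benefit:
        new = []
        for b in range(budget + 1):
            cands = []
            for k in range(min(b, len(layer) - 1) + 1):
                prev_val, prev_alloc = best[b - k]
                cands.append((prev_val + layer[k], prev_alloc + [k]))
            new.append(max(cands, key=lambda t: t[0]))
        best = new
    return best[budget]

# Two high-skew layers gain a lot from their first replicas;
# one low-skew layer barely gains at all.
benefit = [
    [0.0, 0.30, 0.45, 0.50],   # high-skew layer
    [0.0, 0.25, 0.40, 0.48],   # high-skew layer
    [0.0, 0.02, 0.03, 0.03],   # low-skew layer
]
total, counts = allocate_replicas(benefit, 4)
print(counts, round(total, 2))   # → [2, 2, 0] 0.85
```

Note how the low-skew layer correctly receives zero replicas: the budget flows to the layers where copies actually help, which is the core idea behind CRAFT's fine-grained allocation.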
Evaluations demonstrate CRAFT's effectiveness across multiple large models and datasets. When integrated into the SGLang serving framework and tested on an 8-node cluster with NVIDIA A100 GPUs, CRAFT increased end-to-end inference throughput by an average of 1.14 times compared to EPLB, with gains reaching up to 1.2 times. In specific configurations, such as with the Kimi-K2 model and German text data, CRAFT achieved a 1.17 times higher throughput. The framework also maintained strong performance across different cluster sizes. When scaling from 6 to 12 nodes, CRAFT's goodput—the maximum sustained throughput before delays occur—scaled by 1.6 times on average, outperforming the baseline. Importantly, CRAFT achieved these speedups while using significantly fewer replicas. For the DeepSeek-R1 and Kimi-K2 models, it allocated 7.25 and 7.5 times fewer replicas than EPLB, respectively, preserving most of the balancedness gains and allowing for a larger KV cache to handle more concurrent requests.
The implications of this research are substantial for the cost and accessibility of large AI model deployment. By optimizing replica allocation, CRAFT reduces the GPU memory footprint required for efficient inference, which can lower operational expenses in data centers. CRAFT is designed to integrate seamlessly into existing serving frameworks like SGLang, TensorRT-LLM, and vLLM without requiring any changes to the AI models themselves or additional training. This makes it a practical upgrade for companies running MoE-based models at scale. Furthermore, the framework's robustness across datasets with varying skewness—from highly imbalanced German and Japanese text to more balanced academic writing—shows its broad applicability. As AI models continue to grow in size and complexity, techniques like CRAFT that improve hardware utilization without sacrificing performance will be crucial for sustainable scaling.
Despite its advantages, the study acknowledges certain limitations. CRAFT's performance relies on an offline profiling phase to estimate expert load distributions, which assumes these patterns are relatively stable. The paper notes that the framework could be extended with online periodic rebalancing to adapt to changing workloads, though this was not evaluated in depth. Additionally, while CRAFT significantly reduces memory overhead compared to uniform replication, it still incurs some memory cost for the replicas it does allocate, which must be balanced against the available KV cache. The research also highlights that expert placement strategies—which arrange experts across devices to balance load—remain important and can be combined with CRAFT for further optimization. However, placement alone cannot handle extreme load skew, underscoring the need for intelligent replication as provided by CRAFT. Future work could explore integrating CRAFT with other load-balancing techniques, such as expert sharding or routing prediction, to push efficiency even further.
About the Author
Guilherme A.
Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.