
AI Reranking Breakthrough Cuts Latency, Boosts Engagement

A new AI framework predicts what videos you want to watch in roughly half the time of comparable methods, improving streaming and e-commerce recommendations. The speed boost maintains accuracy while serving content more efficiently to hundreds of millions of users daily.

AI Research
November 14, 2025
3 min read

In the competitive world of online recommendations, every millisecond counts. A new AI framework called GReF has achieved what many thought impossible: significantly improving recommendation quality while maintaining near-instantaneous response times. This breakthrough addresses a fundamental tension in modern recommender systems—the trade-off between accuracy and speed—and has already been deployed in a major video app serving 300 million daily active users, delivering measurable improvements in user engagement.

The researchers developed GReF (Generative Reranking Framework) to overcome key limitations in current recommendation systems. Traditional approaches either use one-stage methods that struggle with accuracy or two-stage methods that suffer from slow inference times. GReF introduces a unified generative approach that eliminates the separation between candidate generation and evaluation, bridging what the paper identifies as the "gap between generator and evaluator." This integration allows the system to maintain the expressiveness of complex models while achieving practical deployment speeds.

At the core of GReF is the Gen-Reranker model, which uses a transformer-based architecture with an encoder that processes candidate items and a decoder that generates recommendations in an autoregressive manner. The innovation lies in how the system handles the recommendation process. Instead of treating each item selection as a separate step, GReF employs Ordered Multi-Token Prediction (OMTP), allowing it to predict multiple future items simultaneously while preserving their sequential order. This approach reduces the number of decoding passes needed, cutting inference latency by approximately half compared to traditional autoregressive methods.
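The decoding idea behind OMTP can be illustrated with a toy greedy slate generator. This is a minimal sketch, not the paper's model: `score_fn` stands in for the Gen-Reranker decoder's scoring of a candidate given the items already placed, and `heads` is the number of items emitted per pass (an assumed parameter name for illustration).

```python
from typing import Callable, List, Tuple

def omtp_decode(
    candidates: List[str],
    score_fn: Callable[[List[str], str], float],
    slate_size: int,
    heads: int = 2,
) -> Tuple[List[str], int]:
    """Toy greedy slate generation in the spirit of OMTP.

    A standard autoregressive decoder runs one forward pass per
    selected item. Here each pass instead emits up to `heads` items
    at once, in ranked order, so the pass count is roughly halved
    for heads=2. `score_fn(prefix, candidate)` is illustrative,
    not the actual Gen-Reranker decoder.
    """
    slate: List[str] = []
    pool = list(candidates)
    passes = 0
    while len(slate) < slate_size and pool:
        passes += 1  # one decoder forward pass
        ranked = sorted(pool, key=lambda c: score_fn(slate, c), reverse=True)
        take = ranked[: min(heads, slate_size - len(slate))]
        slate.extend(take)  # several items placed, sequential order preserved
        pool = [c for c in pool if c not in take]
    return slate, passes
```

With a slate of four items and `heads=2`, the loop runs twice instead of four times, which mirrors the roughly halved latency the paper reports.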

The training process involves two key stages. First, the model undergoes pre-training on large-scale unlabeled exposure data from existing recommender systems, similar to how large language models are trained on internet text. This provides high-quality initialization and captures broader user interest patterns. Second, the system uses Rerank-DPO (Direct Preference Optimization), which constructs pairwise comparisons between user-preferred and less-preferred sequences to integrate explicit user feedback without requiring additional evaluation models.
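The pairwise objective in the second stage follows the standard DPO formulation. The sketch below is a hedged illustration of that loss for a pair of reranked sequences; the function and argument names are assumptions, and the log-probabilities would come from the policy model and a frozen reference (e.g. the pre-trained Gen-Reranker).

```python
import math

def rerank_dpo_loss(
    logp_pref: float,       # policy log-prob of the user-preferred sequence
    logp_less: float,       # policy log-prob of the less-preferred sequence
    ref_logp_pref: float,   # reference-model log-prob of the preferred sequence
    ref_logp_less: float,   # reference-model log-prob of the less-preferred one
    beta: float = 0.1,
) -> float:
    """Standard DPO loss applied to a preference pair of slates.

    The loss pushes the policy to assign a larger log-prob margin
    (relative to the frozen reference) to the sequence users
    preferred, with no separate evaluator model in the loop.
    """
    margin = (logp_pref - ref_logp_pref) - (logp_less - ref_logp_less)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))  # -log sigmoid
```

When the policy matches the reference on both sequences the margin is zero and the loss is log 2; as the policy favors the preferred sequence more strongly, the loss decreases.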

Experimental results demonstrate GReF's superiority across multiple metrics. On the Avito dataset, GReF achieved approximately 1.5% higher AUC and 0.8% higher NDCG compared to the next best baseline method. On the Kuaishou dataset, it showed 1.4% higher AUC and 0.5% higher NDCG. More importantly, inference time measurements showed GReF operating at 12.97 milliseconds, comparable to non-autoregressive methods while maintaining significantly higher accuracy. Without OMTP, the same model requires 24.29 milliseconds, nearly double, highlighting the efficiency gains from the multi-token prediction approach.
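For readers unfamiliar with the ranking metric reported above, NDCG@k measures how well a ranked list places its most relevant items near the top. A minimal reference implementation (illustrative only, not the authors' evaluation code):

```python
import math
from typing import List

def ndcg_at_k(relevances: List[float], k: int) -> float:
    """NDCG@k: discounted cumulative gain of the ranked list,
    normalized by the gain of the ideal (relevance-sorted) order.
    A perfect ranking scores 1.0; misplacing relevant items lowers it.
    """
    def dcg(rels: List[float]) -> float:
        # Each position i is discounted by log2(i + 2): rank 1 -> log2(2).
        return sum(r / math.log2(i + 2) for i, r in enumerate(rels[:k]))

    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0
```

A list ranked in descending relevance scores exactly 1.0, while reversing it drops the score, which is why even sub-1% NDCG gains at Kuaishou's scale translate into measurable engagement differences.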

The real-world impact became evident when GReF was deployed in Kuaishou's production environment. Online A/B testing over one week showed substantial improvements across all engagement metrics: views increased by 0.33%, long views (indicating prolonged engagement) rose by 0.42%, likes improved by 1.19%, forwarding (sharing) surged by 2.98%, and comments increased by 1.78%. These results suggest that better recommendation quality not only improves content visibility but also fosters more interactive and participatory user experiences.

Despite these achievements, the paper acknowledges limitations. The framework relies heavily on high-quality pre-training data, and directly applying reinforcement learning-based methods to user feedback data in cold-start scenarios could lead to instability. Additionally, while OMTP significantly improves efficiency, the model still requires careful tuning of multiple hyperparameters across different training stages.

This work represents a significant step toward practical deployment of advanced AI in real-time recommendation systems. By addressing both accuracy and efficiency challenges simultaneously, GReF demonstrates how generative approaches can transform digital experiences while meeting the stringent latency requirements of modern applications. The successful deployment at scale suggests this framework could influence how recommendation systems are designed across various platforms, from e-commerce to content streaming services.

About the Author

Guilherme A.

Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.

Connect on LinkedIn