In the fast-evolving world of computer vision, achieving a unified approach to object recognition has long been a holy grail, with applications spanning from autonomous vehicles to online shopping. Current systems often rely on disjointed modules that handle detection, categorization, and attribute analysis separately, leading to inefficiencies and inaccuracies, especially in complex scenarios like e-commerce where products vary widely in appearance. This fragmentation has spurred researchers to seek integrated solutions that can seamlessly reason across hierarchical levels, from broad categories to specific details like color and material. The recent push towards generative AI offers a promising path forward, blending visual perception with language-based generation to create more coherent and detailed understandings of images. Now, a team from Kuaishou Technology has unveiled a groundbreaking framework that marries object detection with hierarchical generative modeling, setting new benchmarks in accuracy and scalability for real-world applications.
At the heart of this innovation is UniDGF (Unified Detection-to-Generation Framework), which fundamentally rethinks how machines interpret visual data by combining a YOLO-based object detector with a BART-inspired generative model. The process begins with the detector identifying and localizing all objects in an image via bounding boxes, after which ROI Align extracts refined region-specific features from a global visual backbone. These features are fed into a Q-Former module that uses learnable queries to produce a fixed-length object representation capturing the essential visual semantics. This representation serves as input to the generative component, which autoregressively outputs a sequence of tokens representing hierarchical categories and attributes, moving, for example, from 'Women's Apparel' to 'Chiffon Blouse' and then to specific properties like 'style: cute'. Crucially, the framework supports property-conditioned attribute recognition, allowing users to specify which attributes to generate, and it employs a compact token vocabulary that is more efficient than generating free-form natural language descriptions. This end-to-end pipeline not only streamlines inference but also enables fine-grained reasoning that previous cascaded systems struggled to achieve, as detailed in the paper's methodology section.
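The key structural idea, compressing a variable-sized ROI feature map into a fixed-length object representation via learnable queries, can be sketched with a toy single-head cross-attention step. This is a minimal NumPy illustration, not the authors' implementation; the 7x7 feature grid and 128 queries follow the ablation settings reported below, while the channel width and random values are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy dimensions: a 7x7 ROI feature map with 256 channels,
# and 128 learnable queries (the configuration the ablations favor).
roi_feats = rng.normal(size=(7 * 7, 256))  # flattened region-specific features
queries = rng.normal(size=(128, 256))      # learnable query embeddings

def cross_attention(q, kv):
    """Single-head scaled dot-product attention: queries attend to ROI features."""
    scores = q @ kv.T / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over ROI positions
    return weights @ kv

# Fixed-length object representation: 128 tokens regardless of region size.
obj_repr = cross_attention(queries, roi_feats)
print(obj_repr.shape)  # (128, 256)
```

Whatever the size of the detected region, the output always has one row per query, which is what lets the downstream generator consume every object through a uniform interface.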
The experimental results demonstrate UniDGF's superior performance across multiple datasets, including MSCOCO, Objects365, and the proprietary Products7417 e-commerce dataset. When evaluated with ground-truth bounding boxes, UniDGF with a BART-based generator achieved category accuracy improvements of up to 14.47% on Objects365 and 13.04% on MSCOCO compared to embedding-retrieval models like CLIP, while maintaining competitive attribute accuracy of around 23-31%. On the challenging Products7417 dataset, which features 7,417 categories and diverse attributes, the model reached 61.39% category accuracy and 31.00% attribute accuracy, significantly outpacing prompt-based multimodal LLMs such as Qwen2.5-VL-3B, which faltered with large vocabularies. In end-to-end tests without ground-truth boxes, UniDGF also boosted detection performance, with mAP scores rising to 56.39% on MSCOCO and 33.67% on Objects365, thanks to more precise category predictions that filtered redundant detections. Ablation studies further validated key design choices, showing that a 7x7 ROI feature size with 128 Q-Former tokens optimized the balance between detail and efficiency, underscoring the framework's robustness in handling large-scale, attribute-rich environments.
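To see how sharper category predictions can filter redundant detections, consider standard category-aware suppression: overlapping boxes assigned the same category are treated as duplicates and only the highest-scoring one survives, while boxes that genuinely disagree on category are kept. The sketch below is an illustrative, pure-Python version of this general technique; the function names, threshold, and example data are assumptions, not code from the paper:

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def filter_duplicates(dets, iou_thr=0.5):
    """Category-aware suppression: drop overlapping boxes with the same category.

    dets is a list of (box, category, score) tuples. More reliable category
    predictions make the same-category test meaningful, so true duplicates
    are removed without discarding distinct overlapping objects.
    """
    kept = []
    for box, cat, score in sorted(dets, key=lambda d: -d[2]):
        if all(c != cat or iou(box, b) < iou_thr for b, c, _ in kept):
            kept.append((box, cat, score))
    return kept

dets = [
    ((0, 0, 10, 10), "blouse", 0.9),
    ((1, 1, 10, 10), "blouse", 0.6),  # overlaps the first with the same label: dropped
    ((0, 0, 10, 10), "jacket", 0.8),  # same region, different category: kept
]
print(len(filter_duplicates(dets)))  # 2
```

The intuition is that a model which confuses categories will either suppress distinct objects or fail to merge duplicates, so more accurate per-box labels translate directly into cleaner detection sets and higher mAP.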
The implications of this research are profound, particularly for industries reliant on precise visual analysis, such as e-commerce, where improved attribute recognition can enhance search functionality, recommendation systems, and inventory management. By enabling coherent coarse-to-fine semantic understanding, UniDGF could reduce the need for manual annotations and support more intuitive human-AI interactions, like allowing shoppers to query specific product features directly from images. Beyond retail, the framework's generative approach paves the way for advancements in robotics and autonomous systems, where real-time object understanding with nuanced attributes is critical for tasks like navigation and manipulation. The authors' decision to release validation data and pre-trained models will likely accelerate adoption and innovation, fostering a community-driven push towards fully unified vision-language systems. As AI continues to permeate daily life, such integrated models represent a significant step toward machines that not only see but comprehend the world in rich, hierarchical detail.
Despite its successes, the UniDGF framework has limitations, as noted by the researchers. The dependency on high-quality bounding box predictions means that detection errors can propagate through the generative stages, potentially affecting overall accuracy in noisy or cluttered scenes. Additionally, the model's performance on attribute recognition, while improved, still lags behind category prediction in some cases, especially with decoder-only variants like Pythia, suggesting that bidirectional encoding may be necessary for complex attribute reasoning. The current approach also requires extensive computational resources, such as training on 8 NVIDIA H800 GPUs, which could limit accessibility for smaller organizations. Looking ahead, the team plans to extend the generative paradigm to the detection phase itself, aiming for a fully end-to-end system that unifies localization and recognition without intermediate steps. This future direction could address existing bottlenecks and further enhance the framework's applicability to dynamic, real-world environments, pushing the boundaries of what's possible in AI-driven visual understanding.
Original Source
Read the complete research paper
About the Author
Guilherme A.