
AI Rewrites Vague Queries to Find Images Accurately

A new method transforms ambiguous conversational requests into clear search terms, boosting image retrieval accuracy by up to 15% and enabling AI to understand complex multi-turn dialogues.

AI Research
April 01, 2026
3 min read

When you ask a friend for a photo from yesterday's soccer match, you might say, 'Send me that scene on a cloudy day.' To a human, this makes perfect sense, but to an AI image search system, it's confusing gibberish. This common conversational pattern has long stumped artificial intelligence systems, which struggle to interpret the vague references and incomplete expressions that characterize natural human dialogue. Now, researchers have developed a solution that could transform how AI handles these complex interactions: conversational query rewriting, a technique that turns ambiguous requests into clear, searchable terms.

Researchers from Soochow University have demonstrated that rewriting conversational queries can dramatically improve image retrieval accuracy. In their experiments, when AI models transformed vague user requests into self-contained queries, retrieval performance improved across all tested models. On their newly created ReCQR dataset, the best-performing models achieved Recall@1 scores of up to 19.6% on text-only dialogues and 13.2% on multimodal dialogues, compared to just 3.6% and 3.2% respectively when using the original ambiguous queries. These gains show that query rewriting effectively bridges the gap between how humans naturally communicate and what AI systems need to perform accurate searches.
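
To make the difference tangible, here is a minimal sketch of how a rewritten, self-contained query helps a standard text-to-image retriever. It scores both query versions against a small candidate pool with an off-the-shelf CLIP model; the checkpoint name, image file names, and example queries are illustrative assumptions, not the paper's exact setup.

```python
# Hedged sketch: compare an ambiguous conversational query with its rewritten,
# self-contained version using a generic CLIP retriever. Checkpoint, image paths,
# and queries below are illustrative, not taken from the paper.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Hypothetical candidate pool of images to search over.
images = [Image.open(p) for p in ["img_001.jpg", "img_002.jpg", "img_003.jpg"]]

original_query = "Send me that scene on a cloudy day."               # ambiguous, context-dependent
rewritten_query = "A soccer match being played under a cloudy sky."  # self-contained rewrite

for name, query in [("original", original_query), ("rewritten", rewritten_query)]:
    inputs = processor(text=[query], images=images, return_tensors="pt", padding=True)
    with torch.no_grad():
        # logits_per_text has shape (1, num_images); higher means a better match.
        logits = model(**inputs).logits_per_text
    best = logits.argmax(dim=-1).item()
    print(f"{name:9s} query -> top-1 image index {best}, scores {logits.squeeze().tolist()}")
```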

The researchers developed a sophisticated two-stage pipeline to create training data for this task. First, they generated text-only dialogues where users make references that require understanding previous conversation turns. They used large language models to create realistic multi-turn conversations based on images from the MSCOCO dataset, then deliberately made the final queries ambiguous by removing information that could be inferred from the dialogue history. This simulated how people naturally speak in conversations, using pronouns and references rather than fully specified descriptions. Second, they created more complex multimodal dialogues where users reference multiple images, requiring AI to understand both textual conversation history and visual content from previously shared photos.
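
The snippet below sketches that two-stage idea in code, assuming a generic chat-completion API: stage one turns an MSCOCO-style caption into a multi-turn dialogue that ends in a fully specified image request, and stage two strips the details the dialogue already implies to produce the ambiguous final query. The model name, prompt wording, and helper functions are hypothetical stand-ins, not the authors' actual pipeline.

```python
# Hedged sketch of the two-stage data-construction idea described above, not the
# authors' exact prompts or models.
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-4o-mini"  # placeholder; the paper does not prescribe this model

def generate_dialogue(caption: str) -> str:
    """Stage 1: build a realistic multi-turn conversation grounded in one image caption."""
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[{
            "role": "user",
            "content": (
                "Write a short multi-turn dialogue between two friends about this scene: "
                f"'{caption}'. End with one speaker asking to see a photo of the scene, "
                "described explicitly and completely."
            ),
        }],
    )
    return resp.choices[0].message.content

def make_query_ambiguous(dialogue: str) -> str:
    """Stage 2: rewrite the final request so it only makes sense given the dialogue history."""
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[{
            "role": "user",
            "content": (
                "Here is a dialogue:\n" + dialogue +
                "\nRewrite ONLY the final photo request so it omits details already implied "
                "by the conversation, using pronouns and vague references instead."
            ),
        }],
    )
    return resp.choices[0].message.content

caption = "Two teams playing soccer on a grass field under an overcast sky."  # e.g. an MSCOCO caption
dialogue = generate_dialogue(caption)
ambiguous_query = make_query_ambiguous(dialogue)
print(ambiguous_query)  # training target: recover the explicit query from the dialogue context
```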

Experimental results reported in Table 3 of the paper reveal several important patterns. The performance gap between original and rewritten queries is substantial: original queries achieved only 3.6% Recall@1 on text-only data and 3.2% on multimodal data, while the best rewritten queries reached 19.6% and 13.2% respectively, confirming that query rewriting addresses a fundamental limitation in current systems. The researchers also found that fine-tuning models on their ReCQR dataset yielded significant improvements over zero-shot performance: for example, Qwen2.5-VL-7B-Instruct improved from 13.6% to 19.2% Recall@1 after text-only fine-tuning. Interestingly, multimodal dialogues proved significantly more challenging than text-only ones, with all models performing worse on the multimodal dataset, highlighting the added complexity of integrating visual information.
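
For readers unfamiliar with the metric, Recall@1 is the fraction of queries whose ground-truth image is ranked first by the retriever (Recall@K generalizes this to the top K). The short helper below, using a toy similarity matrix rather than the paper's data, shows how such numbers are computed:

```python
# Minimal Recall@K helper of the kind behind the numbers above, assuming each
# query's ground-truth image shares its row index (a common evaluation convention;
# not taken verbatim from the paper). sims[i, j] is the similarity between query i
# and candidate image j.
import numpy as np

def recall_at_k(sims: np.ndarray, k: int = 1) -> float:
    """Fraction of queries whose ground-truth image ranks in the top k."""
    ranked = np.argsort(-sims, axis=1)           # images sorted from most to least similar
    targets = np.arange(sims.shape[0])           # ground-truth image index for query i is i
    hits = (ranked[:, :k] == targets[:, None]).any(axis=1)
    return float(hits.mean())

# Toy example: 4 rewritten queries scored against 4 candidate images.
sims = np.array([
    [0.9, 0.1, 0.2, 0.3],   # ground truth (index 0) ranked 1st
    [0.2, 0.4, 0.8, 0.1],   # ground truth (index 1) ranked 2nd
    [0.1, 0.2, 0.7, 0.3],   # ground truth (index 2) ranked 1st
    [0.3, 0.5, 0.4, 0.6],   # ground truth (index 3) ranked 1st
])
print(recall_at_k(sims, k=1))  # 0.75
print(recall_at_k(sims, k=2))  # 1.0
```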

This research has immediate practical applications for any system where people search for images through conversation. Imagine asking a virtual assistant for 'that photo from our trip last summer' and having it actually understand which photo you mean. The technique could enhance photo management apps, e-commerce search where customers describe products conversationally, or educational tools where students ask for visual examples. By transforming vague requests into precise search terms, AI systems could become much more useful in everyday situations where people naturally use references and incomplete descriptions rather than formal search queries.

Despite these promising results, the study reveals important limitations. Even the best rewritten queries fall short of the performance ceiling set by direct image captions, which achieved 27.8% Recall@1 on text-only data, indicating room for improvement in generating optimal retrieval queries. The researchers also observed that models sometimes experienced 'catastrophic forgetting' when trained on multimodal data after text-only training, losing some of their text-based reasoning capabilities. Additionally, different models showed complementary strengths: GLM-4.1V-9B-Thinking performed best on text-only tasks, while LLaVA-v1.6-Mistral-7B-HF excelled at leveraging multimodal context, suggesting no single approach yet dominates this challenging task.

Original Source

Read the complete research paper on arXiv.

About the Author

Guilherme A.

Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.
