Look Twice: Training-Free Evidence Highlighting in Multimodal Large Language Models

TL;DR

A training-free method helps AI identify key visual and text evidence, boosting accuracy on complex questions with no model changes needed.

When AI systems answer questions about images, they often struggle to sift through noisy or irrelevant information, leading to incorrect or unfounded responses. This is particularly acute in knowledge-based visual question answering (KB-VQA), where models must combine visual cues with external textual knowledge, such as Wikipedia articles, to generate accurate answers. Researchers from the University of Modena and Reggio Emilia have developed a novel approach, called Look Twice (LoT), that addresses this issue by enabling AI models to better identify and focus on the most relevant evidence during inference, all without requiring any additional training or architectural changes. This advancement could enhance the reliability of AI in applications ranging from educational tools to automated content analysis, where precise information retrieval is critical.

The key finding of the research is that pretrained multimodal large language models (MLLMs) can significantly improve their evidence selection by leveraging their own internal attention patterns. The LoT framework operates by having the model generate a single token initially, which is used to analyze attention signals between the question, image, and retrieved text. From this analysis, the system identifies which visual regions and textual sentences are most relevant to the query. For example, in a question about a butterfly's wing color, LoT pinpoints the butterfly in the image and highlights sentences describing its coloration from the retrieved context. This process is entirely training-free, meaning it works with off-the-shelf models like Qwen2-VL and InternVL3.5 without any parameter updates, making it a practical and scalable solution for real-world deployment.

The methodology behind LoT involves a two-pass inference strategy that exploits the model's attention dynamics. In the first pass, the model generates one token, and the researchers extract attention matrices to estimate relevance. For visual evidence, they aggregate object-conditioned attention between question tokens and visual tokens to create a relevance map, then filter out spurious activations caused by attention sinks, which are tokens that disproportionately attract attention regardless of semantic importance. This filtering step, illustrated in Figure 2 of the paper, removes irrelevant visual tokens by analyzing hidden-state dimensions, ensuring that only query-relevant regions are highlighted. For textual evidence, attention from the generated token to context tokens is used to identify informative sentences, with those exceeding a threshold selected for highlighting. The selected cues are then marked with lightweight prompt-level markers, guiding the model to re-attend to them during the second pass when generating the final answer.
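To make the first-pass scoring more concrete, here is a minimal Python sketch of the two selection steps described above. It is an illustration under stated assumptions, not the authors' implementation: the function names, the mean aggregation, the precomputed `sink_mask`, the toy attention values, and the `[EVIDENCE]` markers are all hypothetical. The paper derives sink filtering from hidden-state dimensions, a step this sketch does not reproduce.

```python
from typing import List, Sequence, Tuple


def visual_relevance(attn_q_to_v: Sequence[Sequence[float]],
                     sink_mask: Sequence[bool]) -> List[float]:
    """Mean question-to-visual attention per visual token, with sinks zeroed.

    attn_q_to_v[i][j] is the attention from question token i to visual token j.
    sink_mask[j] is True when visual token j is treated as an attention sink.
    """
    n_question = len(attn_q_to_v)
    n_visual = len(attn_q_to_v[0])
    scores = []
    for j in range(n_visual):
        mean_attn = sum(row[j] for row in attn_q_to_v) / n_question
        # Assumption: sinks are flagged beforehand; here they are simply zeroed.
        scores.append(0.0 if sink_mask[j] else mean_attn)
    return scores


def select_sentences(token_attn: Sequence[float],
                     sentence_spans: Sequence[Tuple[int, int]],
                     threshold: float) -> List[int]:
    """Indices of sentences whose mean token attention exceeds the threshold.

    token_attn[k] is the attention from the first generated token to context
    token k; each span is a (start, end) pair with end exclusive.
    """
    selected = []
    for idx, (start, end) in enumerate(sentence_spans):
        span = token_attn[start:end]
        if span and sum(span) / len(span) > threshold:
            selected.append(idx)
    return selected


def mark_sentences(sentences: Sequence[str],
                   selected: Sequence[int],
                   open_tag: str = "[EVIDENCE]",
                   close_tag: str = "[/EVIDENCE]") -> str:
    """Wrap selected sentences in (hypothetical) prompt-level markers."""
    chosen = set(selected)
    return " ".join(
        f"{open_tag}{s}{close_tag}" if i in chosen else s
        for i, s in enumerate(sentences)
    )


# Toy inputs: 2 question tokens, 4 visual tokens, the last one flagged as a sink.
relevance = visual_relevance(
    [[0.05, 0.50, 0.05, 0.40],
     [0.15, 0.40, 0.05, 0.40]],
    sink_mask=[False, False, False, True],
)
top_region = max(range(len(relevance)), key=lambda j: relevance[j])

# Toy context: 3 sentences of 3 tokens each.
sentences = [
    "The monarch is a butterfly.",
    "Its wings are orange with black veins.",
    "It migrates long distances.",
]
token_attn = [0.02, 0.03, 0.01, 0.20, 0.30, 0.25, 0.05, 0.04, 0.10]
spans = [(0, 3), (3, 6), (6, 9)]
selected = select_sentences(token_attn, spans, threshold=0.1)

print(top_region, selected)
print(mark_sentences(sentences, selected))
```

In this toy run the sink token would otherwise tie for high attention, but zeroing it leaves the second visual token as the most relevant region, and only the sentence about wing color clears the threshold. The marked context would then be fed back to the model for the second pass.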

Experimental results across multiple KB-VQA benchmarks demonstrate consistent improvements. As shown in Table 1, LoT enhanced performance on datasets like Encyclopedic-VQA, InfoSeek, OVEN, and ViQuAE across various model sizes. For instance, Qwen2-VL-2B improved from 5.4 to 10.3 on InfoSeek, while InternVL3.5-4B increased from 36.4 to 45.6 on ViQuAE. Larger models also benefited, with InternVL3.5-38B rising from an average score of 34.1 to 37.5 across benchmarks. Ablation studies in Table 2 confirmed that both visual and textual highlighting contribute to these gains, with their combination yielding the best results. Additionally, LoT proved effective on standard MLLM benchmarks without retrieved text, as seen in Table 3, where visual highlighting alone improved performance on tasks like RealWorldQA and hallucination evaluation, indicating broader applicability beyond KB-VQA.
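To put the quoted scores in perspective, the short Python snippet below computes the absolute and relative gains for the three examples mentioned above. The `gain` helper and the dictionary labels are illustrative names, and the figures are only those cited in this article, not the full Table 1.

```python
from typing import Tuple


def gain(before: float, after: float) -> Tuple[float, float]:
    """Return (absolute gain in points, relative gain in percent), rounded to 1 decimal."""
    absolute = after - before
    relative_pct = 100.0 * absolute / before
    return round(absolute, 1), round(relative_pct, 1)


# Scores quoted in the article: (baseline, with LoT).
reported = {
    "Qwen2-VL-2B on InfoSeek": (5.4, 10.3),
    "InternVL3.5-4B on ViQuAE": (36.4, 45.6),
    "InternVL3.5-38B benchmark average": (34.1, 37.5),
}

for name, (before, after) in reported.items():
    absolute, relative_pct = gain(before, after)
    print(f"{name}: +{absolute} points ({relative_pct}% relative)")
```

The relative gain is largest for the smallest model, which starts from a low baseline, while the 38B model's improvement is smaller in both absolute and relative terms.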

The implications of this research are significant for improving AI's ability to handle complex, multimodal queries in real-world scenarios. By enabling models to focus on relevant evidence without retraining, LoT avoids the cost of fine-tuning and preserves the base model's generalization, making it suitable for applications in education, healthcare, and content moderation where accurate, evidence-based responses are essential. However, the study acknowledges limitations, such as the reliance on attention patterns that may not always perfectly correlate with relevance, and the need for threshold parameters in evidence selection. Future work could explore adaptive thresholds or integration with other inference-time methods to further enhance robustness and scalability.

About the Author

Guilherme A.

Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.
