Artificial intelligence systems that process both images and text have shown remarkable capabilities in recent years, but they face a fundamental challenge when dealing with multiple images at once. These multimodal large language models (MLLMs) often fail to grasp the complex relationships between different images, limiting their usefulness in scientific research, medical diagnosis, and other fields where understanding connections between visual data is crucial.
The key finding is that current AI models struggle with three main issues when processing multiple images: they lack fine-grained understanding of disparate information, their reasoning capabilities diminish, and they cannot effectively synthesize information from multiple inputs. This limitation becomes particularly problematic when dealing with temporal sequences, spatial relationships, or contextual connections between images.
To address these challenges, researchers developed a new method called QG-CoC (Question-Guided Chain-of-Captions) that breaks down complex image analysis tasks into simpler steps. The approach works by first decomposing the main question about multiple images into smaller, more manageable sub-questions. For each sub-question, the system generates specific captions that focus only on the relevant aspects of the images. Finally, it combines these sub-question and sub-answer pairs into a coherent chain of reasoning that leads to the final answer.
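The three-stage process described above can be sketched in code. This is a minimal illustration, not the authors' implementation: `ask_model` is a placeholder for whatever multimodal LLM call is available, and the prompt wording is assumed rather than taken from the paper.

```python
def qg_coc(question, images, ask_model):
    """Minimal sketch of a Question-Guided Chain-of-Captions pipeline.

    `ask_model(prompt, images)` is a hypothetical stand-in for any
    multimodal LLM API; the prompts below are illustrative only.
    """
    # Step 1: decompose the main question into simpler sub-questions.
    sub_questions = ask_model(
        "Break this question about the images into simpler "
        f"sub-questions, one per line:\n{question}",
        images,
    ).splitlines()

    # Step 2: for each sub-question, caption only the relevant parts
    # of the images, then answer it from that focused caption.
    chain = []
    for sq in sub_questions:
        caption = ask_model(
            f"Describe only what in the images is relevant to: {sq}",
            images,
        )
        answer = ask_model(
            f"Given this caption:\n{caption}\nAnswer: {sq}",
            images,
        )
        chain.append((sq, answer))

    # Step 3: combine the sub-question/sub-answer pairs into one
    # reasoning context and answer the original question.
    context = "\n".join(f"Q: {sq}\nA: {a}" for sq, a in chain)
    return ask_model(
        f"Using these intermediate findings:\n{context}\n"
        f"Answer the original question: {question}",
        images,
    )
```

Each stage is just another call to the underlying model, so the method trades extra inference calls (and tokens) for a more structured reasoning path, which matches the computational overhead the authors acknowledge.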
The results demonstrate significant improvements across multiple benchmarks. On the MMIU dataset, which tests multi-image understanding, the QG-CoC method achieved accuracy improvements of up to 12.2 percentage points compared to standard approaches. The method showed consistent gains across different types of AI models, including both closed-source systems like GPT-4o and Gemini-1.5-Flash, and open-source models like LLaVA-OneVision-7B and Mantis-idefics2-8B.
This advancement matters because many real-world applications require understanding relationships between multiple images. In medical imaging, doctors often need to compare scans taken at different times to track disease progression. In scientific research, researchers analyze sequences of experimental results. Even in everyday scenarios, understanding how different visual elements relate to each other is essential for tasks like navigation, object tracking, and scene comprehension.
The research does have limitations. The method relies heavily on the quality of the captioning process, and less advanced AI models might not benefit as much from this approach. Additionally, the study acknowledges that it doesn't cover all possible multi-image scenarios, particularly those involving complex geometric shapes or detailed 2D and 3D spatial information. The method also introduces some computational overhead, requiring more processing time and token usage compared to simpler approaches, though the researchers argue this is a worthwhile trade-off given the accuracy improvements.
About the Author
Guilherme A.
Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.