When financial analysts use AI to search through earnings reports, they often rely on systems that convert charts and graphs into text descriptions before processing. This common approach, however, may be losing critical details that affect accuracy. A new study from PricewaterhouseCoopers U.S. reveals that AI systems perform significantly better when they retrieve images directly, without first summarizing them into text, offering a more reliable way to handle complex financial documents.
The researchers found that direct multimodal embedding retrieval, where images are stored natively in a vector database, outperforms text-based summarization by a substantial margin. In their experiments, this approach achieved a 32% relative improvement in mean average precision (mAP@5), a key metric for retrieval accuracy, and a 20% boost in normalized discounted cumulative gain (nDCG@5), which measures ranking quality. Specifically, mAP@5 increased from 0.3963 for text-based retrieval to 0.5234 for direct image retrieval, while nDCG@5 rose from 0.5448 to 0.6543. These gains indicate that preserving visual information in its original form allows AI to match queries with relevant documents more effectively, especially for financial data where charts convey nuanced numerical trends.
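For readers who want to interpret these numbers, here is a minimal sketch of how AP@5 and nDCG@5 are computed under standard binary-relevance definitions (the paper's exact grading scheme is not detailed here, so treat this as illustrative):

```python
import math

def average_precision_at_k(relevant, retrieved, k=5):
    """AP@k: average of precision values at each rank where a relevant doc appears."""
    hits, total = 0, 0.0
    for rank, doc_id in enumerate(retrieved[:k], start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank  # precision at this rank
    denom = min(len(relevant), k)
    return total / denom if denom else 0.0

def ndcg_at_k(relevant, retrieved, k=5):
    """nDCG@k with binary relevance: DCG normalized by the ideal (best possible) DCG."""
    dcg = sum(1.0 / math.log2(rank + 1)
              for rank, doc_id in enumerate(retrieved[:k], start=1)
              if doc_id in relevant)
    ideal = sum(1.0 / math.log2(rank + 1)
                for rank in range(1, min(len(relevant), k) + 1))
    return dcg / ideal if ideal else 0.0
```

Scores like mAP@5 = 0.5234 are then the mean of AP@5 over all benchmark questions.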
The study compared two retrieval approaches using a newly created benchmark of 40 question-answer pairs from Fortune 500 company earnings calls. In the text-based approach, images from presentation slides were summarized into text using OpenAI GPT-5, and both these summaries and transcript chunks were embedded with OpenAI text-embedding-ada-002 for retrieval. In contrast, the direct multimodal approach used Jina Embeddings v4 to embed both text and images natively into a shared vector space, stored in Azure AI Search. This allowed queries to retrieve visual content without intermediate conversion, maintaining spatial relationships and numeric precision that are often lost in text summaries. The evaluation spanned six OpenAI language models, including GPT-4o, GPT-4.1, and GPT-5 variants, with retrieval metrics computed over the top-5 documents and answer quality assessed via LLM-as-a-judge pairwise comparisons.
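The shared-vector-space idea can be illustrated with a toy example. The vectors below are hand-made stand-ins (the study used Jina Embeddings v4 and Azure AI Search, neither of which is called here); the point is that slide images and transcript chunks sit in one index and are ranked by the same similarity function:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def top_k(query_vec, index, k=5):
    """Rank (doc_id, vector) entries by cosine similarity to the query."""
    ranked = sorted(index, key=lambda entry: cosine(query_vec, entry[1]), reverse=True)
    return [doc_id for doc_id, _ in ranked[:k]]

# Mixed-modality index: image and text embeddings in one shared space.
index = [
    ("slide_chart_q3_revenue", [0.9, 0.1, 0.0]),  # embedded slide image
    ("transcript_chunk_0012",  [0.2, 0.9, 0.1]),  # embedded transcript text
    ("slide_table_guidance",   [0.7, 0.3, 0.2]),  # embedded slide image
]
query = [1.0, 0.0, 0.1]  # e.g. an embedding of "Q3 revenue trend"
```

With these toy vectors, both slide images outrank the transcript chunk for the chart-oriented query, which is the behavior the direct approach enables without any image-to-text conversion step.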
The retrieval metrics showed consistent advantages for the direct multimodal approach. Beyond the improvements in mAP@5 and nDCG@5, Precision@5 increased from 0.480 to 0.540, and Recall@5 improved from 0.5362 to 0.5529. More importantly, end-to-end answer quality, evaluated by OpenAI GPT-5 judging across six criteria, favored the multimodal approach with an overall win rate of 0.612 compared to 0.388 for text-based retrieval. For larger models like GPT-5, the gap was even more pronounced: it achieved a win rate of 0.82, with particular strengths in correctness (0.70 vs. 0.30), numerical fidelity (0.80 vs. 0.20), and reducing unsupported additions or hallucinations (0.90 vs. 0.10). This suggests that direct image retrieval not only enhances retrieval accuracy but also leads to more factual and complete answers, crucial for financial analysis where errors can have significant consequences.
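The win rates above come from pairwise judging: for each question and criterion, the judge model sees both answers and picks a winner. A minimal tally looks like this (the verdict labels are illustrative; the study used GPT-5 as the judge across six criteria):

```python
from collections import Counter

def win_rates(verdicts):
    """Compute per-side win rates from a list of judge picks,
    one per (question, criterion) pair: 'multimodal' or 'text'."""
    counts = Counter(verdicts)
    total = len(verdicts)
    return {side: counts[side] / total for side in ("multimodal", "text")}
```

For example, 7 multimodal wins out of 10 comparisons yields rates of 0.7 and 0.3, mirroring how the reported 0.612 vs. 0.388 split was aggregated.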
The implications extend beyond academic research to practical applications in industries reliant on multimodal documents. Financial institutions, for example, could deploy AI systems that handle earnings calls and reports with greater precision, reducing the risk of misinformation from hallucinated details. The study highlights that while text-based summarization is convenient and compatible with existing infrastructure, it introduces information loss that degrades performance. As multimodal embedding models like Jina v4 become more accessible, organizations may prioritize them for tasks involving charts, tables, and diagrams, where visual context is essential. However, the benefits are most pronounced with larger AI models, indicating that effective utilization requires sufficient computational capacity for cross-modal reasoning.
Despite its advantages, the direct multimodal approach faces limitations, particularly in preprocessing complexity. Unlike text-based pipelines that use a single LLM call to convert images, multimodal retrieval requires additional steps such as image detection, extraction, and format conversion. Documents must be parsed to identify visual elements, which can be challenging across diverse formats like PowerPoint slides or PDF reports. Tools like Azure Document Intelligence or Unstructured.io offer partial solutions, but automated pipelines for robust segmentation and classification remain an area for future development. This operational overhead may slow adoption in production environments, suggesting that while the technical benefits are clear, practical implementation needs further refinement to reduce manual effort and ensure scalability.
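As a sketch of the segmentation and routing step, assume a parser emits typed elements (the schema below is hypothetical, not the actual output of Azure Document Intelligence or Unstructured.io); the multimodal pipeline must then send images and text down separate embedding paths:

```python
def route_elements(elements):
    """Split parsed document elements into the image path (embedded natively)
    and the text path (chunked and embedded as text).
    Each element is a dict with a 'type' key -- an assumed schema."""
    image_types = {"figure", "chart", "picture"}
    images = [e for e in elements if e["type"] in image_types]
    texts = [e for e in elements if e["type"] not in image_types]
    return images, texts
```

In a text-based pipeline this routing is unnecessary, since every element is reduced to text before embedding; the extra branch is part of the operational overhead the study describes.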
About the Author
Guilherme A.
Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.