AI's Document Dilemma: When to Read, When to See

In an era where documents drive everything from healthcare to finance, artificial intelligence systems that can read and understand them are increasingly critical. Yet a new study reveals that these AI tools are far from universal: their performance varies dramatically depending on the type of document they're processing. Researchers have developed a framework called DISCO that systematically evaluates when to use traditional text extraction methods versus modern vision-language models, providing practical guidance for real-world applications.

The key finding from the DISCO evaluation is that no single approach works best across all document types. For handwritten text, specialized optical character recognition (OCR) systems remain more reliable, achieving a character error rate of 0.087 on the IAMDISCO dataset compared to 0.171 for vision-language models (VLMs) with generic prompts. For multilingual text, however, the advantage reverses: VLMs significantly outperform OCR, cutting character error from 5.53% to 0.73% on the ICDARDISCO dataset with task-aware prompting. Medical prescriptions in French proved particularly challenging for all methods, with both OCR and VLMs showing similarly high error rates of roughly 0.654 to 0.660.
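For readers unfamiliar with the metric, character error rate is commonly defined as the edit distance between the system output and the reference transcription, divided by the reference length. The sketch below assumes that standard definition; the article does not spell out the exact normalization DISCO applies, and the function names are illustrative.

```python
def levenshtein(reference: str, hypothesis: str) -> int:
    """Minimum number of character insertions, deletions, and substitutions."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            substitution_cost = 0 if ref_char == hyp_char else 1
            current.append(min(
                previous[j] + 1,                      # delete ref_char
                current[j - 1] + 1,                   # insert hyp_char
                previous[j - 1] + substitution_cost,  # match or substitute
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance normalized by reference length (assumed definition)."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return levenshtein(reference, hypothesis) / len(reference)
```

Under this definition, a score of 0.087 means about 8.7 character edits per 100 reference characters. The value can exceed 1.0 when the output is much longer than the reference.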

The researchers employed a diagnostic methodology that separates text parsing from question answering, allowing them to pinpoint where different systems succeed or fail. They tested three main approaches: OCR-based pipelines that first extract text and then answer questions, two-stage VLM pipelines that parse and then answer, and direct VLM question answering that processes images end-to-end. The evaluation covered diverse document types including handwritten text, multilingual scripts, medical forms, infographics, and multi-page documents, with each dataset limited to 500 samples for computational feasibility. This stage-wise protocol enabled attribution of errors to perception, representation, or reasoning stages rather than just reporting final accuracy.
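The three approaches can be sketched structurally as follows. This is a minimal illustration, not the researchers' code: the parser and answerer are passed in as plain callables, and every name here is hypothetical. The point is that the OCR pipeline and the two-stage VLM pipeline share one shape and differ only in the parser, and that keeping the intermediate text lets parsing be scored separately from answering.

```python
from typing import Callable, Dict, Optional

Parser = Callable[[bytes], str]         # document image -> extracted text
TextQA = Callable[[str, str], str]      # (text, question) -> answer
ImageQA = Callable[[bytes, str], str]   # (image, question) -> answer


def answer_two_stage(image: bytes, question: str,
                     parse: Parser, answer: TextQA) -> Dict[str, Optional[str]]:
    """Parse first, then answer from the parsed text.

    Covers both the OCR-based pipeline and the two-stage VLM pipeline;
    only the `parse` callable changes between them.
    """
    parsed_text = parse(image)
    return {"parsed_text": parsed_text, "answer": answer(parsed_text, question)}


def answer_direct(image: bytes, question: str,
                  answer_from_image: ImageQA) -> Dict[str, Optional[str]]:
    """Answer end-to-end from the image; there is no intermediate text to score."""
    return {"parsed_text": None, "answer": answer_from_image(image, question)}
```

With the two-stage shape, a wrong answer paired with a correct `parsed_text` points to the reasoning stage, while a garbled `parsed_text` points to perception.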

The results show clear patterns based on document structure. For single-page documents like forms and infographics, direct VLM question answering performed best, achieving a ground-truth-in-prediction score of 0.908 on DocVQADISCO compared to 0.876 for OCR-based approaches. However, for multi-page documents in the DUDEDISCO dataset, OCR pipelines maintained an advantage with a score of 0.562 versus 0.498 for direct VLM approaches. The data also revealed that task-aware prompting had mixed effects: it improved VLM performance on handwriting recognition but offered limited gains on medical prescriptions. Interestingly, OCR system selection mattered too: azure-ai-documentintelligence outperformed mistral-ocr-2505 by 3.3 percentage points on single-page forms but performed comparably on multi-page documents.
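The article names the ground-truth-in-prediction score without defining it. One plausible reading, assumed here, is that a prediction counts as correct when any accepted answer appears inside it after light normalization. The normalization rules (lowercasing, collapsing whitespace) and the function names below are assumptions for illustration, not the study's specification.

```python
from typing import List, Sequence


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace (assumed normalization)."""
    return " ".join(text.lower().split())


def gt_in_pred(prediction: str, ground_truths: Sequence[str]) -> bool:
    """True if any non-empty accepted answer is a substring of the prediction."""
    normalized_prediction = normalize(prediction)
    return any(
        normalize(answer) in normalized_prediction
        for answer in ground_truths
        if answer.strip()
    )


def gt_in_pred_score(predictions: List[str],
                     references: List[Sequence[str]]) -> float:
    """Fraction of predictions that contain an accepted answer."""
    if not predictions or len(predictions) != len(references):
        raise ValueError("need equally sized, non-empty inputs")
    hits = sum(gt_in_pred(p, r) for p, r in zip(predictions, references))
    return hits / len(predictions)
```

A containment check like this is lenient toward verbose answers, but it still misses a correct value written in a different format, such as "42 dollars" against a reference of "$42".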

These findings have immediate practical implications for organizations implementing document intelligence systems. The research suggests a dual strategy: use OCR-based pipelines for complex text, long documents, and text-heavy reasoning where structured representations are essential, while employing VLM-based approaches for visually grounded documents like infographics and multilingual content where spatial layout matters. This document-aware selection could improve accuracy in applications ranging from processing medical records to analyzing financial reports, potentially reducing errors in critical domains.
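As a rough illustration, the dual strategy could be expressed as a small routing function. The document attributes, the pipeline labels, and the order in which the rules are checked are all assumptions made for this sketch. The study reports which approach did better per document type, not a routing algorithm.

```python
from dataclasses import dataclass
from enum import Enum


class Pipeline(Enum):
    OCR_THEN_QA = "ocr_then_qa"
    VLM_TASK_AWARE = "vlm_task_aware_prompt"
    VLM_DIRECT_QA = "vlm_direct_qa"


@dataclass(frozen=True)
class DocumentProfile:
    """Hypothetical per-document metadata used for routing."""
    page_count: int
    handwritten: bool = False
    multilingual: bool = False


def choose_pipeline(doc: DocumentProfile) -> Pipeline:
    # Multi-page and handwritten documents favored OCR in the reported results.
    if doc.page_count > 1 or doc.handwritten:
        return Pipeline.OCR_THEN_QA
    # Multilingual text favored VLMs with task-aware prompting.
    if doc.multilingual:
        return Pipeline.VLM_TASK_AWARE
    # Single-page forms and infographics favored direct VLM question answering.
    return Pipeline.VLM_DIRECT_QA
```

For example, `choose_pipeline(DocumentProfile(page_count=12))` routes a long report to the OCR pipeline, while a single-page infographic falls through to direct VLM question answering. How to resolve conflicts, such as a handwritten multilingual page, is a judgment call the article does not settle.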

The study acknowledges several limitations that point to future research directions. The evaluation focused on contexts where full documents could be processed directly, without assessing retrieval mechanisms needed for practical long-document systems spanning dozens or hundreds of pages. While the ICDARDISCO dataset revealed the need for better multilingual support, the suite primarily covers English and French text in Latin scripts, leaving non-Latin scripts and culturally specific layouts under-evaluated. The researchers also note metric limitations—models frequently located correct information but failed to format answers appropriately, revealing a gap between answer localization and output formatting that future work should address.

About the Author

Guilherme A.