In the sprawling digital archives of scientific research, tables hold a treasure trove of data, from experimental results to statistical summaries, that is essential for automated analysis and data integration. Yet extracting these tables from PDF documents remains a formidable challenge, as current AI-driven systems struggle with the sheer diversity of layouts, formats, and content found in real-world documents. A study by researchers from Inria, ENS, and BRGM introduces a comprehensive benchmark that rigorously evaluates end-to-end table extraction (TE) techniques, revealing significant shortcomings in generalizability, robustness, and interpretability across state-of-the-art models. This work not only highlights the persistent hurdles in TE but also provides a much-needed framework for future advancements, emphasizing that while progress has been made, flawless table extraction is far from realized.
The methodology behind this benchmark is both meticulous and innovative, designed to assess TE systems from PDF input to structured table output. The researchers developed a rigorous evaluation process that includes novel metrics for table detection (TD), table structure recognition (TSR), and end-to-end TE, capturing model uncertainty through confidence scores. They compiled three diverse datasets: PubTables-Test, derived from biomedical publications; Table-arXiv, a new heterogeneous collection from arXiv preprints; and Table-BRGM, a domain-specific set from geological reports in French and English. These datasets, totaling over 37,000 samples, were used to test a wide array of systems, including rule-based Python libraries like PDFPlumber and Camelot, machine learning tools such as Grobid, large vision language models (LVLMs) like GPT-4o mini, and computer vision-based approaches including TATR-extract and VGT+TATR-structure. Each system was evaluated on its ability to detect tables, recognize their structure and content, and provide reliable confidence estimates, with metrics like Intersection-over-Union (IoU) for TD and GriTS and TEDS for TSR ensuring a fair comparison.
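To make the detection metric concrete, here is a minimal Python sketch of Intersection-over-Union between a predicted and a ground-truth table region. The box coordinates and the 0.5 matching threshold are illustrative assumptions, a common convention rather than necessarily the paper's exact setup:

```python
def iou(box_a, box_b):
    """Intersection-over-Union of two axis-aligned boxes (x0, y0, x1, y1)."""
    ix0, iy0 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix1, iy1 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# A predicted table region typically counts as a true positive when its
# IoU with a ground-truth region clears a threshold; 0.5 is a common
# choice (the benchmark's exact threshold may differ).
pred = (100, 200, 500, 450)   # hypothetical predicted box
truth = (110, 210, 505, 440)  # hypothetical ground-truth box
print(f"IoU = {iou(pred, truth):.3f}")
```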
The experimental results paint a stark picture of the current state of table extraction, with performance varying dramatically across datasets and systems. On the PubTables dataset, models like TATR-extract and Docling achieved near-perfect scores in table detection, thanks to their training on similar data, but their performance plummeted on more heterogeneous sets like Table-arXiv and Table-BRGM. Probabilistic models such as VGT+TATR-structure showed better calibration and interpretability, with confidence scores that aligned more closely with actual precision, whereas TATR-based systems often produced meaningless confidence estimates. In table structure recognition, GriTS Topology scores were generally high, indicating that models could often identify row and column layouts correctly, but GriTS Content and TEDS scores were lower, reflecting persistent issues in accurately extracting textual content, especially with complex elements like merged cells or mathematical formulas. The LVLM, despite its cost-efficiency, suffered from hallucinations and poor spatial accuracy, underscoring the limitations of zero-shot learning for precise extraction tasks.
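The calibration finding can be illustrated with a standard reliability-diagram check: bin detections by confidence and compare each bin's mean confidence to its empirical precision. This is a generic sketch of that idea, not the paper's exact protocol, and the toy numbers are invented:

```python
import numpy as np

def reliability_bins(confidences, correct, n_bins=5):
    """Bin detections by confidence score and report, per bin, the mean
    confidence versus the fraction of correct detections; a well-calibrated
    model shows the two tracking each other closely."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=bool)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    rows = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        # Half-open bins; a score of exactly 1.0 would need the last
        # bin made inclusive in a production implementation.
        mask = (confidences >= lo) & (confidences < hi)
        if mask.any():
            rows.append((lo, hi, confidences[mask].mean(),
                         correct[mask].mean(), int(mask.sum())))
    return rows

# Hypothetical detection confidences and whether each detection matched
# a ground-truth table (e.g., IoU above threshold).
conf = [0.95, 0.90, 0.85, 0.60, 0.55, 0.30]
hit = [True, True, False, True, False, False]
for lo, hi, mean_c, prec, n in reliability_bins(conf, hit):
    print(f"[{lo:.1f}, {hi:.1f}): conf={mean_c:.2f} precision={prec:.2f} n={n}")
```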
The implications of these findings are profound for fields reliant on data extraction, such as scientific research, business analytics, and data lake management. The benchmark reveals that no single system excels universally; instead, the choice of tool must be tailored to document characteristics, with object detection-based models like Docling and VGT offering the best overall performance but still falling short in robustness. This variability underscores the need for improved generalizability in AI models, potentially through better training on diverse datasets or hybrid approaches that combine visual and textual cues. Moreover, the study's emphasis on end-to-end evaluation, rather than assessing subtasks in isolation, provides a more realistic measure of real-world applicability, helping users avoid the pitfalls of overoptimistic metrics that ignore the cascading errors between detection and structure recognition.
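The cascading-error point is easy to see in code. The sketch below scores each ground-truth table by the structure score of its best-overlapping detection, so a missed table contributes zero regardless of how good the structure model is. The `structure_score` callable is a placeholder standing in for a TSR metric like GriTS or TEDS; this is a simplified illustration, not the benchmark's exact scoring procedure:

```python
def iou(a, b):
    """IoU of two (x0, y0, x1, y1) boxes, as in the earlier sketch."""
    w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = w * h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union else 0.0

def end_to_end_score(predictions, ground_truth, structure_score, iou_threshold=0.5):
    """Pair each ground-truth table with its best-overlapping detection and
    credit that pair's structure score; an undetected table scores zero,
    so detection errors cascade into the end-to-end result."""
    scores = []
    for gt_box, gt_struct in ground_truth:
        best_iou, best_struct = 0.0, None
        for p_box, p_struct in predictions:
            overlap = iou(p_box, gt_box)
            if overlap > best_iou:
                best_iou, best_struct = overlap, p_struct
        if best_iou >= iou_threshold:
            scores.append(structure_score(best_struct, gt_struct))
        else:
            scores.append(0.0)  # missed table: the detection failure cascades
    return sum(scores) / len(scores) if scores else 0.0

# Toy usage: tables as tuples of row tuples; exact cell-grid agreement
# stands in for a real TSR metric.
gt = [((0, 0, 10, 10), (("a", "b"), ("1", "2")))]
pred = [((1, 1, 10, 10), (("a", "b"), ("1", "2")))]
exact = lambda p, g: 1.0 if p == g else 0.0
print(end_to_end_score(pred, gt, exact))  # 1.0: detected and correctly structured
```

Evaluating TD and TSR separately would hide exactly this coupling: a perfect structure model applied to a missed or badly localized table still yields an unusable extraction.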
Despite its thoroughness, the study has limitations, including a focus on scientific PDFs, which may not fully represent other document types like scanned images or business reports. The reliance on PDFAlto for token extraction introduced biases, particularly with stylized text such as equations, and the benchmark did not explore the impact of document rotation or extreme layout variations. Future work could address these gaps by incorporating more diverse data sources, developing recalibration techniques for confidence scores, and enhancing models to handle multilingual and multimodal content. The researchers have made their benchmark publicly available, encouraging further innovation in a field where, as this study conclusively shows, table extraction remains an unsolved problem demanding continued attention and refinement.
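On the recalibration direction the authors flag as future work, one standard approach is to fit a monotone mapping from raw confidences to observed precision on held-out data. This is a generic sketch using scikit-learn's isotonic regression, not a technique the paper itself implements, and the numbers are invented:

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

# Hypothetical held-out detections: raw model confidences and whether
# each one actually matched a ground-truth table.
raw_conf = np.array([0.95, 0.90, 0.85, 0.60, 0.55, 0.30])
was_correct = np.array([1, 1, 0, 1, 0, 0])

# Fit a monotone map from confidence to empirical correctness rate.
recalibrator = IsotonicRegression(out_of_bounds="clip")
recalibrator.fit(raw_conf, was_correct)

# Recalibrated scores for new detections now track observed precision.
print(recalibrator.predict(np.array([0.9, 0.6, 0.4])))
```

A post-hoc step like this could make the confidence scores of systems such as TATR as interpretable as those of the better-calibrated probabilistic models, without retraining the detector.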