Universities and colleges often struggle to efficiently extract and utilize data from student documents like course registration forms, especially when IT resources are limited. A new study demonstrates that a hybrid approach, blending traditional deterministic methods with large language models (LLMs), can achieve near-perfect accuracy in extracting information from PDFs while maintaining computational efficiency on consumer-grade hardware. This finding is particularly relevant for educational institutions in regions like Indonesia, where data privacy concerns and resource constraints make cloud-based solutions impractical, and direct database access is often restricted due to permissions and legacy systems.
The researchers evaluated three strategies for extracting data from Indonesian Study Plan (KRS) PDFs: LLM-only, a hybrid deterministic-LLM approach combining regex and LLMs, and a Camelot-based pipeline with LLM fallback. They tested these on 140 documents for the LLM-based evaluations and 860 for the Camelot pipeline, covering four study programs with variations in tables and metadata. Using three 12-14B-parameter LLMs (Gemma 3, Phi 4, and Qwen 2.5) run locally on a CPU-only laptop via Ollama, they found that the hybrid approach improved efficiency over LLM-only extraction, especially for deterministic metadata like student names and IDs. The Camelot-based pipeline with LLM fallback produced the best combination of accuracy and speed, with exact match and Levenshtein similarity scores up to 0.99-1.00 and processing times under one second per PDF in most cases.
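The hybrid split described above can be sketched in a few lines: stable metadata fields are pulled with regex in a deterministic pass, and only the variable course/lecturer tables are routed to an LLM. This is a minimal illustration, not the paper's actual code; the field labels "Nama" (name) and "NIM" (student ID) and the regex patterns are assumptions about a typical KRS header layout.

```python
import re

# Hypothetical KRS header labels -- "Nama" (name) and "NIM" (student ID)
# are assumptions for illustration, not the paper's actual patterns.
METADATA_PATTERNS = {
    "student_name": re.compile(r"Nama\s*:\s*(.+)"),
    "student_id": re.compile(r"NIM\s*:\s*(\d+)"),
}

def extract_metadata(text: str) -> dict:
    """Deterministic pass: regex over the PDF's raw text layer."""
    metadata = {}
    for field, pattern in METADATA_PATTERNS.items():
        match = pattern.search(text)
        metadata[field] = match.group(1).strip() if match else None
    return metadata

def extract_krs(text: str, llm_extract_tables) -> dict:
    """Hybrid pipeline: regex handles stable metadata; the variable
    course/lecturer tables are delegated to an LLM callable."""
    record = extract_metadata(text)
    record["courses"] = llm_extract_tables(text)  # LLM only where needed
    return record
```

Passing the LLM step in as a callable keeps the expensive neural component isolated to the one part of the document whose layout actually varies.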
The methodology involved extracting text from PDFs using the Fitz library, with pipelines designed to handle metadata via regex, course and lecturer lists via LLMs or Camelot, and Unicode normalization to address encoding issues like ligatures. The researchers used a few-shot prompt to guide LLMs in producing structured JSON output, with strict rules to avoid fabricating data. Evaluation metrics included exact match (EM) and Levenshtein similarity (LS), with a threshold of 0.7 to tolerate minor errors such as missing punctuation or spacing. The experiments were conducted on a consumer-grade laptop with a Ryzen 5 CPU and 16GB RAM, simulating real-world constraints for institutions with limited computational resources.
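The evaluation metrics are straightforward to reproduce. Exact match is strict string equality after normalization; Levenshtein similarity scales the edit distance by the longer string's length, with predictions counted as acceptable at or above 0.7. The sketch below uses NFKC normalization, one standard way to expand PDF ligatures such as "ﬁ" into "fi"; whether the paper used exactly this normalization form is an assumption.

```python
import unicodedata

def normalize(text: str) -> str:
    """NFKC expands ligatures (e.g. 'ﬁ' -> 'fi') that PDF text layers
    often emit; splitting/joining collapses irregular whitespace."""
    return " ".join(unicodedata.normalize("NFKC", text).split())

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance, two rows at a time."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def levenshtein_similarity(pred: str, gold: str) -> float:
    """1.0 for identical strings, scaled by the longer string's length."""
    pred, gold = normalize(pred), normalize(gold)
    if not pred and not gold:
        return 1.0
    return 1 - levenshtein(pred, gold) / max(len(pred), len(gold))

THRESHOLD = 0.7  # tolerance for minor punctuation/spacing errors

def exact_match(pred: str, gold: str) -> bool:
    return normalize(pred) == normalize(gold)
```

Under this definition a prediction that differs from the gold string only by a trailing period, e.g. "Algoritma" vs. "Algoritma.", scores 0.9 and clears the 0.7 threshold, which is exactly the kind of minor error the metric is meant to tolerate.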
Results from the 140-PDF test showed that the Qwen 2.5:14b model demonstrated the most consistent performance across all scenarios, with high scores in both hybrid and LLM-only modes. For example, in the Informatics study program, the hybrid mode achieved perfect EM and LS scores of 1.000 across all models for metadata, courses, and lecturers. In contrast, Phi 4 struggled with lecturer lists, often omitting names when they appeared as advisors, leading to lower scores. The Camelot pipeline, tested on 860 PDFs, achieved throughput of less than one second per PDF, with the lattice flavor handling 799 documents at 0.99 accuracy and the stream flavor processing 52 perfectly, while LLM fallback addressed the remaining 9 cases. This represents a significant efficiency gain over the hybrid and LLM-only modes, which averaged 1.5-1.8 minutes per PDF.
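The cascade reported above — lattice flavor first, stream flavor when lattice cannot parse a document, LLM only as a last resort — can be expressed as a small dispatcher. In this sketch the extractors are passed in as callables so the control flow can be shown (and tested) without Camelot or an LLM installed; with camelot-py, a lattice extractor would wrap something like `camelot.read_pdf(path, flavor="lattice")` and inspect each table's `parsing_report`, but the exact accept/reject criterion the authors used is an assumption here.

```python
from typing import Callable, Optional

# Each extractor returns parsed rows, or None when it cannot handle the PDF.
Extractor = Callable[[str], Optional[list]]

def extract_with_fallback(pdf_path: str,
                          lattice: Extractor,
                          stream: Extractor,
                          llm: Extractor) -> tuple:
    """Try cheap deterministic parsers first; reserve the slow LLM
    (~1.5-1.8 min/PDF on CPU in the study) for the few PDFs both
    Camelot flavors reject. Returns (winning_strategy, rows)."""
    for name, extractor in (("lattice", lattice),
                            ("stream", stream),
                            ("llm", llm)):
        rows = extractor(pdf_path)
        if rows is not None:
            return name, rows
    raise ValueError(f"no extractor could parse {pdf_path}")
```

Because the deterministic flavors resolve 851 of the 860 PDFs, the per-document cost stays dominated by the sub-second Camelot path rather than the minute-scale LLM path.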
The implications of this research are substantial for educational institutions worldwide, particularly those in resource-constrained environments. By enabling accurate and fast data extraction from academic documents without relying on cloud services or expensive hardware, this hybrid approach can streamline administrative processes like attendance tracking and grade management while ensuring data privacy. The study highlights that deterministic methods like regex are effective for stable metadata, but LLMs or layout-aware tools like Camelot are necessary for handling variable table data, offering a balanced solution that reduces computational load and improves reliability. This could empower universities to better utilize their data for teaching evaluations and other purposes without overhauling legacy systems.
However, the study acknowledges several limitations. The experiments were conducted solely on text-based PDFs, not image-based or scanned documents, which limits applicability to OCR-dependent scenarios. Additionally, the models were tested on a CPU-only setup, and while throughput was improved with the Camelot pipeline, LLM-only extraction remained slow, taking over a minute per PDF. Future work could explore GPU acceleration to reduce inference times, model optimization through quantization, and expansion to image-based PDFs to broaden the system's scope. Despite these constraints, the findings confirm that integrating deterministic and neural methods is a reliable and efficient strategy for academic document extraction in computationally constrained environments.
About the Author
Guilherme A.
Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.