AIResearch
Coding

AI Reads 125-Page Documents on a Single Chip

A new AI model processes business documents with state-of-the-art accuracy while running on affordable hardware, making advanced document understanding accessible to enterprises.

AI Research
March 27, 2026
4 min read

Businesses and organizations worldwide are drowning in paperwork—invoices, contracts, reports, and forms that require manual review or expensive AI systems to process. A new AI model called Arctic-Extract offers a solution by delivering high-performance document understanding on hardware that is both affordable and widely available. Developed by researchers at Snowflake AI, this model can handle documents up to 125 pages long on a single A10 GPU with 24GB of memory, matching or surpassing the capabilities of much larger and costlier systems. This breakthrough addresses a critical gap in enterprise technology, where many advanced AI models are too resource-intensive for practical use, limiting their adoption in real-world scenarios.

The core achievement of Arctic-Extract is its ability to extract structured data from documents with exceptional accuracy while maintaining remarkable efficiency. The model excels at tasks such as question answering, entity extraction, and table extraction, processing both digital-born and scanned documents, including those with handwritten content. According to the paper, Arctic-Extract achieves an average score of 64.2 on visual document understanding tasks, outperforming models like GPT5 (58.5) and Claude 4 Sonnet (51.3). On the SQuAD2.0 benchmark for English text comprehension, it scores 78.94 on the ANLS* metric, closely trailing the top-performing Llama 3.1 405B (80.37) but doing so with far fewer resources. This performance is particularly notable because Arctic-Extract weighs only 6.6 GiB, making it deployable on devices with limited memory, a stark contrast to models with hundreds of billions of parameters that require extensive computing power.
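To make the reported numbers concrete, here is a minimal sketch of the classic ANLS (Average Normalized Levenshtein Similarity) scoring scheme used in document QA benchmarks. The paper reports ANLS*, a variant; this sketch implements the standard formulation (edit-distance similarity, zeroed below a 0.5 threshold, best match over valid gold answers), and the helper names are our own, not from the paper.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def anls(predictions, gold_answers, tau=0.5):
    """ANLS over QA pairs: predictions is a list of strings,
    gold_answers a parallel list of lists of acceptable answers."""
    total = 0.0
    for pred, golds in zip(predictions, gold_answers):
        best = 0.0
        for gold in golds:
            p, g = pred.strip().lower(), gold.strip().lower()
            denom = max(len(p), len(g)) or 1
            sim = 1.0 - levenshtein(p, g) / denom
            # Similarities below the threshold count as zero.
            best = max(best, sim if sim >= tau else 0.0)
        total += best
    return total / len(predictions)
```

Under this scheme an exact match scores 1.0, a near-miss like a single dropped letter still earns partial credit, and anything less than half-similar to every gold answer scores 0, which is why ANLS rewards robust extraction from noisy scans.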

The researchers built Arctic-Extract using the architecture of Qwen2.5-VL, chosen for its unique token compression mechanism that merges four tokens into one vision token. This compression allows a standard A4 page to be represented with about 1000 tokens, enabling the model to fit more content into its 128,000-token context window. To optimize for efficiency, the team fine-tuned the model using LoRA (Low-Rank Adaptation), a technique that adjusts only a small subset of parameters rather than retraining the entire system. After fine-tuning, they merged the LoRA adaptations and applied 4-bit quantization using AWQ, reducing the model's memory footprint without significant loss in performance. The training involved 372,544 data points across 35 datasets, including public ones like DocVQA and internal collections for specialized tasks like table extraction, which required purpose-built datasets due to the lack of existing public resources.
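The 125-page figure follows from simple arithmetic on the numbers above. A back-of-the-envelope sketch, using the article's roughly 1,000 tokens per A4 page and a hypothetical reserve for prompt and output tokens (the exact reserve is our assumption, not from the paper):

```python
# Does a 125-page document fit in a 128K context window?
TOKENS_PER_PAGE = 1_000   # approximate figure after Qwen2.5-VL's 4-to-1 token merging
CONTEXT_WINDOW = 128_000  # Qwen2.5-VL context length
PROMPT_BUDGET = 2_000     # hypothetical allowance for instructions and generated output

def max_pages(context=CONTEXT_WINDOW, per_page=TOKENS_PER_PAGE, reserved=PROMPT_BUDGET):
    """Largest whole number of pages that fits after reserving prompt/output tokens."""
    return (context - reserved) // per_page

print(max_pages())  # 126 under these assumptions, consistent with the 125-page claim
```

Without the 4-to-1 token merging, each page would cost roughly four times as many tokens and the same window would hold only about 31 pages, which is why the compression mechanism is central to the long-document claim.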

Evaluations demonstrate Arctic-Extract's versatility and robustness across diverse document types and languages. On multilingual tasks, it achieved an average score of 70.7, ahead of models such as Llama 3.1 405B (67.0) and GPT5 (66.8), with strong performance in languages ranging from English and Spanish to Japanese and Korean. For table extraction—a complex task that transforms unstructured document information into structured tables—Arctic-Extract scored 0.720, slightly behind its predecessor Arctic-TILT (0.725) but ahead of Claude 4 Sonnet (0.707) and GPT5 (0.651). On the DocVQA benchmark, it achieved a score of 0.947, the highest among automated systems and just below human performance (0.9811). These results highlight the model's ability to handle real-world scenarios, such as documents with poor scan quality or complex layouts, where other models often fail due to input size limitations.

The implications of this research are significant for enterprises seeking to automate document processing without incurring high costs. Arctic-Extract's efficiency makes it suitable for deployment on standard hardware, reducing barriers to adoption for businesses that lack access to expensive computing infrastructure. Its support for 29 languages, including Arabic, Chinese, and French, extends its utility to global operations, while its ability to extract tables and entities can streamline workflows in finance, legal, and administrative sectors. However, the paper notes limitations, including performance on certain specialized tasks where the model slightly trails Arctic-TILT, and the inherent difficulty of table extraction, which involves dealing with diverse structures, hierarchical headers, and data spread across multiple sources. Additionally, the training data included many internal datasets, which may limit reproducibility for the broader research community, though the authors provide references to overlapping datasets from prior work.

Looking ahead, Arctic-Extract sets a new benchmark for document understanding models by balancing performance with practicality. Its development underscores a shift in AI research toward creating solutions that are not only advanced but also accessible, addressing the real-world needs of industries burdened by document-heavy processes. As businesses continue to digitize, tools like this could become essential for efficient and accurate data extraction, paving the way for more widespread use of AI in everyday operations. The researchers have made their model and methodologies transparent, offering a foundation for future innovations in resource-efficient AI systems.

Original Source

Read the complete research paper

View on arXiv

About the Author

Guilherme A.

Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.

Connect on LinkedIn