Artificial intelligence systems designed to read and understand documents are systematically failing to process historical Black newspapers, according to a new study. This oversight is not due to a lack of technical capability but stems from how these AI systems are evaluated: current benchmarks prioritize modern, Western documents and ignore the unique layouts and typography of community-produced archives. The research, a systematic review of optical character recognition (OCR) and document understanding evaluations from 2006 to 2025, finds that this gap produces structural invisibility, in which errors specific to historical materials go unreported, potentially distorting digital representations of Black history.
The key finding is that state-of-the-art OCR systems, including vision-language models like olmOCR 2 and Qwen2.5-VL, are trained and evaluated on datasets that exclude historical Black newspapers. The study analyzed 13 major OCR systems and found that six rely on the IIT-CDIP database, which contains 42 million pages of mid-20th century tobacco litigation documents, while others use synthetic data or web-scale corpora focused on modern formats. As a result, these systems are not tested on the degraded scans, Victorian typefaces, and complex multi-column layouts characteristic of publications like Freedom’s Journal (1827) and The North Star (1847), leading to performance gaps that remain hidden under standard evaluation metrics.
The methodology involved a systematic review using the PRISMA 2020 framework, analyzing 80 papers from databases including the ACM Digital Library and arXiv. Researchers extracted data on model architectures, training sources, and evaluation benchmarks, focusing on how systems report historical coverage. They also conducted a case study on The Weekly Advocate (1837), an early African American newspaper, testing three systems: Tesseract v5, Surya, and olmOCR 2. This qualitative analysis aimed to identify failure patterns beyond aggregate scores, such as layout errors and hallucinated text, which are not captured by common metrics like Character Error Rate (CER).
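Character Error Rate, mentioned above, is conventionally defined as the Levenshtein (edit) distance between the OCR output and a reference transcription, divided by the reference length. A minimal sketch of that standard formula (the function name is ours, not the study's):

```python
def character_error_rate(reference: str, hypothesis: str) -> float:
    """CER = edit_distance(reference, hypothesis) / len(reference).

    Classic dynamic-programming Levenshtein distance, where insertions,
    deletions, and substitutions each cost 1.
    """
    m, n = len(reference), len(hypothesis)
    # prev[j] holds the edit distance between reference[:i-1] and hypothesis[:j]
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution
        prev = curr
    return prev[n] / m if m else 0.0
```

Note that a page transcribed with every character correct but the columns read in the wrong order can still produce a misleading score, which is one reason the study argues character-level accuracy alone misses layout failures.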
The results show significant evaluation gaps. For instance, olmOCR 2 achieved 82.3% accuracy on historical math scans but dropped to 47.7% on general historical scans, while GOT-OCR 2.0 scored only 22.1% on the olmOCR-Bench. In the case study, Tesseract v5 misinterpreted vertical column rules as text, causing semantic merging of distinct content, while olmOCR 2 overwrote visual evidence with fabricated text, a phenomenon described as corrective hallucination. These errors are overlooked because benchmarks like OCRBench and OmniDocBench focus on modern documents; only KITAB-Bench explicitly includes historical materials, and none target Black press archives.
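The "corrective hallucination" failure described above, a model replacing hard-to-read glyphs with fluent but fabricated words, is easy to miss in aggregate scores precisely because the output stays plausible. A toy heuristic (our illustration, not the study's method) that aligns OCR output with a reference transcription and flags substitutions that are themselves valid words, i.e. plausible text silently replacing what the page actually says:

```python
import difflib

def fluent_substitutions(reference: str, hypothesis: str, vocabulary: set) -> list:
    """Align reference and OCR output word-by-word and return substituted
    words that are real vocabulary items: candidate 'corrective
    hallucinations', as opposed to obviously garbled output."""
    ref_words, hyp_words = reference.split(), hypothesis.split()
    matcher = difflib.SequenceMatcher(a=ref_words, b=hyp_words, autojunk=False)
    flagged = []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op == "replace":
            for word in hyp_words[j1:j2]:
                if word.lower() in vocabulary:
                    flagged.append(word)
    return flagged
```

For example, against the reference "colored citizens of new york", the output "colored citizens of new work" would be flagged (a fluent fabrication), while "colored citizens of new y0rk" would not (a visibly garbled error).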
The implications are profound for cultural preservation and historical accuracy. Black newspapers, such as the more than 2,000 titles in Howard University's Black Press Archives, are critical records of American political and social history, yet their digitization is hindered by AI systems that erase editorial logic. For example, multi-column layouts used to convey rhetorical meaning are often flattened into linear text, stripping away context. This represents more than a technical flaw; it is an epistemological issue in which evaluation choices reflect assumptions about which documents matter, potentially reinforcing historical biases and limiting access to marginalized narratives.
Limitations of the study include its focus on 19th-century Black newspapers in the United States, which may not generalize to other historical or linguistic contexts. The analysis prioritized qualitative failure modes over ground-truth transcriptions, reflecting the constraints of historical archives where such data is scarce. Future work is needed to develop layout-aware metrics that respect archival integrity, moving beyond character-level accuracy to include structural fidelity. Additionally, the study highlights the tension between making archives accessible and preventing problematic uses, such as over-historicization where AI generates plausible but false historical content, underscoring the need for culturally responsible evaluation practices.
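A layout-aware metric of the kind the study calls for could, for instance, score how faithfully a system preserves the reference reading order of text blocks, independently of character accuracy. A minimal sketch (the name and scoring scheme are our illustration, not a proposed standard): pairwise-order agreement, essentially a normalized Kendall tau over block identifiers.

```python
from itertools import combinations

def reading_order_agreement(reference_order: list, predicted_order: list) -> float:
    """Fraction of block pairs whose relative order the OCR system preserved.

    1.0 means blocks were read in exactly the reference order; values near
    0.5 mean the order is essentially scrambled (e.g. two columns merged
    line-by-line). Assumes both lists contain the same block IDs.
    """
    position = {block: i for i, block in enumerate(predicted_order)}
    pairs = list(combinations(reference_order, 2))
    if not pairs:
        return 1.0
    agree = sum(1 for a, b in pairs if position[a] < position[b])
    return agree / len(pairs)
```

For a two-column page whose correct order is left column then right (["A1", "A2", "B1", "B2"]), a system that reads straight across the page produces ["A1", "B1", "A2", "B2"] and scores 5/6, making the structural damage visible even when every character is recognized correctly.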
About the Author
Guilherme A.
Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.