As artificial intelligence systems become increasingly integrated into business and research applications, their ability to accurately answer questions from complex documents remains a critical challenge. A new comprehensive dataset called ChiMDQA now provides the first standardized way to evaluate how well AI models handle multi-domain, long-form Chinese documents across academic, financial, legal, medical, educational, and news contexts. This benchmark reveals significant performance gaps that could impact real-world applications from financial analysis to medical research.
The researchers developed ChiMDQA, a dataset containing 6,068 carefully curated question-answer pairs drawn from lengthy PDF documents across six major domains. Unlike previous datasets that focused on single topics or short texts, ChiMDQA spans academic papers, financial reports, legal documents, medical guidelines, educational materials, and news articles, with documents averaging 230 pages in length. The dataset systematically categorizes questions into ten distinct types, ranging from simple fact retrieval to complex reasoning requiring integration of multiple information sources.
To construct the dataset, the team collected approximately 15,000 multilingual PDF documents and filtered them to include only high-resolution, non-scanned Chinese files published within the last five years. They employed a multi-stage pipeline involving document collection, question-answer pair generation using large language models, and rigorous validation through both automated screening and human review. The validation process included multi-model collaborative pre-screening where three different AI models independently assessed each question-answer pair, sensitivity screening to test robustness, and a five-stage human review process involving domain experts.
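The multi-model pre-screening step can be pictured as a simple vote. The sketch below is illustrative only: the article does not say how the three models' judgments are combined, so the majority rule (`min_votes=2`), the `prescreen` function, and the toy judges are all assumptions standing in for real model calls.

```python
from typing import Callable, Sequence

# A judge receives (question, answer) and returns True if it accepts the pair.
# In the study the judges are three AI models; here they are plain functions.
Judge = Callable[[str, str], bool]


def prescreen(question: str, answer: str, judges: Sequence[Judge],
              min_votes: int = 2) -> bool:
    """Keep a pair only if at least `min_votes` judges accept it.

    The majority threshold is an assumption, not the paper's stated rule.
    """
    votes = sum(1 for judge in judges if judge(question, answer))
    return votes >= min_votes


# Hypothetical stand-ins for the three models (not the paper's checks).
def has_answer(question: str, answer: str) -> bool:
    return bool(answer.strip())


def is_question(question: str, answer: str) -> bool:
    return question.strip().endswith(("?", "？"))


def is_concise(question: str, answer: str) -> bool:
    return len(answer) <= 200


TOY_JUDGES = [has_answer, is_question, is_concise]

if __name__ == "__main__":
    print(prescreen("合同何时生效？", "自签署之日起生效。", TOY_JUDGES))
```

In a real pipeline each judge would wrap a call to a different language model, and pairs that fail the vote would be dropped or routed to human review.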
Experiments on eight leading AI models revealed significant performance variations. GPT-4o achieved the highest overall scores, with a Correct Given Attempted rate of 76.5% and a BERTScore-F1 of 81.2, indicating strong factual accuracy and semantic alignment with the reference answers. However, models showed substantial domain-specific weaknesses. In the financial domain, for example, GPT-4o's BERTScore-F1 dropped to 56.3, while Qwen-Plus achieved 84.7 in the same category. All models exhibited high perplexity scores, with Doubao-Pro-128k reaching 53.1, suggesting uncertainty when generating responses to the diverse question types.
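The article does not define the Correct Given Attempted rate. A common definition, assumed here, is the share of correct answers among the questions a model actually tried to answer, with refusals and "I don't know" responses excluded. The grade labels and the function name below are illustrative, not taken from the paper.

```python
from typing import Sequence


def correct_given_attempted(grades: Sequence[str]) -> float:
    """Fraction of attempted answers that were graded correct.

    Assumes each grade is one of "correct", "incorrect", or "not_attempted";
    unanswered questions are left out of the denominator.
    """
    correct = sum(1 for g in grades if g == "correct")
    incorrect = sum(1 for g in grades if g == "incorrect")
    attempted = correct + incorrect
    if attempted == 0:
        return 0.0  # nothing attempted, so the rate is reported as zero
    return correct / attempted


if __name__ == "__main__":
    sample = ["correct", "correct", "incorrect", "not_attempted"]
    print(f"{correct_given_attempted(sample):.3f}")
```

Under this definition a model can raise its rate by declining questions it is unsure about, which is why the metric is usually read alongside overall accuracy.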
Adding retrieval-augmented generation (RAG) produced measurable improvements, increasing F1-Score by an average of 4.6% across models and reducing perplexity by up to 81.2%. However, even with RAG, no model surpassed 40% F1-Score on retrieval-based metrics, and hallucination rates remained above 20% for most systems, indicating persistent difficulty in generating accurate, context-appropriate responses.
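For readers unfamiliar with RAG, the idea is to retrieve the passages most relevant to a question and place them in the model's prompt. The sketch below is a toy version under stated assumptions: it ranks passages by shared characters, whereas real systems typically use embedding-based search, and none of the function names or the prompt layout come from the paper.

```python
from typing import List


def overlap_score(query: str, passage: str) -> int:
    """Count distinct characters shared by the query and the passage.

    Character overlap is a crude stand-in for a real relevance model.
    """
    return len(set(query) & set(passage))


def retrieve(query: str, passages: List[str], k: int = 2) -> List[str]:
    """Return the k passages with the highest overlap score."""
    ranked = sorted(passages, key=lambda p: overlap_score(query, p), reverse=True)
    return ranked[:k]


def build_prompt(query: str, passages: List[str]) -> str:
    """Place the retrieved passages ahead of the question (hypothetical layout)."""
    context = "\n".join(passages)
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"


if __name__ == "__main__":
    docs = ["净利润同比增长12%", "公司总部位于上海", "天气晴朗"]
    question = "净利润增长多少"
    print(build_prompt(question, retrieve(question, docs, k=1)))
```

The generated prompt would then be sent to a language model, which is the step where the reported gains and the remaining hallucinations both arise.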
For businesses and researchers relying on AI for document analysis, these findings highlight critical limitations in current systems. The performance gaps across domains suggest that AI models trained for general purposes may struggle with specialized content like financial reports or legal documents, potentially leading to inaccurate analyses in high-stakes applications. The dataset's fine-grained evaluation system, which includes 21 different metrics, provides a comprehensive framework for identifying and addressing these weaknesses.
The study acknowledges several limitations, including the dataset's current focus on Chinese-language documents and a 3% error rate identified during validation. The researchers note that while RAG improves performance, it does not fully resolve the challenges posed by complex reasoning questions or domain-specific knowledge gaps. Future work will expand the dataset to additional domains such as engineering and environmental science while maintaining the rigorous validation standards established in the current version.
About the Author
Guilherme A.
Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.