AI Mines Real Questions from Textbooks to Train Smarter Models

A new automated pipeline extracts high-quality question-answer pairs from educational PDFs, offering a low-cost, low-hallucination alternative to synthetic data for improving AI reasoning.

AI Research
March 27, 2026
4 min read

As large language models grow more advanced, their performance increasingly hinges on access to high-quality training data. Traditional approaches often rely on costly human annotation or on synthetic data generated by AI, which can introduce errors and lack diversity. Now, researchers have developed an automated system that taps into a vast, underutilized resource: textbooks and educational materials. It extracts genuine, human-authored questions and answers to create better supervision for AI training.

The key finding is that this pipeline, called FlipVQA-Miner, can accurately extract both textual and visual question-answer (VQA) pairs from complex PDF documents with minimal noise. The researchers demonstrated that the system handles diverse structural layouts, including interleaved sequences where questions and answers alternate closely, long-distance pairs separated across pages or even different books, and multi-column layouts in languages such as Chinese. In experiments, the pipeline achieved F1 scores above 0.98 for text extraction and up to 0.9886 for image placement, indicating high precision and recall in transforming raw educational content into AI-ready data.
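For readers unfamiliar with the metric, F1 is the harmonic mean of precision and recall over the extracted items. A minimal sketch of how such a score could be computed for extraction quality (the item representation here is an illustrative assumption, not the paper's exact evaluation protocol):

```python
def precision_recall_f1(predicted, gold):
    """Compute precision, recall, and F1 over sets of extracted items.

    `predicted` and `gold` are sets of hashable items (e.g. normalized
    QA-pair identifiers); an item counts as correct if it is in both.
    """
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Toy example: 99 of 100 gold pairs recovered, plus 1 spurious extraction.
gold = {f"qa_{i}" for i in range(100)}
predicted = (gold - {"qa_0"}) | {"qa_spurious"}
p, r, f1 = precision_recall_f1(predicted, gold)
print(round(f1, 4))  # → 0.99
```

Scores near 0.99, as reported in the paper, mean that both spurious extractions and missed pairs are rare.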

The methodology combines two main components: a vision-language model for structural parsing and a large language model for semantic reconstruction. First, MinerU 2.5, a document parsing toolkit, converts PDF files into a structured format, preserving layout information, text blocks, tables, and images with spatial metadata. This step breaks documents down into fine-grained content blocks, providing a clean foundation for further processing. Then an LLM, specifically Gemini-2.5-pro, performs three reasoning operations on these blocks: grouping fragmented text into coherent questions or answers, pairing questions with their corresponding answers using hierarchical metadata, and inserting relevant images near associated content to form multimodal VQA pairs. By operating on block identifiers instead of raw text, this approach reduces computational costs and improves accuracy compared to prompting LLMs directly on unstructured PDFs.
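The flow described above can be sketched in a few lines. Note that `Block`, `mine_vqa_pairs`, and the three `llm_*` callables are hypothetical stand-ins invented for illustration; the real MinerU 2.5 and Gemini-2.5-pro APIs differ:

```python
from dataclasses import dataclass

@dataclass
class Block:
    """One fine-grained content block produced by the PDF parsing stage."""
    block_id: str   # stable identifier the LLM references instead of raw text
    kind: str       # "text" | "image" | "table"
    page: int
    content: str    # text content, or an image path for image blocks

def mine_vqa_pairs(blocks, llm_group, llm_pair, llm_place_images):
    """Assemble VQA pairs from parsed blocks via three LLM operations.

    Each `llm_*` callable receives block identifiers (not raw text),
    mirroring the paper's cost-saving design, and returns structured ids:
    grouping fragments, pairing questions with answers, placing images.
    """
    by_id = {b.block_id: b for b in blocks}
    # 1. Group fragmented text blocks into coherent questions/answers.
    groups = llm_group([b.block_id for b in blocks if b.kind == "text"])
    # 2. Pair each question group with its corresponding answer group.
    qa_pairs = llm_pair(groups)
    # 3. Attach nearby images to form multimodal VQA instances.
    image_ids = [b.block_id for b in blocks if b.kind == "image"]
    placed = llm_place_images(qa_pairs, image_ids)
    # Resolve identifiers back to content only at the very end.
    return [
        {
            "question": " ".join(by_id[i].content for i in q_ids),
            "answer": " ".join(by_id[i].content for i in a_ids),
            "images": [by_id[i].content for i in img_ids],
        }
        for q_ids, a_ids, img_ids in placed
    ]
```

The design choice worth noting is the last step: raw text and images are resolved from identifiers only after all reasoning is done, which keeps the LLM prompts short and cheap.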

Results from three representative documents show the pipeline's robustness across different structural patterns. For a complex analysis solution manual with interleaved QA pairs, the system achieved a text F1 score of 0.9847 and a vision F1 score of 0.9720. In a long-distance case from an abstract algebra textbook, where questions and answers were separated across pages, it scored 0.9866 for text and 0.9615 for vision. Even on a Chinese middle-school math exercise book featuring multi-column layouts, it maintained high performance with a text F1 of 0.9933 and a vision F1 of 0.9886. Manual evaluation confirmed that extracted QA pairs were complete, correctly ordered, and free from hallucinated content, and that images were accurately localized relative to their textual components. A qualitative example in Figure 2 illustrates the system's ability to resolve long-range dependencies, successfully assembling a question, figure, and answer from different pages and materials into a single structured VQA instance.

The implications of this work are significant for scaling up AI training with authentic human knowledge. By automating the extraction of high-quality QA pairs from educational documents, the pipeline offers a practical alternative to synthetic data generation, which often suffers from hallucination and stylistic uniformity. This could reduce the marginal cost of obtaining reliable supervision for supervised fine-tuning and reinforcement learning, addressing a bottleneck in model improvement. Moreover, the open-sourced code and data-processing pipelines enable broader adoption, potentially supporting the creation of curriculum-aligned benchmarks and large-scale training datasets for reasoning-oriented tasks. The researchers note that such authentic data may enhance factuality and diversity in AI models, particularly in domains where synthetic data remains unreliable.

Despite its strengths, the pipeline has limitations. The paper acknowledges that minor OCR imperfections from MinerU may persist, though their impact on QA pairs is negligible for training purposes. Additionally, the current evaluation focuses on high-level structural errors, ignoring minor text-level parsing issues like formula errors, which could affect precision in some contexts. Future work includes exploring how extracted educational materials can support broader evaluation of reasoning models, investigating their use in building training datasets, and developing more principled data curation strategies. The researchers also outline a detailed plan for benchmark curation, involving answer refinement, question-type classification, and filtering to ensure verifiability and difficulty balance, though this remains an area for further development.
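The curation plan outlined above could be prototyped as a simple filtering pass. Everything here is an illustrative assumption: the rule-based classifier is a crude placeholder for whatever question-type classification the authors would actually use, and the function names are invented:

```python
import re

def classify_question(question: str) -> str:
    """Crude rule-based question-type classifier (placeholder for an LLM)."""
    if re.search(r"\bprove\b|\bshow that\b", question, re.I):
        return "proof"
    if re.search(r"\(A\)", question) and re.search(r"\(B\)", question):
        return "multiple_choice"
    return "open_ended"

def filter_verifiable(qa_pairs):
    """Keep only QA pairs whose answers admit automatic verification."""
    kept = []
    for qa in qa_pairs:
        qtype = classify_question(qa["question"])
        # Proof questions lack a machine-checkable final answer; drop them.
        if qtype == "proof":
            continue
        kept.append({**qa, "type": qtype})
    return kept
```

A real implementation would additionally refine answers into final-form targets and balance the retained set by difficulty, as the paper's plan describes.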

Original Source

Read the complete research paper

View on arXiv

About the Author

Guilherme A.

Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.

Connect on LinkedIn