The quality of data used to train large language models has become a critical factor in their performance, yet a fundamental step in data preparation has been largely overlooked. Most AI research focuses on filtering and deduplicating web data, treating the initial conversion of raw HTML into usable text as a fixed, unchangeable process. However, a new study reveals that improving this extraction step can significantly enhance model capabilities, offering gains comparable to sophisticated filtering strategies. The research introduces a novel extraction method that preserves the structure and semantics of web content, leading to better-performing AI models across a range of tasks.
The researchers developed MinerU-HTML, a two-stage pipeline that uses a compact language model to extract main content from HTML documents while preserving structured elements like code blocks, mathematical formulas, and tables. Unlike traditional heuristic-based tools such as Trafilatura, which rely on text density rules and often corrupt or lose these elements, MinerU-HTML reformulates extraction as a sequence labeling problem. This approach allows it to understand semantic context, maintaining document coherence and critical technical content. On a newly created benchmark called MainWebBench, comprising 7,887 annotated web pages, MinerU-HTML achieved a ROUGE-N F1 score of 81.82%, substantially outperforming Trafilatura's 63.58%. For structured elements, it showed exceptional preservation, with edit similarity scores of 90.93% for code blocks and 93.99% for formulas, compared to Trafilatura's 13.05% and 61.07%, respectively.
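The edit-similarity scores above are typically defined as one minus the normalized Levenshtein (edit) distance between the extracted element and the ground-truth reference. The paper's exact metric definition is not spelled out here, so the following is a minimal sketch under that common assumption:

```python
def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance (insert/delete/substitute).
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]

def edit_similarity(extracted: str, reference: str) -> float:
    # 1 - normalized edit distance; 1.0 means the element survived intact.
    if not extracted and not reference:
        return 1.0
    return 1.0 - levenshtein(extracted, reference) / max(len(extracted), len(reference))

# A code block whose indentation and newlines were flattened by a
# density-based extractor scores much lower than a faithful copy.
reference = "def f(x):\n    return x"
print(edit_similarity(reference, reference))          # perfect preservation
print(edit_similarity("def f(x): return x", reference))  # corrupted whitespace
```

Under this definition, a tool that strips the line breaks from a code block is penalized on every deleted or substituted character, which is why whitespace-insensitive heuristics score so poorly on code and formulas.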
To build MinerU-HTML, the team designed a scalable pipeline that processes HTML in three stages: pre-processing to simplify and chunk the document, classification using a 0.6-billion-parameter language model to label blocks as main content or boilerplate, and post-processing to reconstruct the cleaned output. A key innovation is constrained decoding, which ensures the model outputs only valid JSON labels without hallucinating content, guaranteeing that the extracted text is a faithful subset of the original. For efficiency at web scale, the pipeline employs template-aware optimization, clustering similar pages and applying model-derived rules to entire clusters, reducing the need for GPU inference on every document. This makes the approach inherently scalable: performance can improve with more training data and stronger base models, unlike heuristic tools, which offer limited improvement pathways.
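The three-stage flow can be sketched as follows. Here `classify_blocks` stands in for the 0.6B-parameter classifier (replaced by a trivial keyword stub), and restricting its outputs to a fixed label set mirrors the constrained-decoding guarantee: the model assigns labels but can never emit new text, so the output is always a verbatim subset of the input. All names are illustrative, not the released API:

```python
from html.parser import HTMLParser

LABELS = {"main", "boilerplate"}  # the only outputs constrained decoding permits

class BlockSplitter(HTMLParser):
    """Stage 1 (pre-processing): simplify the page into a flat list of text blocks."""
    def __init__(self):
        super().__init__()
        self.blocks = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.blocks.append(text)

def classify_blocks(blocks):
    """Stage 2 stub: a real system would query the compact LM per block.
    Constrained decoding forces every answer into LABELS, so the model
    cannot hallucinate content, only choose a label for existing text."""
    labels = []
    for block in blocks:
        label = "boilerplate" if block.lower() in {"home", "share", "login"} else "main"
        assert label in LABELS
        labels.append(label)
    return labels

def extract_main(html: str) -> str:
    """Stage 3 (post-processing): keep only 'main' blocks, preserving order."""
    splitter = BlockSplitter()
    splitter.feed(html)
    labels = classify_blocks(splitter.blocks)
    return "\n".join(b for b, lab in zip(splitter.blocks, labels) if lab == "main")

page = "<nav>Home</nav><article><h1>Title</h1><p>Body text.</p></article><a>Share</a>"
print(extract_main(page))  # prints "Title" then "Body text."
```

Template-aware scaling fits naturally on top of this design: once one page from a cluster of structurally similar pages has been labeled by the model, the resulting per-block decisions can be replayed as rules across the whole cluster without further GPU inference.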
The impact of this improved extraction was tested by constructing AICC, a 7.3-trillion-token multilingual corpus from Common Crawl using MinerU-HTML, and comparing it to a baseline corpus, TfCC, extracted with Trafilatura but otherwise identically processed. In controlled pretraining experiments, models trained on 62 billion tokens from AICC achieved an average accuracy of 50.82% across 13 benchmarks, outperforming TfCC by 1.08 percentage points. AICC also surpassed other leading datasets like RefinedWeb and FineWeb on key tasks, particularly in reading comprehension, where it scored 42.37% compared to FineWeb's 36.68%. Analysis of 10,000 document pairs showed that an AI judge preferred AICC over TfCC in 72% of cases, with MinerU-HTML preserving 1.16 times more content on average, content the judge assessed as valuable rather than noise.
These findings have significant implications for AI development, highlighting that extraction quality is as impactful as aggressive filtering for model performance. By preserving structured elements and narrative coherence, MinerU-HTML enhances the learning of contextual understanding and long-range dependencies, which is crucial for tasks like reasoning and comprehension. The method's scalability means it can evolve with advancing AI, offering a future-proof solution for data curation. The researchers have publicly released MinerU-HTML, MainWebBench, and AICC, encouraging further exploration into semantic-aware extraction methods.
Despite its advantages, the approach has limitations. It currently does not handle JavaScript-rendered content from single-page applications, which may exclude dynamic web elements. Complex table layouts with merged cells or nested structures remain challenging, as evidenced by a table TEDS score of 73.88%, though this is a substantial improvement over baselines. Additionally, the template-based scaling relies on clustering accuracy, which may not capture all structural variations. Future work could address these by integrating rendering engines, improving clustering algorithms, and validating the approach with larger models and multi-modal data. Nonetheless, the study establishes extraction as a critical, improvable component in the AI data pipeline, with direct benefits for model capabilities.
About the Author
Guilherme A.
Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.