Scientists have developed an AI-powered system that can automatically extract and organize vast amounts of data from scientific literature, addressing a critical bottleneck in materials research. This approach specifically targets two-dimensional (2D) materials, which are ultra-thin substances with unique properties that make them promising for applications like energy storage, flexible electronics, and water purification. Traditionally, valuable information about these materials—such as how they are synthesized and how they perform—is scattered across thousands of published papers, making it difficult for researchers to reuse and build upon existing knowledge. The new system, described in a recent study, uses large language models (LLMs) to mine this data at scale, creating a structured database that could dramatically speed up the development of new materials.
The researchers applied their method to about 50,000 papers on 2D materials, successfully extracting 600,200 performance records and 202,300 synthesis records. These records include details like material names, properties such as specific capacitance or band gap, and synthesis methods with specific conditions and reagents. For example, the database contains entries for materials like graphene, transition metal dichalcogenides (TMDs), and MXenes, with data on their performance in applications like supercapacitors and sensors. This massive dataset, curated from literature that previously existed only in unstructured form, provides a foundation for more efficient materials research by enabling researchers to query and analyze information that was once buried in PDFs.
To achieve this, the team built an end-to-end framework that combines LLMs with domain-specific adaptations. First, they collected literature through OpenAlex, an open index of scholarly works, and converted the PDFs to standardized text. They then used a technique called context engineering, which gives the model structured cues for identifying synthesis and property data precisely, rather than relying on simple prompts that can produce inconsistent output. Additionally, they fine-tuned a smaller model on 3,000 manually annotated examples, using a method called LoRA (low-rank adaptation) to improve its grasp of 2D materials terminology without requiring massive computational resources. This combination of fine-tuning and context engineering significantly boosted extraction accuracy, with precision rising from 64% to 91% for one model, as shown in Figure 3 of the paper.
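To make the idea of context engineering more concrete, here is a minimal Python sketch of how a structured extraction prompt might be assembled and how a model's reply might be checked. It is not the prompt or schema used in the study: the field names, the wording of the instructions, and the helper functions `build_extraction_prompt` and `parse_records` are all assumptions made for illustration, and no real model is called.

```python
import json

# Hypothetical record schema; the study's actual fields are not given in this article.
SCHEMA_FIELDS = {
    "material": "name of the 2D material, e.g. MoS2",
    "property": "measured property, e.g. band gap",
    "value": "numeric value as reported",
    "unit": "unit as reported",
}


def build_extraction_prompt(passage: str) -> str:
    """Wrap a passage in structured cues instead of a bare one-line request."""
    field_lines = "\n".join(
        f"- {name}: {desc}" for name, desc in SCHEMA_FIELDS.items()
    )
    return (
        "Task: extract property records for 2D materials from the passage.\n"
        "Return a JSON list of objects with exactly these fields:\n"
        f"{field_lines}\n"
        "Use null for anything the passage does not state.\n\n"
        f"Passage:\n{passage}\n"
    )


def parse_records(model_output: str) -> list:
    """Keep only well-formed records; drop anything that breaks the schema."""
    try:
        records = json.loads(model_output)
    except json.JSONDecodeError:
        return []
    if not isinstance(records, list):
        return []
    return [
        r for r in records
        if isinstance(r, dict) and set(r) == set(SCHEMA_FIELDS)
    ]
```

The point of the sketch is the shape of the approach: the model is told exactly which fields to return and in what format, and replies that do not fit that format are discarded rather than stored.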
The results demonstrate that this AI-driven approach not only extracts data efficiently but also manages it intelligently. The extracted records were stored in a relational database, and the team developed a multi-agent system that allows users to query the database using natural language. For instance, a researcher could ask, "How many materials have a band gap greater than zero?" and the system would generate and execute the appropriate SQL query. In tests, the system achieved nearly perfect accuracy on simple and moderately complex queries, and about 90% accuracy on complex queries involving multiple tables, as illustrated in Figure 5. This makes the data accessible even to non-experts, lowering barriers to information retrieval and enabling faster analysis for tasks like synthesis pathway design or performance benchmarking.
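As a rough illustration of the band gap question, the Python sketch below builds a tiny in-memory SQLite table and runs the kind of SQL a text-to-SQL agent might generate. The table name `property_records`, its columns, and the four rows are placeholders invented for this example; they are not the study's schema or data, and the query is written by hand rather than produced by a model.

```python
import sqlite3


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with a hypothetical schema and placeholder rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE property_records ("
        "id INTEGER PRIMARY KEY, material TEXT NOT NULL, "
        "property TEXT NOT NULL, value REAL, unit TEXT)"
    )
    # Illustrative values only, not records from the study's database.
    rows = [
        ("graphene", "band_gap", 0.0, "eV"),
        ("MoS2", "band_gap", 1.8, "eV"),
        ("h-BN", "band_gap", 6.0, "eV"),
        ("MoS2", "specific_capacitance", 100.0, "F/g"),
    ]
    conn.executemany(
        "INSERT INTO property_records (material, property, value, unit) "
        "VALUES (?, ?, ?, ?)",
        rows,
    )
    conn.commit()
    return conn


# SQL an agent might produce for:
# "How many materials have a band gap greater than zero?"
GENERATED_SQL = (
    "SELECT COUNT(DISTINCT material) FROM property_records "
    "WHERE property = 'band_gap' AND value > 0"
)


def count_nonzero_band_gap(conn: sqlite3.Connection) -> int:
    """Execute the generated query and return the single count it yields."""
    (count,) = conn.execute(GENERATED_SQL).fetchone()
    return count


print(count_nonzero_band_gap(build_demo_db()))  # prints 2 for the placeholder rows
```

In a system like the one described, the hard part is producing a correct query from the question, especially when several tables must be joined; executing it, as here, is the easy step.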
This work has significant implications for the field of materials science, as it shifts research from a fragmented, experience-driven process to a data-guided workflow. By transforming literature into structured, computable data, the system supports reproducible benchmarking and hypothesis generation, which could lead to faster discoveries of new 2D materials for technologies like batteries or sensors. However, the study notes limitations, such as the need to refine evaluation strategies to address low recall rates caused by matching constraints. Future work will also expand extraction to include tables and figures from papers, aiming for more comprehensive knowledge acquisition. Overall, this AI framework offers a scalable solution for managing scientific data, with potential applications beyond 2D materials to other emerging material systems.
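The recall limitation mentioned above can be illustrated with a small Python sketch. The study's exact matching rules are not described in this article, so the example assumes one plausible reading: a strict scorer counts an extracted record as correct only if it is character-for-character identical to the annotated one, while a relaxed scorer first normalizes case, whitespace, and numeric formatting. The functions `normalize` and `precision_recall` and the sample records are hypothetical.

```python
def normalize(record):
    """Relaxed comparison key: ignore case, stray whitespace, and number formatting."""
    material, prop, value = record
    return (material.strip().lower(), prop.strip().lower(), round(float(value), 2))


def precision_recall(predicted, gold, strict=True):
    """Score extracted records against annotated ones under strict or relaxed matching."""
    key = (lambda r: r) if strict else normalize
    pred_set = {key(r) for r in predicted}
    gold_set = {key(r) for r in gold}
    true_pos = len(pred_set & gold_set)
    precision = true_pos / len(pred_set) if pred_set else 0.0
    recall = true_pos / len(gold_set) if gold_set else 0.0
    return precision, recall


# Made-up records: the first prediction differs from the annotation only in capitalization.
gold = [("MoS2", "band gap", 1.8), ("graphene", "band gap", 0.0)]
predicted = [("MoS2", "Band gap", 1.8), ("graphene", "band gap", 0.0)]

print(precision_recall(predicted, gold, strict=True))   # (0.5, 0.5)
print(precision_recall(predicted, gold, strict=False))  # (1.0, 1.0)
```

Under this assumption, a correct extraction that differs only in surface form is scored as a miss, which pulls measured recall below what the system actually achieved.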
About the Author
Guilherme A.
Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.