As engineering fields expand with new theories and applications, sharing specialized knowledge across disciplines becomes increasingly challenging. Traditional technical documents, while rich in content, often remain inaccessible to those outside the specific domain, creating barriers to collaboration and innovation. A new approach using large language models (LLMs) offers a solution by transforming dense technical texts into interactive, machine-readable knowledge bases that maintain precision while enhancing accessibility.
The researchers developed a method that converts LaTeX documents, the format commonly used to publish scientific papers, into structured knowledge graphs. These graphs represent engineering concepts, their properties, and their relationships in a formal, explicit format that computers can process and humans can query. The key innovation is the use of LLMs to extract and formalize knowledge from technical texts semi-automatically, which significantly reduces the manual effort such tasks typically require.
The process begins by dividing LaTeX source code into small units of about five sentences each. For each unit, researchers use an extensive template (11 KB with 240 examples) to prompt an LLM to identify core concepts, items, and relationships. This generates an intermediate representation called Formalised Natural Language (FNL), which is then algorithmically converted into Python code using the PyIRK framework. This framework represents knowledge through subject-predicate-object triples, where each element receives a unique identifier for traceability. While the LLM handles most of the extraction, human reviewers still need to correct approximately 10-20% of the output, depending on the text's complexity.
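To make the first and last steps of this pipeline more concrete, here is a minimal sketch in plain Python. It does not use PyIRK or an LLM: the names `split_into_units` and `TripleStore`, the naive sentence splitter, and the `K1`, `K2`, ... key scheme are all assumptions made for illustration, not the researchers' implementation.

```python
import re
from dataclasses import dataclass, field


def split_into_units(text, sentences_per_unit=5):
    """Group the sentences of `text` into units of at most five sentences."""
    # Naive split after ., ! or ? followed by whitespace. Real LaTeX source
    # would need a splitter that respects math mode and macros.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    return [
        " ".join(sentences[i:i + sentences_per_unit])
        for i in range(0, len(sentences), sentences_per_unit)
    ]


@dataclass
class TripleStore:
    """Toy subject-predicate-object store (hypothetical, not the PyIRK API)."""
    keys: dict = field(default_factory=dict)     # label -> unique key
    triples: list = field(default_factory=list)  # (subject, predicate, object) keys

    def key_for(self, label):
        # Each distinct label gets exactly one key, so every statement that
        # mentions it can be traced back to the same element.
        if label not in self.keys:
            self.keys[label] = f"K{len(self.keys) + 1}"
        return self.keys[label]

    def add(self, subject, predicate, obj):
        triple = (self.key_for(subject), self.key_for(predicate), self.key_for(obj))
        self.triples.append(triple)
        return triple


store = TripleStore()
store.add("orthocomplement", "is a", "subspace")
store.add("PID controller", "is a", "controller")  # reuses the key of "is a"
```

In the method described above, the step between these two pieces (turning each unit into formal statements) is what the LLM prompt and the FNL representation handle; the sketch leaves that part out entirely.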
The resulting knowledge graph enables several practical applications. Most notably, the researchers created an interactive document layer where readers can hover over technical symbols to see their precise definitions and relationships. This addresses a common problem in scientific reading: concepts introduced early are assumed known later, forcing readers to flip back through pages when they encounter unfamiliar terms. In their test application to eight pages of control engineering content, the system generated approximately 700 interactive elements. For example, when a reader encounters the symbol U⊥ (orthocomplement), they can hover to see its definition without interrupting their reading flow.
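The hover idea can be approximated in a few lines. The sketch below assumes an HTML rendering of the text and a hand-written dictionary of definitions; the function name `annotate` and the sample definition are illustrative choices, and the researchers' actual document layer may work quite differently.

```python
import html
import re


def annotate(text, definitions):
    """Wrap each known symbol in a <span> whose title holds its definition."""
    if not definitions:
        return text
    # Longest symbols first, so "U⊥" would win over a shorter entry such as "U".
    # Matching is purely textual; a real system would resolve symbols via the
    # knowledge graph rather than by string search.
    symbols = sorted(definitions, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(s) for s in symbols))

    def wrap(match):
        symbol = match.group(0)
        tip = html.escape(definitions[symbol], quote=True)
        return f'<span title="{tip}">{html.escape(symbol)}</span>'

    return pattern.sub(wrap, text)


# Illustrative entry only; in the described system the definition would
# come from the formalized knowledge, not from a hand-written dictionary.
DEFINITIONS = {"U⊥": "orthocomplement of U: all vectors orthogonal to every vector in U"}
annotated = annotate("The projection onto U⊥ is linear.", DEFINITIONS)
```

Browsers show the `title` attribute as a tooltip on hover, which is enough to convey the reading experience described here without any scripting.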
This approach matters because it bridges the gap between expert knowledge and broader accessibility. Control engineering encompasses everything from simple PID controllers to complex nonlinear systems used in automotive, robotics, and building automation. Making this knowledge more accessible could accelerate innovation across these domains by enabling engineers from different specialties to understand and apply concepts outside their immediate expertise. The formal representation also allows for advanced querying, consistency checking, and integration with simulation data, none of which a static PDF document supports.
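As a rough illustration of what querying and consistency checking can mean on subject-predicate-object triples, consider the sketch below. The sample statements are invented, and the rule that some predicates allow only one object per subject is an assumption chosen for the example rather than a check taken from the paper.

```python
def query(triples, subject=None, predicate=None, obj=None):
    """Return all triples matching the pattern; None acts as a wildcard."""
    return [
        t for t in triples
        if (subject is None or t[0] == subject)
        and (predicate is None or t[1] == predicate)
        and (obj is None or t[2] == obj)
    ]


def find_conflicts(triples, single_valued_predicates):
    """Find (subject, predicate) pairs with several distinct objects,
    although the predicate is assumed to allow only one."""
    seen = {}
    for s, p, o in triples:
        if p in single_valued_predicates:
            seen.setdefault((s, p), set()).add(o)
    return {pair: objs for pair, objs in seen.items() if len(objs) > 1}


# Invented sample data; the two "has dimension" statements contradict each other.
KNOWLEDGE = [
    ("PID controller", "is a", "controller"),
    ("state space", "has dimension", "2"),
    ("state space", "has dimension", "3"),
]

controllers = query(KNOWLEDGE, predicate="is a", obj="controller")
conflicts = find_conflicts(KNOWLEDGE, {"has dimension"})
```

A contradiction like the one in the sample data is easy to miss when the two statements sit pages apart in a PDF, but it becomes a mechanical check once the statements are explicit.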
The method currently faces limitations, primarily the need for manual correction in the LLM output phase. The researchers acknowledge this as a bottleneck but are optimistic about reducing intervention through improved quality assurance measures. Additionally, the approach currently requires LaTeX source code rather than finished PDF documents, though future versions aim to process PDFs directly. The system also requires a "critical mass" of formalized knowledge to become fully useful as an assistant that can answer theoretical questions, which remains a goal for future development.
About the Author
Guilherme A.
Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.