
AI Spots Hidden IT Problems in Company Documents

A new AI method can detect early warning signs of architectural debt in unstructured business documents, helping organizations avoid costly inefficiencies before they escalate.

AI Research
April 02, 2026
4 min read

Enterprise architecture debt, the accumulation of suboptimal design choices and misaligned components in an organization's IT landscape, can silently degrade performance, increase costs, and reduce agility over time. Traditionally, identifying these issues has relied on manual methods such as workshops and expert assessments, which are time-consuming, costly, and difficult to scale with the complexity of modern IT systems. This leaves much of the architectural knowledge embedded in unstructured documents, such as process descriptions, strategy papers, and meeting notes, under-analyzed, creating a significant gap in early detection. A new study explores whether large language models (LLMs) can bridge this gap by automatically detecting enterprise architecture smells, the early indicators of potential debt, in unstructured documentation, offering a scalable solution to a longstanding challenge.

The researchers found that LLMs can effectively identify multiple predefined enterprise architecture smells in unstructured text, with a custom GPT-based model achieving high precision and fast processing speeds, while a fine-tuned on-premise model offers advantages in data protection. In the study, the custom GPT benchmark model demonstrated a precision of approximately 0.88 and a false positive rate of about 0.09, meaning it rarely flagged issues incorrectly, though it sometimes missed embedded smells, resulting in modest recall. The on-premise model, based on LLaMA-3.2-3B, showed higher sensitivity but produced more false positives, with precision around 0.26 and a false positive rate as high as 1.00 in some runs. This trade-off highlights that LLMs can surface plausible issues, but their effectiveness varies based on model choice and deployment constraints, with the GPT model excelling in accuracy and the on-premise model serving as a more cautious tool under strict data privacy requirements.
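The metrics behind these comparisons are standard confusion-matrix quantities. As a minimal sketch, the helpers below compute precision, false positive rate, and recall from raw counts; the counts used are illustrative placeholders chosen to resemble the GPT model's reported profile, not the paper's actual data.

```python
# Sketch: the metrics reported in the study, computed from raw detection
# counts. tp/fp/fn/tn values below are hypothetical, chosen only to
# illustrate a "precise but low-recall" profile like the GPT benchmark.

def precision(tp: int, fp: int) -> float:
    """Fraction of flagged smells that were truly present."""
    return tp / (tp + fp) if (tp + fp) else 0.0

def false_positive_rate(fp: int, tn: int) -> float:
    """Fraction of clean passages that were incorrectly flagged."""
    return fp / (fp + tn) if (fp + tn) else 0.0

def recall(tp: int, fn: int) -> float:
    """Fraction of embedded smells the model actually surfaced."""
    return tp / (tp + fn) if (tp + fn) else 0.0

# Hypothetical counts: high precision, modest recall.
tp, fp, fn, tn = 22, 3, 70, 30
print(f"precision={precision(tp, fp):.2f}, "
      f"FPR={false_positive_rate(fp, tn):.2f}, "
      f"recall={recall(tp, fn):.2f}")
```

The asymmetry matters in practice: a model with precision near 0.88 rarely wastes an architect's review time, while a model with a high false positive rate shifts the validation burden onto human experts.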

The methodology followed a design science research approach, in which the researchers built and evaluated an LLM-based prototype for automated smell detection. They focused on 12 business-layer enterprise architecture smells, such as 'Contradiction in Input' and 'Temporary Solution,' selected for their high textual manifestation and feasibility for detection. The prototype ingested unstructured documents in formats like .docx and .pdf, preprocessed them through tokenization and chunking, and applied fine-tuned detection models to identify smells and extract rationales. For the on-premise model, they used LLaMA-3.2-3B-Instruct, fine-tuned with a parameter-efficient technique called Low-Rank Adaptation (LoRA) on a dataset of 960 synthetic examples covering eight business domains. A custom GPT model was configured as a benchmark using the same training data and prompts, relying on in-context learning without parameter updates, to compare performance under different deployment scenarios.
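The ingestion stage described above (tokenization, chunking, per-chunk detection prompts) can be sketched roughly as follows. This is an assumed, simplified version: it uses a whitespace tokenizer and fixed-size overlapping windows, and the prompt format, chunk sizes, and smell catalog excerpt are hypothetical, not taken from the prototype.

```python
# Sketch of a preprocessing stage like the prototype's, assuming a
# simple whitespace tokenizer and fixed-size overlapping chunks. The
# real tokenizer, chunk sizes, and prompt wording are not specified here.

def chunk_document(text: str, chunk_size: int = 200, overlap: int = 20) -> list[str]:
    """Split a document into overlapping token windows so each chunk
    fits the detection model's context while preserving continuity."""
    tokens = text.split()
    if not tokens:
        return []
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break
    return chunks

def build_detection_prompt(chunk: str, smells: list[str]) -> str:
    """Assemble a per-chunk prompt asking the model to flag any listed
    smell and justify each finding (hypothetical prompt format)."""
    catalog = "\n".join(f"- {s}" for s in smells)
    return (
        "You are an enterprise architecture auditor. For the passage "
        "below, list any of these smells it exhibits, with a one-sentence "
        f"rationale per finding.\n\nSmells:\n{catalog}\n\nPassage:\n{chunk}"
    )

# Two of the 12 smells named in the study, used as a catalog excerpt.
smells = ["Contradiction in Input", "Temporary Solution"]
doc = " ".join(f"word{i}" for i in range(450))  # stand-in document
chunks = chunk_document(doc)
```

Overlapping windows are one common way to reduce the risk that a smell straddling a chunk boundary is missed, which connects directly to the recall limitations reported in the results.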

The evaluation, based on a case study with 30 synthetic yet realistic business documents from a fictional firm, showed distinct performance patterns. The custom GPT model processed documents in about 2 seconds each, achieving high precision but with recall values around 0.22 to 0.25, indicating it missed some embedded smells. In contrast, the on-premise LLaMA model took approximately 120 seconds per document and 50 minutes for a batch of 30, with lower accuracy and higher false positives, as detailed in Table 4 of the paper. Error analysis revealed recurring issues such as omission of embedded smells, misclassification, batch context leakage, and fabricated citations, with the on-premise model often over-detecting and producing generic explanations. For example, in one scenario, an Order-to-Cash description was incorrectly flagged as 'Shiny Nickel,' while the GPT model correctly identified 'Project Goals not Achieved' in another case but missed it in batch settings, illustrating problems of consistency and traceability in LLM outputs.
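One of the failure modes above, batch context leakage, occurs when findings from one document bleed into another because many documents share a single model context. A minimal mitigation, assumed here rather than taken from the paper, is to run the detector once per document with fresh state so every finding is attributable to exactly one source; the keyword-based stub detector and document names below are purely illustrative.

```python
# Sketch: avoiding batch context leakage by invoking the detector once
# per document with a fresh context, so findings cannot cross documents.
# The stub detector and file names are hypothetical examples.
from typing import Callable

def detect_per_document(documents: dict[str, str],
                        detect: Callable[[str], list[str]]) -> dict[str, list[str]]:
    """Run the detector independently on each document and key the
    findings by document ID for traceability."""
    return {doc_id: detect(text) for doc_id, text in documents.items()}

def keyword_detector(text: str) -> list[str]:
    """Trivial stand-in for an LLM call: flags 'Temporary Solution'
    when a workaround is mentioned."""
    return ["Temporary Solution"] if "workaround" in text.lower() else []

docs = {
    "order_to_cash.docx": "The team applied a temporary workaround in billing.",
    "strategy_2026.pdf": "Long-term roadmap aligned with the target architecture.",
}
findings = detect_per_document(docs, keyword_detector)
```

The trade-off is cost: per-document invocation removes cross-contamination but forfeits whatever throughput batching provided, which matters for the slow on-premise model in particular.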

The implications of this research are significant for enterprise architecture practice, as LLM-based smell detection can provide timely, data-driven insights to support proactive maintenance and cost-effective governance. By automating the analysis of unstructured documentation, organizations can scale their debt identification efforts beyond manual methods, though the study emphasizes that LLM outputs should be used as triage signals rather than definitive diagnoses, requiring expert validation. The trade-offs between data protection and performance are clear: the on-premise model suits sensitive environments with strict confidentiality needs, while the GPT model offers better accuracy and usability but may not be feasible where external APIs are prohibited. This work extends prior research focused on structured artifacts by addressing the semantic richness of business-layer text, potentially enabling earlier intervention to prevent debt accumulation and its associated costs.

Limitations of the study include the use of synthetic data, which may not fully capture the nuance and noise of real-world documentation, and the assumption that documentation is up-to-date, which might not hold in practice. The training set was limited to 12 smells with 960 examples, constraining generalization, and technical constraints like CPU-only hardware restricted the range of models explored. Additionally, accuracy assessments were manually verified by a single researcher, lacking inter-rater reliability, and batch processing introduced context leakage across documents. These factors suggest that while LLMs show promise, future work needs larger, diverse datasets, multi-rater labeling, and improved grounding techniques to enhance reliability and applicability in real enterprise settings.

Original Source

Read the complete research paper

View on arXiv

About the Author

Guilherme A.

Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.

Connect on LinkedIn