AI-Powered Data Governance Revolutionizes Enterprise ERP Systems at LinkedIn

AI Research
November 24, 2025

In the high-stakes world of enterprise resource planning (ERP), managing hundreds of thousands of employee records across multinational operations has long been a nightmare of data inconsistencies and accessibility bottlenecks. When HR analysts at companies like LinkedIn need quick answers, such as how many civil engineers are working on a specific project in Moscow, they often face multi-day delays due to decentralized manual data entry in multiple languages and the SQL expertise required for querying. This scenario, detailed in a recent study by researchers from Hagia Labs, highlights a critical enterprise challenge: data quality degradation and accessibility barriers prevent organizations from leveraging their own information effectively. The problem has two interconnected roots: inconsistent data from HR departments entering information in Turkish, Russian, and English, and the technical bottlenecks that delay routine analytics. The study presents an end-to-end pipeline combining automated data cleaning with LLM-driven SQL query generation, deployed for six months on a production system managing 240,000 employee records, offering a glimpse into how AI could transform data governance in large corporations.

To tackle these issues, the researchers designed a system architecture built on a Dockerized microservices framework, integrating components such as FastAPI for backend services, LangChain for workflow orchestration, and FAISS for vector-based retrieval. The pipeline operates in two distinct phases: an offline data preparation stage that runs every 72 hours to synchronize and clean data from Microsoft SQL Server to PostgreSQL, and an online query processing stage that handles user-initiated natural language queries in real time. During the cleaning phase, records pass through four sequential modules: translation normalization using MarianMT to convert multilingual fields to English, spelling correction with SymSpell and BERT-based methods, entity deduplication via fuzzy string matching, and validation with business rule enforcement. This automated approach achieved 97.8% accuracy in correcting inconsistencies across 240,000 records, resolving issues such as mixed-language entries (e.g., 'Moskva' to 'Moscow') and typographical errors, while reducing human intervention time by over 90% and processing all records in just 2.4 hours on a 16-core server.
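To make the four-stage ordering concrete, here is a minimal sketch using only the Python standard library. It is not the study's implementation: a small lookup table stands in for MarianMT, `difflib` stands in for SymSpell and the BERT-based correction, and the field names, reference lists, and thresholds are all hypothetical.

```python
from difflib import SequenceMatcher, get_close_matches

# Hypothetical reference data; the study uses MarianMT and SymSpell/BERT instead.
TRANSLATIONS = {"moskva": "Moscow", "москва": "Moscow"}
KNOWN_TITLES = ["Civil Engineer", "Electrical Engineer", "Project Manager"]
KNOWN_CITIES = {"Moscow", "Istanbul", "Ankara"}


def normalize_language(record):
    """Stage 1: map known non-English city spellings to English."""
    city = record["city"].strip()
    return {**record, "name": record["name"].strip(),
            "city": TRANSLATIONS.get(city.lower(), city)}


def correct_spelling(record):
    """Stage 2: snap a misspelled job title to the closest known title."""
    match = get_close_matches(record["title"], KNOWN_TITLES, n=1, cutoff=0.8)
    return {**record, "title": match[0] if match else record["title"]}


def is_duplicate(a, b, threshold=0.9):
    """Two records match if the city is equal and the names are near-identical."""
    ratio = SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()
    return a["city"] == b["city"] and ratio >= threshold


def deduplicate(records):
    """Stage 3: keep the first record of each fuzzy-matched group."""
    kept = []
    for rec in records:
        if not any(is_duplicate(rec, k) for k in kept):
            kept.append(rec)
    return kept


def is_valid(record):
    """Stage 4: illustrative business rules (non-empty name, known city)."""
    return bool(record["name"]) and record["city"] in KNOWN_CITIES


def clean(records):
    """Run the four stages in order; return (valid, rejected) records."""
    staged = [correct_spelling(normalize_language(r)) for r in records]
    staged = deduplicate(staged)
    valid = [r for r in staged if is_valid(r)]
    rejected = [r for r in staged if not is_valid(r)]
    return valid, rejected


RAW = [
    {"name": "Ivan Petrov", "city": "Moskva", "title": "Civil Enginer"},
    {"name": "Ivan Petrov ", "city": "Москва", "title": "Civil Engineer"},
    {"name": "Ayse Yilmaz", "city": "Ankara", "title": "Project Manger"},
    {"name": "", "city": "Ankara", "title": "Project Manager"},
]
valid, rejected = clean(RAW)
for row in valid:
    print(row)
```

Running translation and spelling correction before deduplication matters in this sketch: the two 'Ivan Petrov' rows only become comparable once 'Moskva' and 'Москва' have both been mapped to 'Moscow'.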

The core of the innovation lies in the LLM-based SQL generation engine, which uses GPT-4o within a retrieval-augmented generation (RAG) framework to translate natural language questions into validated SQL queries. This system addresses key enterprise challenges, such as implicit business logic (undocumented rules, like filtering for active employees, must be encoded explicitly in prompts) and multilingual query understanding, handling inputs in Turkish, Russian, and English with minimal performance degradation. By leveraging a FAISS index of over 500 validated question-SQL pairs, the model employs few-shot learning to dynamically inject relevant examples into the prompt, boosting query validity from 76.4% in zero-shot scenarios to 92.5%. The workflow involves preprocessing queries for translation and normalization, retrieving the top-k similar examples, generating SQL with schema constraints, validating syntax and safety, executing on PostgreSQL, and translating the result back into the user's language, all while logging interactions in MongoDB for continuous improvement and audit trails.
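The retrieval, prompt-assembly, and safety-check steps of that workflow can be sketched as follows. This is an illustration under stated assumptions rather than the study's code: a bag-of-words cosine similarity stands in for the FAISS embedding index, no LLM is called, and the `employees` schema, the `is_active` rule, and the example pairs are hypothetical.

```python
import math
import re
from collections import Counter

# Hypothetical validated question-SQL pairs (the study indexes 500+ in FAISS).
EXAMPLES = [
    ("How many civil engineers work in Moscow?",
     "SELECT COUNT(*) FROM employees WHERE title = 'Civil Engineer' "
     "AND city = 'Moscow' AND is_active;"),
    ("List all project managers in Ankara",
     "SELECT name FROM employees WHERE title = 'Project Manager' "
     "AND city = 'Ankara' AND is_active;"),
    ("What is the average salary by department?",
     "SELECT department, AVG(salary) FROM employees "
     "WHERE is_active GROUP BY department;"),
]
SCHEMA = "employees(id, name, title, city, department, salary, is_active)"


def embed(text):
    """Toy embedding: word counts. A real system would use dense vectors."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def retrieve(question, k=2):
    """Return the k stored pairs whose question is most similar."""
    q = embed(question)
    ranked = sorted(EXAMPLES, key=lambda ex: cosine(q, embed(ex[0])),
                    reverse=True)
    return ranked[:k]


def build_prompt(question, k=2):
    """Assemble schema, an explicit business rule, and few-shot examples."""
    shots = "\n\n".join(f"Q: {q}\nSQL: {s}" for q, s in retrieve(question, k))
    return (f"Schema: {SCHEMA}\n"
            "Rule: only include active employees (is_active).\n\n"
            f"{shots}\n\nQ: {question}\nSQL:")


FORBIDDEN = re.compile(
    r"\b(insert|update|delete|drop|alter|truncate|grant|create)\b", re.I)


def is_safe_select(sql):
    """Conservative check: one statement, starts with SELECT, no write keywords."""
    body = sql.strip().rstrip(";")
    if ";" in body:  # more than one statement
        return False
    return body.lower().startswith("select") and not FORBIDDEN.search(body)


prompt = build_prompt("How many electrical engineers work in Moscow?")
print(prompt)
```

The safety check is deliberately strict: it would also reject a harmless query whose string literal contains a word like 'update'. A production validator would more likely parse the SQL and run it under a read-only database role.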

Evaluation results from the six-month production deployment are striking: the system processed 2,847 real user queries and achieved 92.5% query validity, 95.1% schema compliance, and 90.7% semantic accuracy. Median end-to-end latency dropped to 3.6 seconds, compared to a pre-deployment average of 2.3 days for manual SQL writing, representing a 99.1% reduction in turnaround time. User satisfaction scored 4.3 out of 5.0, with high marks for multilingual support and query speed, though limitations were noted in handling ambiguous questions and complex multi-table joins. Cost analysis revealed an average of $0.042 per query, with GPT-4o offering 46% lower latency and 68% cost reduction versus GPT-3.5, while system uptime remained at 99.2%. An ablation study confirmed the critical role of each component: removing few-shot retrieval caused a 15.4% drop in validity, omitting translation preprocessing led to a 23.6% accuracy decrease on non-English queries, and skipping data cleaning reduced schema compliance to 81.3%, underscoring the necessity of an integrated approach.

The implications of this research extend beyond technical performance, suggesting a paradigm shift in how enterprises like LinkedIn could manage data governance and analytics. By democratizing data access through natural language interfaces, the system empowers non-technical users in HR and project management to run queries independently, reducing reliance on IT staff and fostering more agile decision-making. However, the study also highlights limitations, such as prompt fragility requiring careful version control, challenges in adapting to schema evolution, and gaps in explainability that could affect user trust. Future work aims to enhance multi-table join support, explore fine-tuning of open-source models, and integrate conversational refinements. Ultimately, this pipeline demonstrates that AI-driven solutions can bridge human language and structured data at enterprise scale, paving the way for more intelligent and accessible ERP systems in the tech industry.

About the Author

Guilherme A.

Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.