AI Can't Tell Data Stories Accurately

When you ask an AI to explain database results in plain English, it often gets the story wrong: it misses key details, invents facts, or produces confusing narratives. This fundamental flaw in how AI systems communicate data to humans has remained largely unexamined until now. A new study reveals that even advanced language models struggle to accurately translate tabular data into natural language, with error rates increasing dramatically as result sets grow larger.

The researchers discovered that AI-generated natural language representations (NLRs) of database outputs contain significant inaccuracies that undermine their practical utility. When transforming database query results into human-readable narratives, models frequently omit critical information, hallucinate non-existent data, misinterpret ambiguous queries, and produce formatting inconsistencies that reduce readability. The study found that performance degrades substantially as result-set size increases, with accuracy dropping by 10-30% when moving from small result sets (fewer than 10 data points) to larger ones (10 or more data points).

The team developed a novel evaluation framework called Combo-Eval that combines traditional metrics with AI-based judgment to assess the quality of these data narratives. This hybrid approach uses simple text similarity measures like ROUGE scores for straightforward cases and only calls upon more computationally expensive language model evaluation when needed. The method was tested across 11 different domains including finance, sports, education, and gaming, using a newly created dataset called NLR-BIRD containing 1,468 samples of database queries and their corresponding human-verified natural language explanations.
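The routing idea described above can be sketched in a few lines of Python. This is an illustration only, not the study's implementation: the unigram-overlap score is a simplified stand-in for ROUGE, the `low` and `high` thresholds are hypothetical values chosen for the example, and `judge` is a placeholder for whatever language model call a real system would make.

```python
import re
from collections import Counter
from typing import Callable, Tuple


def rouge1_f1(candidate: str, reference: str) -> float:
    """Unigram-overlap F1, a simplified stand-in for a ROUGE-1 score."""
    cand = Counter(re.findall(r"\w+", candidate.lower()))
    ref = Counter(re.findall(r"\w+", reference.lower()))
    overlap = sum((cand & ref).values())  # clipped count of shared tokens
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def combo_eval(
    candidate: str,
    reference: str,
    judge: Callable[[str, str], bool],
    low: float = 0.3,   # hypothetical threshold: below this, reject outright
    high: float = 0.8,  # hypothetical threshold: above this, accept outright
) -> Tuple[bool, str]:
    """Return (verdict, decided_by), calling the judge only when the metric is unsure."""
    score = rouge1_f1(candidate, reference)
    if score >= high:
        return True, "metric"
    if score <= low:
        return False, "metric"
    # Ambiguous middle band: fall back to the more expensive judge.
    return judge(candidate, reference), "judge"


if __name__ == "__main__":
    def stub_judge(candidate: str, reference: str) -> bool:
        # Placeholder for a language model call; always rejects in this demo.
        return False

    reference = "Alice scored 30 points and Bob scored 12 points"
    print(combo_eval(reference, reference, stub_judge))
    print(combo_eval("Alice scored 30 points", reference, stub_judge))
    print(combo_eval("No rows were returned", reference, stub_judge))
```

In this sketch, only the second call reaches the judge, since the partial narrative falls between the two thresholds. The cost saving comes from how many samples the cheap metric can settle on its own, so the choice of thresholds trades evaluation cost against the risk of a wrong metric-only verdict.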

Analysis of the results shows that Combo-Eval achieves 80.88% accuracy in identifying correct data narratives when ground truth references are available, outperforming both metric-only approaches (69.73%) and pure AI-judge methods (75.82%). More importantly, this improved performance comes with significantly reduced computational costs, cutting the number of expensive AI evaluations by 24-61% depending on the scenario. The framework proved particularly effective at identifying incorrect narratives, which is crucial for real-world applications where data accuracy matters.

The implications extend far beyond academic interest. As businesses increasingly rely on AI-powered chatbots and data interfaces, the ability to accurately communicate database results becomes essential for decision-making. A financial analyst needing to understand transaction patterns or a healthcare professional interpreting patient data cannot afford AI-generated summaries that miss critical details or invent information. The study's findings highlight a critical gap in current AI capabilities that could impact everything from customer service chatbots to business intelligence tools.

However, the researchers acknowledge several limitations. The dataset, while comprehensive, may not capture the full diversity of industrial applications, and the evaluation framework focuses on correctness rather than other important qualities like conciseness or stylistic appropriateness. The study also doesn't address how to improve the underlying AI models' narration abilities; it only provides better tools for evaluating their current performance.

The work establishes a foundation for systematically assessing how well AI systems can bridge the gap between structured data and human understanding. As organizations increasingly deploy AI for data communication tasks, having reliable evaluation methods becomes essential for ensuring these systems provide accurate, trustworthy information rather than misleading narratives.

About the Author

Guilherme A.