
AI Still Struggles to Write Accurate Financial Reports

A new benchmark reveals that large language models often produce errors in financial analysis, with models fine-tuned for finance performing worse than general ones, highlighting risks for automated investing tools.

AI Research
March 27, 2026
3 min read

Large language models are increasingly being used to generate financial research reports, moving from simple writing aids to primary content producers in institutions like Union Bank of Switzerland and Citadel. However, these AI systems often produce factual errors, numerical inconsistencies, fabricated references, and shallow analysis, which can distort assessments of corporate fundamentals and lead to severe economic losses. Real-world examples, such as government reports by Deloitte containing fabricated references, underscore the risks, especially in finance where precision is critical. Yet, existing benchmarks mainly test reading comprehension rather than the ability to create reliable reports, leaving a gap in evaluating the full analytical process needed for financial decision-making.

Researchers from Tongji University and the Shanghai Artificial Intelligence Laboratory have introduced FinReasoning, a benchmark that assesses AI models on three key stages of financial report generation: semantic consistency, data alignment, and deep insight. The benchmark evaluates 19 large language models, including general models like GPT-5 and finance-specific ones like Fin-R1, using a dataset of 4,800 samples built from real financial materials such as research reports and market data. Results show that no model excels across all areas: Doubao-Seed-1.8, GPT-5, and Kimi-K2 rank as the top three overall, but each displays distinct strengths and weaknesses, while finance-fine-tuned models consistently underperform.

The methodology behind FinReasoning breaks financial analysis down into hierarchical tasks that mirror real analyst workflows. For semantic consistency, models are tested on their ability to identify and correct errors in long financial texts, such as terminology misuse or logical breaks, using nine error-injection mechanisms. Data alignment tasks require models to verify numerical statements, perform calculations, and apply rule-based reasoning over a structured database of A-share market data from 2023 to 2025. Deep insight tasks assess open-ended analysis, where models must generate research-grade insights with causal reasoning and structured arguments, evaluated through a 12-indicator rubric covering dimensions like justification depth and factual grounding.
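To make these two test types concrete, here is a minimal sketch of what an error-injection step and a data-alignment check could look like. The function names, the term-swap mechanism, and the tolerance check are illustrative assumptions, not the paper's actual implementation.

```python
# Hedged sketch: toy versions of two FinReasoning-style checks.
# All names and mechanics here are assumptions for illustration.

def inject_terminology_error(text: str, swaps: dict[str, str]) -> tuple[str, list[str]]:
    """Swap correct financial terms for plausible-but-wrong ones,
    a simplified stand-in for one of the nine error-injection mechanisms."""
    injected = []
    for correct, wrong in swaps.items():
        if correct in text:
            text = text.replace(correct, wrong)
            injected.append(f"{correct} -> {wrong}")
    return text, injected

def verify_numeric_claim(claimed: float, computed: float, tol: float = 1e-6) -> bool:
    """Data-alignment check: does a figure stated in the report match
    the value recomputed from the structured database?"""
    return abs(claimed - computed) <= tol

# Semantic-consistency task: a model would have to spot and fix the swap.
clean = "Net profit margin equals net income divided by revenue."
noisy, log = inject_terminology_error(clean, {"net income": "operating income"})
print(noisy)
print(log)

# Data-alignment task: recompute net profit margin from raw figures.
net_income, revenue = 12.5, 250.0
print(verify_numeric_claim(0.05, net_income / revenue))  # True: 12.5 / 250.0 == 0.05
```

In the actual benchmark the injected errors span fact, terminology, and logic categories, and the calculation tasks draw on the A-share database rather than inline constants; this sketch only shows the shape of the evaluation loop.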

Analysis of the results reveals significant gaps in model capabilities. In semantic consistency, models scored an average of 44.8, with fact-type errors proving easier to detect than terminology or logic errors; models also struggled to correct errors once found, showing a 20.2-point gap between explaining and fixing terminology issues. In data alignment, the average score was 62.1, with performance declining from verification to calculation to reasoning tasks, and financial models like DianJin-R1-7B suffering an 89.5% drop in success rates on complex reasoning. For deep insight, the average score was 72.5, but only a few models excelled at causal reasoning, indicating that most AI systems lack the multi-step analytical planning needed for high-quality financial analysis.

The implications of these findings are substantial for the financial industry, where automated report generation could produce unreliable investment advice if models cannot maintain accuracy and coherence. The benchmark highlights that current financial fine-tuning often targets surface-level knowledge rather than structured reasoning, suggesting that future development should prioritize hallucination correction and database-grounded operations. However, limitations include the benchmark's focus on Chinese data and controlled database interactions, which may not fully capture real-world financial workflows or the propagation of errors into downstream analysis, pointing to the need for more comprehensive testing environments.

Original Source

Read the complete research paper

View on arXiv

About the Author

Guilherme A.


Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.

Connect on LinkedIn