AI Reads ESG Reports with 93% Accuracy for Investors

TL;DR

A new AI system parses complex sustainability reports and turns messy documents into structured data for investors and regulators.

Environmental, Social, and Governance (ESG) reports are becoming mandatory for companies worldwide, serving as critical tools for investors and regulators to assess sustainability. However, these documents are often lengthy, visually dense PDFs with chaotic layouts and implicit hierarchies, making them difficult to analyze at scale. This complexity has forced financial research to rely on indirect proxies like simple disclosure indicators or third-party ratings, bypassing the rich semantics within the reports themselves. The new system, Pharos-ESG, addresses this by transforming ESG reports into structured representations through multimodal parsing, contextual narration, and hierarchical labeling, enabling accurate and large-scale understanding.

Pharos-ESG achieves this by integrating four core components: reading-order modeling based on layout flow, structure reconstruction guided by table-of-contents anchors, contextual transformation of visual elements into natural language, and multi-level labeling across ESG topics, GRI indicators, and sentiment. The reading-order modeling uses a Relation-Aware Transformer to predict succession between content blocks, constructing a directed graph that ensures a globally consistent order. For hierarchy reconstruction, it employs a ToC-centered framework with Region-Aware Prompting to parse diverse layouts and an alignment module called ALIGN that matches headings to document body content through exact, fuzzy, and context-aware matching. Visual elements are converted into text via a two-stage pipeline that aggregates images with surrounding content and generates descriptions using models like Qwen2.5-VL-Instruct, preserving semantic continuity.
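To make the heading-matching idea concrete, here is a minimal sketch of an exact-then-fuzzy cascade. It is not the paper's ALIGN module: the function names, the 0.8 threshold, and the use of Python's difflib are assumptions for illustration, and the context-aware stage is left out entirely.

```python
from difflib import SequenceMatcher


def normalize(title):
    # Lowercase and collapse whitespace so trivial formatting differences
    # do not block an exact match.
    return " ".join(title.lower().split())


def align_headings(toc_titles, body_blocks, fuzzy_threshold=0.8):
    """Map each ToC title to the index of a body block, or None if unmatched.

    Hypothetical helper: the threshold value is an assumption, not a number
    taken from the paper.
    """
    normalized_blocks = [normalize(block) for block in body_blocks]
    alignment = {}
    for title in toc_titles:
        target = normalize(title)

        # Stage 1: exact match on the normalized text.
        if target in normalized_blocks:
            alignment[title] = normalized_blocks.index(target)
            continue

        # Stage 2: fuzzy match, keeping the most similar block if it is
        # similar enough.
        best_index, best_score = None, 0.0
        for index, block in enumerate(normalized_blocks):
            score = SequenceMatcher(None, target, block).ratio()
            if score > best_score:
                best_index, best_score = index, score
        alignment[title] = best_index if best_score >= fuzzy_threshold else None
    return alignment


toc = ["Environmental Performance", "Governance Structure", "Appendix"]
body = [
    "About this report",
    "ENVIRONMENTAL  PERFORMANCE",
    "Our Governance Structures",
    "Energy use fell during the reporting year.",
]
print(align_headings(toc, body))
```

In this toy input the first title matches exactly after normalization, the second only through the fuzzy stage, and the third stays unmatched. A context-aware stage would be the natural place to resolve cases like that last one.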

Extensive experiments on expert-annotated benchmarks show that Pharos-ESG consistently outperforms both dedicated document parsers and general-purpose multimodal models. In comprehensive ESG report analysis, it achieved 92.23% precision, 95.00% recall, and an F1-score of 93.59%, surpassing Textin's 82.55% F1 and Gemini 2.5 Pro's 87.50% F1. For reading order prediction, it scored a Reading Order Kendall’s Tau of 0.92, indicating strong sequence alignment under complex layouts. In ToC-body title alignment, it reached 92.46% accuracy, far exceeding general-purpose models like GPT-4o at 64.30% and dedicated parsers below 20%. Ablation studies confirmed that each of the three components, reading-order modeling, ToC parsing with Region-Aware Prompting (RAP), and ToC-body alignment, contributes significantly to performance, with the full system achieving the highest scores.
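For readers who want to see how these metrics fit together, the sketch below recomputes the F1-score from the reported precision and recall and shows a plain Kendall's Tau on a toy block ordering. The function names and the toy blocks are illustrative assumptions, and the paper's exact Reading Order Kendall's Tau may be defined differently (for example, in how it handles missing or duplicated blocks).

```python
from itertools import combinations


def f1_score(precision, recall):
    # Harmonic mean of precision and recall.
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def kendall_tau(predicted_order, reference_order):
    """Kendall's Tau between two orderings of the same unique items."""
    rank = {block: position for position, block in enumerate(reference_order)}
    concordant = discordant = 0
    # combinations() yields pairs (a, b) where a precedes b in the prediction.
    for a, b in combinations(predicted_order, 2):
        if rank[a] < rank[b]:
            concordant += 1  # the reference agrees on this pair
        else:
            discordant += 1  # the reference reverses this pair
    total = concordant + discordant
    return (concordant - discordant) / total if total else 1.0


# Precision and recall as reported in the article, in percent.
print(round(f1_score(92.23, 95.00), 2))  # 93.59

# Toy example: one adjacent swap among four blocks.
predicted = ["title", "intro", "chart", "table"]
reference = ["title", "intro", "table", "chart"]
print(round(kendall_tau(predicted, reference), 4))  # 0.6667
```

A single swapped pair among four blocks already drops the score to about 0.67, which gives a sense of how few ordering errors a value of 0.92 allows.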

The implications of this breakthrough are substantial for financial markets, because it enables fine-grained analysis of ESG disclosures that was previously impractical. Pharos-ESG's multi-level label prediction module, MLPDH, maps content blocks to ESG categories, GRI indicators, and sentiment with an 86.32% macro-F1 score, outperforming baselines like BERT-base at 76.71%. This supports applications such as cross-document consistency checks for greenwashing detection, investor sentiment analysis, and cross-market ESG benchmarking. Additionally, the release of the Aurora-ESG dataset provides a valuable resource for integrating ESG data into financial governance and decision-making. It contains over 24,000 reports from Mainland China, Hong Kong, and U.S. markets in unified structured representations, covering more than 8 million content blocks.
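Macro-F1 averages the per-class F1 scores with equal weight, so rare labels count as much as common ones. The sketch below shows that calculation on a toy single-label example. The labels and the function name are made up for illustration, and the real module predicts several label levels at once, which this simplification ignores.

```python
def macro_f1(true_labels, predicted_labels):
    """Unweighted mean of per-class F1 over every class seen in either list."""
    classes = sorted(set(true_labels) | set(predicted_labels))
    if not classes:
        return 0.0
    pairs = list(zip(true_labels, predicted_labels))
    per_class = []
    for cls in classes:
        tp = sum(1 for t, p in pairs if t == cls and p == cls)
        fp = sum(1 for t, p in pairs if t != cls and p == cls)
        fn = sum(1 for t, p in pairs if t == cls and p != cls)
        denominator = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN), equivalent to the harmonic mean of
        # precision and recall.
        per_class.append(2 * tp / denominator if denominator else 0.0)
    return sum(per_class) / len(per_class)


# Toy content blocks labeled Environmental (E), Social (S), Governance (G).
truth = ["E", "E", "S", "G"]
guess = ["E", "S", "S", "G"]
print(round(macro_f1(truth, guess), 4))  # 0.7778
```

Here one misclassified block lowers the F1 of both the E and S classes to about 0.67 while G stays at 1.0, giving a macro average of about 0.78.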

Despite its strengths, Pharos-ESG has limitations noted in the paper. Its performance varies across markets, with slightly lower parsing F1 and ToC-body title alignment (TBTA) accuracy on Hong Kong reports than on Mainland China reports, though it excels on U.S. reports due to their more standardized formats. The system relies on expert annotations for training and evaluation, which may limit scalability if it is applied to new regions without similar data. Additionally, while it outperforms general-purpose models, computational costs and the potential for hallucinations in long documents remain concerns, as seen in the incremental testing required for models like GPT-4o due to context limits. Future work could address these by expanding dataset coverage and refining the alignment strategies for even more diverse layouts.

About the Author

Guilherme A.

Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.
