As environmental, social, and governance (ESG) factors become increasingly critical in financial decision-making, from credit assessments to portfolio management, the ability to analyze corporate sustainability reports efficiently is more important than ever. These reports, often lengthy and varied in format, present a significant challenge for both human analysts and automated systems seeking to extract comparable insights. Large language models (LLMs) offer a promising solution, but their effectiveness is hampered by issues of grounding (ensuring answers are based on actual document content) and by a lack of standardized benchmarks to evaluate performance in this specific domain. This gap makes it difficult to assess how well AI can handle the nuanced data in ESG disclosures, which mix narrative descriptions with numeric tables covering topics like emissions and diversity.
The researchers behind ESGBench, a new benchmark for explainable ESG question answering, found that current AI systems perform modestly at best when tasked with extracting information from sustainability reports. In baseline experiments using a retrieval-augmented generation (RAG) system, which combines document retrieval with language model generation, the system achieved an exact match rate of only 21%, meaning answers matched the correct text precisely in just over one-fifth of cases. String F1, a measure of token overlap that allows for partial matches, averaged 55%, while numeric accuracy, which accounts for units and tolerates a 2% margin of error, was 45%. These results underscore the difficulty AI faces in grounding answers, particularly for numeric key performance indicators (KPIs) such as Scope 1–3 emissions, renewable energy usage, and diversity metrics, where errors in formatting or unit conversion can lead to inaccuracies.
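A tolerance-based numeric metric of this kind can be sketched as follows. This is an illustrative helper, not the benchmark's actual implementation; the real metric also reconciles units, which is omitted here for brevity.

```python
import re

def numeric_match(pred: str, gold: str, rel_tol: float = 0.02) -> bool:
    """Compare the first number found in each answer within a relative
    tolerance (2%, matching the margin described in the paper).
    Hypothetical sketch: the benchmark's metric also handles units."""
    num = re.compile(r"-?\d+(?:,\d{3})*(?:\.\d+)?")
    p, g = num.search(pred), num.search(gold)
    if p is None or g is None:
        return False
    pv = float(p.group().replace(",", ""))
    gv = float(g.group().replace(",", ""))
    if gv == 0:
        return pv == 0
    return abs(pv - gv) / abs(gv) <= rel_tol

print(numeric_match("1,020 tCO2e", "1000 tCO2e"))  # within 2% -> True
print(numeric_match("1,050 tCO2e", "1000 tCO2e"))  # 5% off -> False
```

Note that a pure string comparison would reject "1,020" versus "1020" outright, which is exactly the kind of formatting brittleness the paper reports.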
To create ESGBench, the researchers developed a reproducible pipeline that starts by collecting ESG and TCFD (Task Force on Climate-related Financial Disclosures) reports from publicly available sources, focusing on large companies across sectors like technology, consumer goods, energy, finance, and manufacturing. The initial dataset includes 12 reports from 10 companies, generating 119 question-answer pairs with an average of 9.9 pairs per report, split across environmental, social, governance, strategy, and risk categories. The pipeline processes these PDFs into two views: passage chunks of 600–1200 characters for narrative content and table rows for numeric data, ensuring that evidence is traceable to specific pages. QA pairs are automatically generated using prompts that enforce verbatim answers and evidence quotes, with de-duplication to avoid redundancy, resulting in a gold dataset stored in a structured format.
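The passage-chunking step can be approximated with a simple paragraph-aware splitter. This is a minimal sketch assuming paragraph boundaries are marked by blank lines; the actual pipeline also extracts table rows as a separate view and tracks source pages.

```python
def chunk_passages(text: str, min_len: int = 600, max_len: int = 1200) -> list[str]:
    """Split narrative text into chunks of roughly 600-1200 characters,
    preferring paragraph boundaries and hard-splitting oversized runs.
    Simplified sketch of the preprocessing described in the paper."""
    chunks: list[str] = []
    current = ""
    for para in text.split("\n\n"):
        if len(current) + len(para) + 2 <= max_len:
            current = (current + "\n\n" + para).strip()
        elif len(current) >= min_len:
            chunks.append(current)
            current = para
        else:
            # Current buffer is too short but the paragraph overflows:
            # merge and hard-split at max_len.
            combined = (current + "\n\n" + para).strip()
            while len(combined) > max_len:
                chunks.append(combined[:max_len])
                combined = combined[max_len:]
            current = combined
    if current:
        chunks.append(current)
    return chunks
```

In practice each chunk would also carry metadata (company, report year, page number) so that retrieved evidence stays traceable, as the benchmark requires.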
Analysis, detailed in Table 2 of the paper, reveals significant variation in performance across question categories. Governance and strategy questions proved easier for the RAG baseline, with exact match rates of 43.5% and 90.9% respectively, likely because they involve shorter factual spans. Environmental questions, which often require extracting numeric data from tables, reached an exact match rate of 48.0% but suffered on numeric accuracy, highlighting problems with formatting, unit conversion, and scale words like "thousand." Retrieval recall@5, which measures whether the correct evidence page is among the top five retrieved, ranged from 70–80%, indicating that retrieval gaps contribute to answer errors. This emphasizes that systems must not only generate accurate answers but also retrieve relevant context effectively, since poor retrieval propagates through to the final output.
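Recall@5 itself is a simple hit-rate over retrieval runs. The sketch below uses made-up page numbers purely for illustration; nothing here comes from the paper's data.

```python
def recall_at_k(retrieved_pages: list[int], gold_page: int, k: int = 5) -> bool:
    """Hit test: is the gold evidence page among the top-k retrieved pages?"""
    return gold_page in retrieved_pages[:k]

# Toy retrieval runs (page numbers are illustrative, not from the paper).
runs = [
    ([12, 7, 3, 44, 9], 44),   # hit at rank 4
    ([2, 5, 8, 1, 30], 17),    # miss: gold page never retrieved
    ([6, 21, 13, 2, 40], 6),   # hit at rank 1
    ([3, 14, 15, 9, 26], 15),  # hit at rank 3
]
recall_at_5 = sum(recall_at_k(pages, gold) for pages, gold in runs) / len(runs)
print(recall_at_5)  # 0.75
```

A 70–80% recall@5 means that in roughly a quarter of questions the generator never even sees the right page, capping achievable answer accuracy regardless of model quality.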
The implications are substantial for the financial industry and sustainability analytics, where reliable data extraction from ESG reports is essential for informed decision-making. ESGBench provides a foundation for developing more robust AI systems that can handle the heterogeneity of sustainability disclosures, enabling faster and more comparable analysis. By offering open-source code and a standardized evaluation suite, including metrics like exact match, string F1, numeric accuracy, and retrieval recall, the benchmark encourages community efforts to improve systems for explainable and trustworthy ESG analytics. This could lead to better tools for investors, regulators, and companies seeking to assess ESG performance accurately, though the paper cautions that ESGBench does not evaluate the truthfulness of corporate claims, only the ability to recover statements from disclosures.
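Of the listed metrics, string F1 is the one that rewards partial overlap. A common SQuAD-style formulation is shown below; the benchmark's exact tokenization and normalization rules are an assumption here.

```python
from collections import Counter

def string_f1(pred: str, gold: str) -> float:
    """Token-level F1: harmonic mean of precision and recall over tokens
    shared between prediction and gold answer (SQuAD-style; the benchmark's
    precise normalization may differ)."""
    p_toks = pred.lower().split()
    g_toks = gold.lower().split()
    common = Counter(p_toks) & Counter(g_toks)  # multiset intersection
    n_common = sum(common.values())
    if n_common == 0:
        return 0.0
    precision = n_common / len(p_toks)
    recall = n_common / len(g_toks)
    return 2 * precision * recall / (precision + recall)

print(string_f1("scope 1 emissions were 1200 tCO2e", "1200 tCO2e"))  # 0.5
```

This explains why string F1 (55%) can sit well above exact match (21%): a verbose but correct answer scores partial credit instead of zero.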
Despite its contributions, ESGBench has several limitations noted in the paper. The seed corpus is biased toward large multinational companies with English-language reports, which may not represent the full diversity of ESG disclosures globally. Automatic QA generation can introduce noise, as prompts may miss context or favor easily extractable spans, though the requirement for verbatim evidence and recall metrics helps mitigate this. Tables and units present fragile failure modes, with variants like "tCO2 e" and "ktCO2 e" causing errors, and the benchmark does not address ethical governance of corporate claims. Future work could involve human validation of QA pairs, expansion to multilingual reports, alignment with evolving standards like CSRD and BRSR, and improvements in table parsing and robustness testing, aiming to enhance the benchmark's utility and coverage.
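The unit-variant failure mode described above is largely a normalization problem. A minimal sketch, assuming a small hand-written scale table (the factor names and values here are conventional, not taken from the paper):

```python
import re

# Assumed scale factors: "kt" = thousand tonnes, "Mt" = million tonnes.
UNIT_SCALE = {
    "tco2e": 1.0,
    "ktco2e": 1_000.0,
    "mtco2e": 1_000_000.0,
}

def to_tonnes_co2e(value: float, unit: str) -> float:
    """Normalize an emissions figure to tonnes CO2e, collapsing the
    spacing/case variants (e.g. 'tCO2 e', 'ktCO2 e') that the paper
    flags as fragile failure modes."""
    key = re.sub(r"[\s_\-]", "", unit).lower()
    if key not in UNIT_SCALE:
        raise ValueError(f"unrecognized unit: {unit!r}")
    return value * UNIT_SCALE[key]

print(to_tonnes_co2e(1.2, "ktCO2 e"))  # 1200.0
```

Folding such a normalizer into both the table parser and the scoring script would make "1.2 ktCO2e" and "1,200 tCO2e" compare equal rather than count as errors.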
About the Author
Guilherme A.
Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.