Large language models (LLMs) are increasingly used in applications that require analyzing lengthy documents, from legal cases to financial reports, yet their ability to handle real-world long-context tasks remains severely limited. A new benchmark, LooGLE v2, reveals that even the best-performing model achieves only 59.2% accuracy, exposing a critical gap between claimed context windows and actual performance in practical scenarios.
Researchers found that LLMs struggle with tasks that demand understanding of dependencies scattered across long texts. For instance, in the legal domain, models must retrieve specific articles or cases from documents averaging 250,000 tokens, but they often fail to connect dispersed evidence. In finance, calculating metrics like cash flow margins or comparing annual reports requires integrating data from multiple sections, yet models frequently miss key details. The benchmark, built from real-world sources in law, finance, games, and code, shows that expanding context windows alone does not guarantee better comprehension.
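To make the finance example concrete, here is a minimal sketch of a cash flow margin comparison across two annual reports. It assumes the common definition (operating cash flow divided by revenue); the benchmark may define the metric differently, and the function name and all figures below are hypothetical, chosen only for illustration.

```python
def cash_flow_margin(operating_cash_flow: float, revenue: float) -> float:
    """Operating cash flow as a fraction of revenue (assumed definition)."""
    if revenue == 0:
        raise ValueError("revenue must be non-zero")
    return operating_cash_flow / revenue


# Hypothetical figures. In a real report the two inputs usually sit in
# different sections: the cash flow statement and the income statement.
reports = {
    2022: {"operating_cash_flow": 120.0, "revenue": 800.0},
    2023: {"operating_cash_flow": 180.0, "revenue": 1000.0},
}

margins = {
    year: cash_flow_margin(r["operating_cash_flow"], r["revenue"])
    for year, r in reports.items()
}
change = margins[2023] - margins[2022]
```

The arithmetic is trivial once the inputs are in hand. The hard part for a model is locating the right four numbers across two documents that may each run to hundreds of thousands of tokens.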
The researchers automatically collected texts ranging from 16,000 to 2 million tokens and generated 1,934 question-answer instances. The tasks were designed to mimic real-world challenges, such as extracting masked legal clauses, analyzing financial trends, understanding game rules from logs, or tracing code dependencies. Models were evaluated with task-specific prompts and metrics: accuracy for multiple-choice questions and Jaccard similarity for version control tasks.
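The two metrics named above can be sketched in a few lines. This is an illustrative version, not the benchmark's own scoring code: the function names, the case-insensitive matching of answer letters, and the convention for two empty sets are all assumptions made here.

```python
from typing import Iterable, Sequence


def accuracy(predictions: Sequence[str], answers: Sequence[str]) -> float:
    """Fraction of multiple-choice predictions that match the gold answer."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        return 0.0
    # Assumption: answers are option letters, compared case-insensitively.
    correct = sum(
        p.strip().upper() == a.strip().upper()
        for p, a in zip(predictions, answers)
    )
    return correct / len(answers)


def jaccard_similarity(predicted: Iterable[str], expected: Iterable[str]) -> float:
    """Size of the intersection divided by the size of the union."""
    a, b = set(predicted), set(expected)
    if not a and not b:
        return 1.0  # convention chosen here: two empty sets count as identical
    return len(a & b) / len(a | b)
```

For a version control question whose answer is a set of items, such as changed files, Jaccard similarity gives partial credit: predicting `{"a.py", "b.py"}` against an expected `{"b.py", "c.py"}` shares one item out of three distinct ones, for a score of 1/3.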
Results indicate that performance degrades as input length increases, with models like GPT-4.1 maintaining a lead but still scoring poorly. For example, in legal article extraction, the top model achieved 69.35%, while in code-related tasks, accuracy dropped with deeper call chains, highlighting multi-hop reasoning failures. Retrieval-augmented generation methods did not consistently improve outcomes, suggesting that LLMs cannot effectively leverage localized chunks for global dependencies. Chain-of-thought prompting offered minor gains in finance but not in other domains, underscoring the complexity of long-range inference.
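A toy example shows why retrieving a few localized chunks can miss a multi-hop dependency. This sketch is not the retrieval setup used in the benchmark: it ranks chunks by simple word overlap as a stand-in for a real retriever, and the helper names and sample chunks are hypothetical.

```python
import re


def chunk_text(text: str, chunk_size: int) -> list[str]:
    """Split text into consecutive chunks of at most chunk_size words."""
    words = text.split()
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]


def tokenize(text: str) -> set[str]:
    """Lowercased set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9_]+", text.lower()))


def top_k_chunks(question: str, chunks: list[str], k: int) -> list[str]:
    """Return the k chunks sharing the most tokens with the question."""
    q = tokenize(question)
    ranked = sorted(chunks, key=lambda c: len(q & tokenize(c)), reverse=True)
    return ranked[:k]


# A three-step call chain, one function per chunk (hypothetical code).
chunks = [
    "def alpha(): return beta()",
    "def beta(): return gamma()",
    "def gamma(): return 42",
]
question = "What value does alpha return?"
retrieved = top_k_chunks(question, chunks, k=1)
```

Here the retriever picks the chunk that defines `alpha`, because it shares the most words with the question. Answering correctly also requires the definitions of `beta` and `gamma`, which look no more relevant to the question than any other chunk. The deeper the call chain, the more such hops a retriever has to get right.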
This research matters because it underscores limitations in AI systems deployed for critical tasks like legal analysis or financial forecasting, where errors could have real-world consequences. It emphasizes the need for models that genuinely understand long texts rather than merely processing them, guiding future development toward more reliable AI.
Limitations of the benchmark include its focus on only four domains and variations in task difficulty across contexts, which may affect cross-domain comparisons. Additionally, the reliance on automated curation, while scalable, might not capture all nuances of human-like reasoning in extremely long documents.
About the Author
Guilherme A.
Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.