Large language models increasingly shape how we access information, from summarizing news articles to answering medical queries. Yet their reliability remains questionable, particularly when they generate content that isn't explicitly wrong but isn't fully verifiable either. A new study reveals that even the most advanced AI systems struggle with this 'gray zone' of ambiguity, producing outputs that challenge conventional methods for detecting unfaithfulness.
Researchers developed a framework that sorts AI-generated content into seven distinct faithfulness classes. The most revealing finding is that approximately 6% of sentences generated by state-of-the-art models like GPT-5 fall into ambiguous categories, where verification depends on external knowledge or interpretation. The study also reports that about 8% of generated content falls into the 'Out-Dependent' category, which requires external world knowledge to verify and which current detection methods often miss.
The research team constructed VeriGray, a new benchmark dataset containing 2,044 sentences from 412 documents, specifically designed to test AI faithfulness in summarization tasks. Using rigorous human annotation assisted by attribution tools, they classified sentences into categories ranging from Explicitly-Supported (content directly verifiable from the source) to Out-Dependent (content requiring external knowledge) and Fabricated (completely unsupported content). The annotation process involved multiple quality assurance stages, including training phases, consistency checks, and LLM-assisted error detection.
Analysis of the VeriGray benchmark reveals critical limitations in current detection methods. When evaluated against this new standard, even the best-performing unfaithfulness detectors achieved only 75-80% accuracy, with performance dropping significantly on Out-Dependent cases. The study found that models like DeepSeek-R1 and GPT-5, although the most effective detectors tested, still struggle to identify content that requires external verification. This gap highlights a fundamental challenge: current systems lack access to the knowledge sources needed to properly evaluate ambiguous content.
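To see why a single accuracy figure can hide weakness on one category, it helps to break accuracy down by the human-assigned class. The snippet below is a minimal sketch, not the study's evaluation code: the function name, the record format, and the toy records are all assumptions made for illustration, and only the three category names mentioned in this article are used.

```python
from collections import defaultdict

# Only the three category names mentioned in the article appear here;
# the benchmark's other classes are not listed in the text, so they are omitted.
EXPLICIT = "Explicitly-Supported"
OUT_DEP = "Out-Dependent"
FABRICATED = "Fabricated"


def accuracy_by_category(records):
    """Return {gold_category: fraction of records the detector labeled correctly}.

    `records` is an iterable of (gold_category, predicted_category) pairs.
    This record format is a hypothetical one chosen for the sketch.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for gold, predicted in records:
        total[gold] += 1
        if gold == predicted:
            correct[gold] += 1
    return {category: correct[category] / total[category] for category in total}


# Made-up toy records for illustration only (not VeriGray data).
toy_records = [
    (EXPLICIT, EXPLICIT),
    (EXPLICIT, EXPLICIT),
    (OUT_DEP, EXPLICIT),   # detector misses the need for outside knowledge
    (OUT_DEP, OUT_DEP),
    (FABRICATED, FABRICATED),
]

per_category = accuracy_by_category(toy_records)
overall = sum(gold == pred for gold, pred in toy_records) / len(toy_records)

print(f"overall: {overall:.2f}")
for category, acc in per_category.items():
    print(f"{category}: {acc:.2f}")
```

In this toy data the overall accuracy looks respectable while the Out-Dependent slice is markedly lower, which is the kind of pattern the study describes.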
The implications extend beyond academic interest. In practical applications like medical summarization or financial analysis, ambiguous but potentially unfaithful content could lead to misinterpretations with real-world consequences. The research demonstrates that resolving this ambiguity is crucial for developing trustworthy AI systems, particularly as language models become more integrated into high-stakes decision-making processes.
However, the study acknowledges important limitations. The current benchmark focuses solely on summarization tasks, leaving open questions about how ambiguity manifests in other knowledge-grounded applications like question answering or data-to-text generation. Additionally, the research doesn't address how to build detectors that can access external knowledge sources effectively, which is a critical next step for improving detection accuracy.
The findings suggest that future AI development must prioritize handling ambiguous content, potentially through enhanced knowledge integration or more sophisticated verification mechanisms. As language models continue to evolve, addressing this gray zone problem will be essential for ensuring their reliability across diverse applications and user contexts.
About the Author
Guilherme A.
Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.