Large language models (LLMs) are increasingly used in high-stakes fields like medicine and law, where factual accuracy is critical. However, current methods to check if AI-generated text is true often rely on slow, expensive fact-checking that struggles with long, open-ended responses. This new research introduces a fast, low-cost way to predict when LLMs produce nonfactual content, addressing a key barrier to trustworthy AI deployment.
The key finding is that the uniformity of text embeddings, called semantic isotropy, reliably signals factual consistency. When an LLM generates multiple responses to the same prompt and those responses are factually grounded, their embeddings cluster tightly on the unit sphere. When the model hallucinates, the embeddings spread out and their angular dispersion increases. The researchers measure this dispersion with the von Neumann entropy of a cosine kernel derived from the embeddings, which yields a semantic isotropy score between 0 and 1. A lower score means less dispersion and therefore higher trustworthiness. In their experiments, the score consistently outperformed existing methods at predicting nonfactual content.
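To make the scoring step concrete, here is a minimal sketch of how an entropy-based isotropy score could be computed from a handful of response embeddings. It assumes the cosine kernel is scaled to unit trace and that the entropy is divided by log(n) to land in the 0 to 1 range; both choices, along with the function name `semantic_isotropy`, are assumptions made for illustration and may differ from the paper's exact definition.

```python
import numpy as np

def semantic_isotropy(embeddings: np.ndarray) -> float:
    """Illustrative isotropy score in [0, 1] for n response embeddings.

    Assumption: the score is the von Neumann entropy of the unit-trace
    cosine kernel, divided by log(n). The paper may normalize differently.
    """
    X = np.asarray(embeddings, dtype=float)
    n = X.shape[0]
    if n < 2:
        raise ValueError("need at least two responses to measure dispersion")

    # Project every embedding onto the unit sphere.
    X = X / np.linalg.norm(X, axis=1, keepdims=True)

    # Cosine kernel: pairwise cosine similarities, with ones on the diagonal.
    K = X @ X.T

    # Scale to unit trace so the eigenvalues behave like a probability distribution.
    rho = K / n

    # Von Neumann entropy: -sum(lambda * log(lambda)) over nonzero eigenvalues.
    eigvals = np.clip(np.linalg.eigvalsh(rho), 0.0, None)
    nonzero = eigvals[eigvals > 1e-12]
    entropy = -np.sum(nonzero * np.log(nonzero))

    return float(entropy / np.log(n))

# Toy inputs, not real model embeddings:
tight = np.ones((4, 3))   # four identical directions, no dispersion
spread = np.eye(4)        # four mutually orthogonal directions
print(semantic_isotropy(tight), semantic_isotropy(spread))
```

Under these assumptions, identical embeddings give a score of 0 and mutually orthogonal embeddings give a score of 1, which matches the reading that lower scores indicate more consistent responses.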
Methodologically, the approach is model-agnostic and requires no labeled data, fine-tuning, or hyperparameter selection. For a given prompt, the researchers sample a handful of independent responses from an LLM, embed them using any off-the-shelf encoder, and compute the isotropy score based on the embeddings' dispersion. They also developed Segment-Score, a new protocol for evaluating factuality in long-form text, which decomposes responses into atomic statements and verifies them against a reference document using an oracle LLM. This method is efficient, scaling well with response length and providing clear criteria for labeling statements as true or false.
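The Segment-Score idea of splitting a response into statements and checking each one against a reference can be sketched roughly as follows. This is not the paper's implementation: the sentence-level splitter, the `verify` callback standing in for the oracle LLM, and the choice to report the fraction of supported statements are all hypothetical simplifications.

```python
import re
from typing import Callable, List

def split_into_statements(response: str) -> List[str]:
    """Naive stand-in for atomic-statement decomposition: split on sentence ends."""
    parts = re.split(r"(?<=[.!?])\s+", response.strip())
    return [p for p in parts if p]

def segment_score(
    response: str,
    reference: str,
    verify: Callable[[str, str], bool],
) -> float:
    """Fraction of statements in `response` that `verify` marks as supported.

    `verify(statement, reference)` is a hypothetical placeholder for the
    oracle LLM call; it should return True when the reference supports
    the statement.
    """
    statements = split_into_statements(response)
    if not statements:
        raise ValueError("response contains no statements to verify")
    labels = [verify(s, reference) for s in statements]
    return sum(labels) / len(labels)

# Toy example with a substring check in place of an oracle LLM.
reference = "Marie Curie was born in Warsaw. She won two Nobel Prizes."
response = "Marie Curie was born in Warsaw. She won three Nobel Prizes."
score = segment_score(response, reference, lambda s, ref: s in ref)
print(score)  # 0.5: one of the two statements is supported
```

In practice the substring check would be replaced by a call to whatever oracle model is available, and the splitter by a proper decomposition step.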
Empirical evaluations show that semantic isotropy achieves state-of-the-art performance in predicting nonfactuality. On benchmarks like TriviaQA and FactScore-Bio, with responses around 500 words, it explained up to 65% of the variance in factuality scores, outperforming baselines such as perplexity, entropy-based measures, and LUQ-Atomic. The method proved robust across different LLMs, embedding models, and response lengths, and needed only 6 to 8 samples to deliver reliable indicators. The peak figure comes from FactScore-Bio, where Nomic Embed v1 embeddings gave an R² of 0.65, higher than the other metrics tested.
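For readers unfamiliar with the "variance explained" figure, the sketch below shows one common way to compute R² between a predictor (such as an isotropy score) and a target (such as a factuality score). It assumes a simple least-squares linear fit, which may not be the exact procedure used in the paper, and the numbers are made-up toy values rather than results from the study.

```python
import numpy as np

def r_squared(x, y) -> float:
    """R^2 of a least-squares line fitted to predict y from x."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)

    # Fit y ~ slope * x + intercept.
    slope, intercept = np.polyfit(x, y, 1)
    residuals = y - (slope * x + intercept)

    ss_res = np.sum(residuals ** 2)        # variance left unexplained
    ss_tot = np.sum((y - y.mean()) ** 2)   # total variance in y
    return float(1.0 - ss_res / ss_tot)

# Toy values only: higher isotropy paired with lower factuality.
isotropy = [0.10, 0.25, 0.40, 0.55, 0.70]
factuality = [0.95, 0.80, 0.75, 0.50, 0.45]
print(r_squared(isotropy, factuality))
```

An R² of 0.65 would then mean that 65% of the variation in factuality scores is accounted for by the fitted relationship with the isotropy score.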
In real-world contexts, this innovation matters because it enables scalable trust assessment in applications like automated reporting or educational tools, where quick verification of AI outputs is essential. By reducing computational costs—taking about 1.8 seconds per topic on a V100 GPU compared to minutes for other methods—it makes reliable fact-checking feasible for everyday use, potentially preventing errors in critical decisions.
Limitations include the need for repeated sampling, which adds generation cost, though advances in fast inference engines mitigate this. The paper notes that the method may not perform as well on intrinsic hallucinations, where responses are contradictory, although it describes such cases as rare in practice. Its effectiveness can also vary with the generator model's capabilities, as seen in the lower performance on TriviaQA for highly capable models like GPT-4.
About the Author
Guilherme A.
Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.