AI Models Struggle to Compare Texts Reliably

Large language models (LLMs) are increasingly used to compare and rank texts in applications ranging from content moderation to academic review, yet their reliability at this task has gone largely unverified. A new study by Tianyi Li of The Chinese University of Hong Kong addresses this gap by measuring LLMs' error rates in pairwise text comparisons. It finds that these models often fail to give consistent judgments, especially as the number of texts increases. The finding matters for anyone relying on AI for decision-making, because it points to limitations in automated evaluation that could affect fairness and accuracy in real-world use.

The key finding is that LLMs make errors at substantial rates when comparing texts, with estimates ranging from 5% to 50% depending on the model and text type. In experiments with meaningless pseudo-word paragraphs, for instance, error rates approached 50%, which is close to random guessing. The study found that models such as Qwen performed best overall, while others, such as ChatGPT and Gemini, showed higher error rates, particularly with ambiguous content. Given this inconsistency, LLM rankings should not be treated as precise without external validation.

To assess error rates, the study used a method based on pairwise comparisons that does not rely on ground truth. LLMs were shown pairs of texts (advertising slogans, short poems, and academic abstracts) and asked to indicate which was better. Error probabilities were then calculated from the models' preferences across repeated comparisons, including comparisons with the order of the two texts swapped. The approach used matrices to track inconsistencies, such as a model preferring one text in one order but the other text when the positions were reversed, which reveal underlying errors in judgment.
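The order-swap check described above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the function name `swap_inconsistency_rate`, the matrix layout, and the toy data are all assumptions made for the example. It only counts how often the preferred text changes when the presentation order is reversed; it does not reproduce the study's error-rate estimation.

```python
from itertools import combinations

def swap_inconsistency_rate(prefers_first):
    """Fraction of text pairs whose winner changes when the order is swapped.

    prefers_first[i][j] is True if the model picked text i when shown
    the pair in the order (i, j), and False if it picked text j.
    Diagonal entries are ignored. (Hypothetical layout for illustration.)
    """
    n = len(prefers_first)
    pairs = list(combinations(range(n), 2))
    inconsistent = 0
    for i, j in pairs:
        winner_ij = i if prefers_first[i][j] else j  # order (i, j)
        winner_ji = j if prefers_first[j][i] else i  # order (j, i)
        if winner_ij != winner_ji:
            inconsistent += 1
    return inconsistent / len(pairs)

# Toy judgments for three texts (made-up data, not from the study).
toy = [
    [None, True, True],    # text 0 shown first
    [False, None, True],   # text 1 shown first
    [True, False, None],   # text 2 shown first
]
rate = swap_inconsistency_rate(toy)
print(f"share of pairs that flip under order swap: {rate:.2f}")
```

In the toy data, only the pair (0, 2) flips: each text wins when it is shown first, which is the kind of positional effect the study reports.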

The results, detailed in Figure 1 and Figure 2, show that errors compound as more texts are compared, so the method does not scale. Under a uniform error assumption, for example, the probability of ranking the texts correctly fell steadily as the set grew, and simulations showed a growing deviation from the ideal scores. In one test with ChatGPT, the best-fit error rate was 13%, and commutativity scores, which measure consistency across order swaps, averaged around 39%, indicating frequent flip-flops in preferences. Positional bias further complicated the results, with error rates differing depending on which text was presented first.
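A rough back-of-the-envelope sketch shows why a fixed per-comparison error rate hurts more as the set grows. It assumes, as a simplification that is not necessarily the paper's model, that every pairwise comparison fails independently with the same probability; ranking n texts from all pairs then requires n(n-1)/2 correct comparisons. The function name is hypothetical, and the 13% figure is the ChatGPT best-fit rate quoted above, used purely for illustration.

```python
def prob_all_pairs_correct(n_texts, error_rate):
    """Probability that every pairwise comparison among n_texts is correct,
    assuming independent comparisons with one uniform error rate
    (a simplifying assumption made for this sketch)."""
    n_pairs = n_texts * (n_texts - 1) // 2
    return (1.0 - error_rate) ** n_pairs

# Illustrative error rate taken from the article's ChatGPT example.
for n in (2, 3, 5, 10):
    p = prob_all_pairs_correct(n, 0.13)
    print(f"{n:>2} texts, {n * (n - 1) // 2:>2} pairs: P(all correct) = {p:.4f}")
```

Because the number of pairs grows quadratically with the number of texts, the chance of an error-free set of comparisons shrinks quickly even when each individual comparison is usually right.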

This research matters because it exposes risks in using LLMs for high-stakes tasks like hiring or content curation, where inconsistent judgments could lead to biased outcomes. If an AI system evaluates job applications or scientific abstracts unreliably, for instance, it might unfairly favor certain candidates. The practical implication is that developers and users should incorporate error estimates to mitigate hallucinations, in which models generate incorrect or deceptive outputs, so that AI aids rather than misleads human decision-making.

The paper's limitations include the assumption that comparison sequences are independent, which may not hold in practice, and its focus on zero-shot prompts without reasoning. Future work could explore whether adding explanations or using retrieval-augmented techniques reduces errors. For now, the study underscores that LLMs' comparison abilities are error-prone and require careful validation in applied settings.

About the Author

Guilherme A.