AI Judges Now Think More Like Humans

Large language models (LLMs) are increasingly used to evaluate everything from essays to chatbot responses, but their judgments often lack the nuanced reasoning that human experts apply. This limitation becomes critical in tasks requiring subjective assessment, where simple labels fall short. A new study from researchers at the University of Michigan, the University of Minnesota, and the University of California, San Diego introduces a method to infer the 'thinking traces' (the internal reasoning steps) that guide human judgments, enabling AI evaluators to align more closely with human raters and improve reliability across diverse applications.

The key finding is that AI models can reconstruct human-like reasoning processes from basic annotations, using a technique called rejection sampling. This involves generating multiple potential reasoning paths for a given task and selecting those where the AI's final judgment matches the human label. These inferred thinking traces serve as practical proxies for the unobserved cognitive steps humans take, such as applying guidelines or resolving ambiguities, which are typically absent in standard datasets due to the high cost of collection.
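
The selection step described above can be sketched in a few lines of Python. This is a minimal illustration rather than the researchers' implementation: the names `infer_traces`, `toy_sampler_factory`, and `n_samples` are hypothetical, the sampler stands in for a real reasoning model, and acceptance here simply means the sampled label equals the human label.

```python
import random
from typing import Callable, List, Tuple

# Hypothetical interface: a sampler takes the item text and returns
# (reasoning_trace, predicted_label). A real system would call a reasoning model.
Sampler = Callable[[str], Tuple[str, int]]


def infer_traces(item: str, human_label: int, sample: Sampler,
                 n_samples: int = 8) -> List[str]:
    """Rejection sampling: keep traces whose final label matches the human label."""
    accepted = []
    for _ in range(n_samples):
        trace, predicted = sample(item)
        if predicted == human_label:
            accepted.append(trace)  # accept: the reasoning ends at the human's judgment
        # otherwise reject the trace and draw again
    return accepted


def toy_sampler_factory(seed: int) -> Sampler:
    """Stand-in for a reasoning model: emits a random 1-5 score with a stub trace."""
    rng = random.Random(seed)

    def sample(item: str) -> Tuple[str, int]:
        score = rng.randint(1, 5)
        return f"Stub reasoning about {item!r} ending in score {score}", score

    return sample


if __name__ == "__main__":
    traces = infer_traces("a short story", human_label=4,
                          sample=toy_sampler_factory(seed=0), n_samples=20)
    print(f"accepted {len(traces)} of 20 sampled traces")
```

If no sampled trace reaches the human label, the function returns an empty list, which mirrors the case where the model's judgment is too far from the human's for the method to work well.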

Methodologically, the approach leverages reasoning language models (RLMs), which are designed to produce intermediate reasoning tokens before arriving at an answer. For instance, in evaluating story complexity, an RLM might generate critiques about plot coherence or character development. The researchers applied this to two main scenarios: fine-tuning open-weight LLMs to specialize as raters and refining annotation codebooks for black-box models like GPT-5 and Claude-4-Sonnet. In the fine-tuning case, models were trained not just on input-label pairs but on the complete inferred reasoning traces, encouraging them to learn the underlying judgment process. For codebook refinement, the traces were used to synthesize clearer, step-by-step instructions and rubrics, addressing ambiguities that lead to inconsistent ratings.
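
To make the contrast between label-only and trace-based training concrete, here is a rough sketch of how a single training record might be assembled. The record layout, the `<think>` delimiters, the `Rating:` suffix, and the function names are all assumptions made for illustration; the article does not specify the format the researchers used.

```python
import json


def build_label_only_example(task_input: str, label: int) -> dict:
    """Baseline record: the model only sees the input and the final label."""
    return {"prompt": task_input, "completion": f"Rating: {label}"}


def build_trace_example(task_input: str, trace: str, label: int) -> dict:
    """Trace record: the target contains the inferred reasoning, then the label.

    The <think> tags are an assumed delimiter, not a format taken from the study.
    """
    target = f"<think>\n{trace}\n</think>\nRating: {label}"
    return {"prompt": task_input, "completion": target}


if __name__ == "__main__":
    record = build_trace_example(
        task_input="Rate the complexity of this story on a 1-5 scale: (story text)",
        trace="The plot is coherent but the characters barely develop.",
        label=3,
    )
    # One JSON object per line is a common layout for fine-tuning data.
    print(json.dumps(record))
```

The only difference between the two records is the target text, which is the point: the trace version asks the model to reproduce the path to the label, not just the label.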

Results from experiments across five datasets—including story evaluation, essay scoring, and translation assessment—show significant improvements. For example, fine-tuned models achieved an average 42.6% increase in Kendall's Tau correlation with human judgments, a measure of ranking agreement. In codebook refinement, agreement metrics rose by up to 14.2%, and inter-rater reliability improved, with intraclass correlation coefficients increasing from 0.580 to 0.621 in some tasks. The study also highlights that refined codebooks help different AI models converge on similar ratings, reducing bias and enhancing consistency without requiring model retraining.
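
For readers unfamiliar with Kendall's Tau, the sketch below computes the tau-b variant, which corrects for tied ratings, in plain Python. Which variant the study reports is not stated here, so treat tau-b as an assumption; the function name `kendall_tau_b` and the sample ratings are illustrative only.

```python
from itertools import combinations
from math import sqrt
from typing import Sequence


def kendall_tau_b(x: Sequence[float], y: Sequence[float]) -> float:
    """Kendall's tau-b between two equally long lists of ratings."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    concordant = discordant = tied_x_only = tied_y_only = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue  # tied in both lists: contributes to neither term
        if dx == 0:
            tied_x_only += 1
        elif dy == 0:
            tied_y_only += 1
        elif dx * dy > 0:
            concordant += 1  # both raters order this pair the same way
        else:
            discordant += 1  # the raters disagree on the order of this pair
    pairs = concordant + discordant
    denom = sqrt((pairs + tied_x_only) * (pairs + tied_y_only))
    return (concordant - discordant) / denom if denom else float("nan")


if __name__ == "__main__":
    human = [1, 2, 3, 4, 5]
    model = [1, 3, 2, 4, 5]  # one swapped pair out of ten
    print(kendall_tau_b(human, model))  # 0.8
```

A value of 1.0 means the model ranks every pair of items in the same order as the human raters, 0 means no ranking agreement, and -1.0 means the order is fully reversed.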

The implications extend to real-world settings where AI evaluators are deployed, such as educational grading or content moderation. By making AI judgments more transparent and aligned with human reasoning, this method could lead to fairer and more reliable automated systems. For instance, in story evaluation, it helps AI recognize nuanced elements like emotional resonance or structural flaws, rather than relying on superficial scores.

Limitations noted in the paper include the dependency on capable RLMs for effective trace inference; if the gap between AI and human judgment is too large, the method may struggle. Additionally, the current approach aggregates traces across raters rather than capturing individual annotator variations, and it does not explore real-time human-AI collaboration for trace validation. Future work could address these by developing strategies for broader reasoning coverage and integrating lightweight human feedback to further enhance fidelity.

About the Author

Guilherme A.