AI Reasoning Traces Often Skip Critical Steps

TL;DR

New research finds AI step-by-step reasoning is frequently incomplete, weakening safety monitoring and transparency in high-stakes applications.

As artificial intelligence systems increasingly guide decisions in healthcare, finance, and security, their ability to explain their reasoning step by step has become crucial for safety and trust. However, a new study shows that these explanations often leave out essential details, making it harder to detect errors or deceptive behavior. This gap threatens the reliability of AI oversight mechanisms designed to prevent harmful outcomes.

The researchers found that AI models frequently produce chain-of-thought outputs that appear faithful, meaning they transparently reflect the model's internal reasoning, but lack verbosity: they fail to list all the factors needed to solve a task. For instance, in a date calculation problem, a model might correctly state that Christmas Eve is December 24th but never mention the year 1937, leaving the reasoning trace incomplete. Evaluating models on benchmarks such as BBH, GPQA, and MMLU, the team found that even advanced models such as DeepSeek-R1 and Claude Sonnet reached average monitorability scores of only 78.3% and 68.8%, respectively, indicating significant omissions in their explanations.

To assess faithfulness and verbosity, the researchers developed a pipeline that uses high-performing models to extract causal factors, the key pieces of information required to answer a question, and then checks whether the AI's reasoning trace includes them. For example, in a logic puzzle about birds on a branch, factors like 'the owl is leftmost' and 'the quail is rightmost' were identified; models were graded on whether they mentioned these in their step-by-step explanations. This approach goes beyond previous methods that only tested whether models acknowledged cues in their inputs, providing a more holistic view of transparency.
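The factor check described above can be illustrated with a minimal sketch. This is not the paper's pipeline: the function names, the result type, and the scoring rule (fraction of factors mentioned) are assumptions made for illustration, and a plain substring match stands in for the model-based judge the researchers actually used.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class VerbosityResult:
    mentioned: List[str]  # factors the judge found in the trace
    missing: List[str]    # factors the judge did not find
    score: float          # fraction of factors mentioned (assumed scoring rule)


def substring_judge(factor: str, trace: str) -> bool:
    """Hypothetical stand-in for a model-based judge: case-insensitive match."""
    return factor.lower() in trace.lower()


def score_verbosity(
    factors: List[str],
    trace: str,
    judge: Callable[[str, str], bool] = substring_judge,
) -> VerbosityResult:
    """Check which causal factors appear in a reasoning trace."""
    if not factors:
        raise ValueError("at least one causal factor is required")
    mentioned = [f for f in factors if judge(f, trace)]
    missing = [f for f in factors if not judge(f, trace)]
    return VerbosityResult(mentioned, missing, len(mentioned) / len(factors))


if __name__ == "__main__":
    # Factors taken from the bird puzzle example; the trace is invented.
    factors = ["the owl is leftmost", "the quail is rightmost"]
    trace = "The owl is leftmost, so every other bird sits to its right."
    result = score_verbosity(factors, trace)
    print(result.score, result.missing)
```

In this toy run the trace states one of the two factors, so half of the required information is never externalized. A real judge would need to recognize paraphrases, which is why the study relies on language models rather than string matching.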

The results, detailed in Figures 4 and 5 of the paper, show that models often arrive at correct answers without fully externalizing their reasoning, especially in complex tasks. On the GPQA dataset, which consists of graduate-level questions, monitorability scores were lowest, suggesting that harder problems exacerbate omissions. Notably, models designed for extended thinking, like Qwen-3-235B-Thinking, showed higher verbosity but not always better performance, because overly long traces can degrade accuracy, an inverse scaling effect.

This research matters because incomplete reasoning traces can undermine AI safety in real-world scenarios, such as medical diagnostics or autonomous systems, where missing steps could lead to undetected errors or manipulation. For example, if an AI hides critical factors in its reasoning, monitors might fail to catch biased or unsafe decisions, as seen in studies where models crafted misleading traces to evade detection. The findings emphasize the need for AI systems that not only reason correctly but also document every relevant step to ensure accountability.

Limitations of the study include its focus on single-turn question-answering tasks and the use of model-based judges, which may overcount factors. The paper notes that future work should explore multi-turn interactions and human-verified evaluations to improve accuracy. By releasing their code and benchmark via the NSPECT library, the researchers aim to support further advancements in making AI reasoning fully transparent and monitorable.

About the Author

Guilherme A.

Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.
