A new study examining how AI agents behave when given the same task multiple times has uncovered a surprising pattern: consistency in these systems often amplifies errors rather than ensuring correctness. Researchers from Snowflake AI Research tested three leading language models—Claude 4.5 Sonnet, GPT-5, and Llama-3.1-70B—on 10 real-world software engineering tasks from the SWE-bench benchmark, running each task five times to measure behavioral variance. The results reveal that while more consistent agents tend to be more accurate overall, they frequently make the same mistakes repeatedly, with 71% of Claude's failures classified as 'consistent wrong interpretation.' This challenges the common assumption that consistency automatically translates to reliability in AI systems deployed for complex, multi-step reasoning tasks.
The key finding from the research shows a clear hierarchy in both consistency and accuracy across the three models tested. Claude achieved the highest consistency with a coefficient of variation (CV) of 15.2% and the highest accuracy at 58%, meaning it solved 29 out of 50 attempts correctly. GPT-5 showed intermediate performance with 32.2% CV and 32% accuracy (16/50 correct), while Llama demonstrated the highest variance at 47.0% CV and the lowest accuracy at just 4% (2/50 correct). The coefficient of variation measures how much step counts vary across runs, with lower values indicating more consistent behavior. Despite these differences, all models produced unique action sequences in 100% of runs, showing that even consistent agents don't follow identical paths but maintain similar strategic approaches.
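The coefficient of variation described above is straightforward to compute. The sketch below shows the standard formula (sample standard deviation divided by mean, as a percentage) applied to a hypothetical set of per-run step counts; the specific numbers are illustrative, not taken from the paper.

```python
import statistics

def coefficient_of_variation(step_counts):
    """CV = (sample standard deviation / mean) of per-run step counts,
    expressed as a percentage. Lower values indicate more consistent
    run-to-run behavior."""
    mean = statistics.mean(step_counts)
    stdev = statistics.stdev(step_counts)  # sample (n-1) standard deviation
    return 100.0 * stdev / mean

# Hypothetical step counts for five runs of a single task:
runs = [42, 48, 51, 44, 46]
print(f"CV = {coefficient_of_variation(runs):.1f}%")
```

On this toy data the CV comes out well under Claude's reported 15.2%, which would count as highly consistent behavior under the study's metric.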
The methodology involved the SWE-bench Verified benchmark, which requires agents to resolve real GitHub issues through multi-step code modifications. Each task demanded exploration of large codebases, understanding of bug context, implementation of fixes, and verification through testing. The researchers selected 10 tasks from the astropy repository, chosen for diversity in bug types and fix complexity, with median fix sizes of 4.5 lines. All models had access to identical tools including bash commands for file system navigation, code editing, and test execution, running in isolated Docker containers with temperature set to 0.5 for moderate stochasticity. They measured consistency using the coefficient of variation of step counts and accuracy through the official SWE-bench evaluation harness, which applies each patch and runs the repository's test suite.
Analysis of the results reveals several important patterns. The data shows that consistency amplifies outcomes rather than guaranteeing correctness—when Claude correctly interpreted a task, it succeeded on all five runs, but when it misinterpreted a task, it failed on all five runs. This 'consistent wrong' pattern accounted for 71% of Claude's failures. The research also uncovered a speed-accuracy-consistency tradeoff: GPT-5 was 4.7 times faster than Claude (9.9 vs 46.1 steps on average) but achieved 1.8 times lower accuracy and 2.1 times worse consistency. Phase decomposition analysis showed that Claude invested heavily in understanding (41.2% of actions), GPT-5 emphasized verification (32.3%), and Llama spent more time exploring (28.1%). Surprisingly, Claude and GPT-5 diverged at similar steps in their trajectories (3.2 vs 3.4) yet achieved very different consistency levels, indicating that early agreement alone doesn't determine behavioral variance.
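The 'consistent wrong' pattern can be made concrete with a small classifier over repeated-run outcomes. This is a minimal sketch of the idea, not the paper's actual labeling code, and the category names are illustrative:

```python
from typing import List

def classify_outcome(run_results: List[bool]) -> str:
    """Classify a task's outcome pattern across repeated runs.

    run_results holds pass/fail for each repeated attempt on one task.
    A task where every run fails reflects the 'consistent wrong' pattern:
    the agent reliably repeats the same misinterpretation rather than
    averaging its errors away. (Label names here are illustrative.)"""
    if all(run_results):
        return "consistent correct"
    if not any(run_results):
        return "consistent wrong"
    return "inconsistent"

print(classify_outcome([True] * 5))    # solved on all five runs
print(classify_outcome([False] * 5))   # failed on all five runs
print(classify_outcome([True, False, True, False, True]))
```

Under the study's findings, Claude's tasks would fall overwhelmingly into the first two buckets, with the all-fail bucket accounting for 71% of its failures.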
The implications of these findings are significant for real-world AI deployment. The research suggests that interpretation quality, not execution consistency, is the primary bottleneck for reliable agent performance. This means current approaches focusing on better tool use or longer trajectories may need to shift toward improving initial task understanding. The study also reveals that consistency can be a double-edged sword—valuable when the approach is correct but dangerous when it leads to reliable repetition of errors. For benchmarking practices, the finding that 100% of runs produce unique sequences suggests single-run evaluations may be misleading, and multi-run assessments with consistency reporting should become standard. The speed-accuracy-consistency triangle identified in the research means practitioners must choose priorities based on context: rapid prototyping may favor faster models, while production systems may require more reliable ones.
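A multi-run assessment with consistency reporting, as recommended above, could be aggregated along these lines. This is a hypothetical sketch (the task IDs, data layout, and field names are invented for illustration), combining per-task pass rate with the step-count CV rather than reporting a single pass/fail:

```python
import statistics

def multi_run_report(task_runs):
    """Summarize a multi-run evaluation: per-task pass rate plus
    step-count consistency, instead of a single-run verdict.

    task_runs maps a task id to a list of (passed, steps) tuples,
    one tuple per repeated run of that task."""
    report = {}
    for task, runs in task_runs.items():
        passes = [passed for passed, _ in runs]
        steps = [s for _, s in runs]
        cv = 100.0 * statistics.stdev(steps) / statistics.mean(steps)
        report[task] = {
            "pass_rate": sum(passes) / len(passes),
            "step_cv_pct": round(cv, 1),
        }
    return report

# Hypothetical results for two tasks, five runs each:
results = {
    "task-001": [(True, 44), (True, 47), (True, 45), (True, 50), (True, 46)],
    "task-002": [(False, 12), (False, 30), (True, 18), (False, 25), (False, 9)],
}
for task, summary in multi_run_report(results).items():
    print(task, summary)
```

Reporting both numbers distinguishes a task an agent solves reliably from one it happens to solve once amid high variance—exactly the distinction a single run hides.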
Limitations of the study include a sample size of only three models and 10 tasks, which allows identification of trends but not establishment of strong statistical relationships. All tasks came from the astropy repository, and different codebases or bug types might show different patterns. The research used temperature 0.5 throughout, and different temperature settings could change both consistency and accuracy. The study observes correlation and provides mechanistic explanations but cannot prove causality between consistency and accuracy—an intervention study forcing models to explore more or less would strengthen causal claims. Additionally, the failure mode labeling was not blinded, though the researchers note this limitation and release all trajectories for independent verification.
About the Author
Guilherme A.
Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.