A new study examines the growing reliance on AI tools to review code generated by other AI systems, arguing that without clear, executable specifications, this approach is fundamentally flawed. The research, based on three hypotheses and supporting experiments, finds that AI reviewers often fail to catch errors because they check code against itself rather than against the original intent, leading to correlated failures that echo rather than cancel out. This structural problem is highlighted by industry data, such as the DORA 2026 report, which shows that higher AI adoption correlates with both increased throughput and increased instability, as time saved in code generation is spent on auditing. The paper suggests that this circularity in AI-assisted pipelines could compromise software quality if not addressed with better practices.
The key finding is that AI reviewers, when operating without external specifications, exhibit correlated errors with the code-generating AI, meaning they share the same blind spots and fail to detect certain bugs. This is supported by empirical evidence from multiple 2025-2026 studies, including one that coined the term "popularity trap" to describe how models converge on wrong answers. In contrived experiments, the researchers tested this with planted bugs in Python functions, finding that AI review detection rates varied from 0% to 100% depending on domain opacity. For example, in one experiment, a bug involving log-linear interpolation in finance was missed entirely by AI reviewers, while BDD scenarios caught it deterministically. The study refines this claim, noting that classic boundary conditions are reliably caught, but domain-specific conventions often go undetected.
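To make the contrast concrete, here is a minimal sketch in the spirit of the study's planted-bug experiments. The function name, values, and the specific off-by-one bug are illustrative assumptions, not taken from the paper's actual corpus; the point is that an executable check pins behavior to intent and catches the bug deterministically, independent of any reviewer's blind spots.

```python
# Hypothetical planted bug: an inclusive day count implemented exclusively.

def days_in_billing_cycle(start_day: int, end_day: int) -> int:
    """Intended behavior: inclusive count, e.g. day 1 through day 3 is 3 days."""
    return end_day - start_day  # planted bug: drops the "+ 1"

def check_billing_cycle() -> bool:
    """Executable specification: pins intended behavior to concrete cases."""
    expectations = {(1, 1): 1, (1, 3): 3, (10, 20): 11}
    return all(days_in_billing_cycle(start, end) == expected
               for (start, end), expected in expectations.items())

print(check_billing_cycle())  # prints False: the planted bug is caught
```

A reviewer (human or AI) reading only the code might find the subtraction plausible; the check fails because it encodes the intent, not the implementation.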
The methodology involved developing three interconnected hypotheses and conducting small, contrived experiments to test them. The first hypothesis focuses on correlated errors in homogeneous AI pipelines, where both generator and reviewer lack an external reference. Experiments used five Python functions with single planted bugs, comparing AI review without specifications to BDD scenarios with precise specifications. The second hypothesis applies the Cynefin framework to argue that executable specifications move problems from the complex domain to the complicated domain by converting enabling constraints into governing constraints. The third hypothesis proposes a taxonomy of defect classes based on specifiability, grounding it in oracle problem theory. The experiments, while directional and not statistically significant, included same-family and cross-family model reviews, with materials publicly available for replication.
Results from the experiments show a gradient in AI review effectiveness. In Experiment 1, classic boundary-condition bugs were detected 100% of the time by AI review, matching BDD. Experiment 2, with domain-convention violations, saw AI review detection rates ranging from 0% to 100%, while BDD caught all bugs. For instance, a bug in insurance premium proration was caught 100% of the time, but one in log-linear interpolation was missed entirely. Experiment 3 extended this to a cross-family panel of four models, confirming the gradient: detection varied by domain opacity and model family, with some models confidently asserting wrong conventions as correct. The ICD-10-CM external cause code rule was missed by all models in 20 runs, while BDD caught it every time, providing strong directional evidence for the hypothesis.
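The log-linear interpolation case illustrates why domain conventions are hard for a reviewer without a specification. Below is a hypothetical reconstruction of that kind of bug (the function names and numbers are illustrative, not the study's actual corpus): plain linear interpolation is substituted where the finance convention calls for interpolating the logarithm of the discount factor. Both versions look reasonable in isolation; only a check pinned to the convention distinguishes them.

```python
import math

def interp_discount_factor(t, t0, df0, t1, df1):
    # Planted bug: linear interpolation, where the domain convention is log-linear.
    w = (t - t0) / (t1 - t0)
    return df0 + w * (df1 - df0)

def interp_discount_factor_spec(t, t0, df0, t1, df1):
    # Intended convention: interpolate log(df) linearly, then exponentiate.
    w = (t - t0) / (t1 - t0)
    return math.exp((1 - w) * math.log(df0) + w * math.log(df1))

buggy = interp_discount_factor(1.5, 1.0, 0.97, 2.0, 0.94)
spec = interp_discount_factor_spec(1.5, 1.0, 0.97, 2.0, 0.94)
print(abs(buggy - spec) > 1e-6)  # prints True: the convention violation is detectable
```

A generic reviewer sees syntactically valid interpolation code; it takes an externally supplied expected value, as a BDD scenario provides, to surface the discrepancy.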
The implications of this research are significant for software development practices, suggesting that a specification-first architecture could make AI-assisted pipelines more reliable. By writing executable specifications such as BDD scenarios, teams can move problems into a domain where cause and effect are analyzable, reducing reliance on probabilistic AI review for core logic. The proposed architecture puts specifications first, followed by deterministic verification pipelines, with AI review reserved for structural and architectural residuals. This ordering addresses the bottleneck shift identified in the DORA 2026 report, where specification and verification become the scarce resources. Real-world applications could include industries like finance or healthcare, where domain conventions are critical and often missing from AI training data.
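The specification-first ordering can be sketched as a plain-Python stand-in for a BDD framework (a simplifying assumption; real teams would use a tool such as a Gherkin runner). The proration function and figures are hypothetical; the structure shows the Given/When/Then scenario running deterministically before any probabilistic review step.

```python
# Specification-first sketch: an executable scenario gates the implementation.

def prorate_premium(annual_premium: float, days_covered: int) -> float:
    """Implementation under review (logic and convention are illustrative)."""
    return annual_premium * days_covered / 365

def scenario_mid_term_cancellation() -> bool:
    # Given an annual premium of 730.00
    premium = 730.00
    # When the policy is cancelled after 100 covered days
    earned = prorate_premium(premium, 100)
    # Then the earned premium is exactly 200.00 under the 365-day convention
    return round(earned, 2) == 200.00

print(scenario_mid_term_cancellation())  # prints True: deterministic verdict
```

Only after scenarios like this pass would AI review weigh in on structure and architecture, the residuals the paper argues it is suited for.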
Limitations of the study are openly acknowledged, including the use of a planted bug corpus rather than natural defects, which gives BDD scenarios an unfair advantage. The experiments are directional and not statistically significant, with small sample sizes and contrived conditions. For example, the cross-family panel included two models from the same Anthropic family, limiting diversity. The Cynefin mapping remains unvalidated by the Cynefin community, and the defect taxonomy is novel and untested. Additionally, the research cannot test bugs based on unpublished internal policies, which represent the most consequential cases of correlated error. These gaps highlight areas for future work, such as controlled studies with natural defect samples and broader model diversity.
About the Author
Guilherme A.
Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.