Large Language Models (LLMs) are rapidly automating the creation of software quality documents, but ensuring these AI-generated outputs meet high standards has been a persistent challenge. A new systematic technique developed by IBM researchers addresses this by enabling AI to baseline and evaluate its own work using quantifiable metrics, bridging the gap between automation and accountability. This approach is crucial for software development teams that rely on AI for scalable quality engineering but need to maintain rigorous standards without excessive manual oversight.
The key finding from the research is that reverse-generated artefacts can significantly enhance quality, especially when initial inputs are weak. In experiments across 12 software projects involving over 150 requirement-test case pairs, reverse-generated requirements showed improvements of up to 55% in testability and 48% in completeness compared to low-quality original inputs. For example, when original requirements were of low quality, reverse generation led to measurable gains, while high-quality inputs consistently yielded superior outputs, demonstrating reverse generation's dual role as both a quality enhancer and a validation mechanism. The researchers also found that requirements derived from Behavior-Driven Development (BDD) scenarios outperformed those from plain test cases, with a 15% higher testability score and stronger semantic preservation.
The methodology combines LLM-driven generation, reverse generation, and iterative refinement guided by a rubric of quality metrics. The process begins by generating initial QE artefacts, such as test cases or BDD scenarios, from input requirements using LLMs. Next, a reverse generation step reconstructs the original inputs from these artefacts, enabling consistency checks. To compare artefacts, the technique uses SBERT (Sentence-BERT) to encode sentences into dense vector representations and compute cosine similarity scores, categorizing them as High (>0.8), Medium (0.6-0.8), Low (0.3-0.6), or No Match. These scores guide targeted recommendations for human-in-the-loop intervention, such as merging, refining, or retaining segments. The unified artefact then undergoes iterative refinement cycles evaluated against predefined rubrics focusing on clarity, completeness, consistency, and testability, with each cycle driven by metric-based feedback until quality thresholds are met.
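The similarity-based matching can be sketched in a few lines. The thresholds below are the ones reported in the paper; the vectors here are toy stand-ins, since in practice each sentence would be encoded with an SBERT model (for example, via the sentence-transformers library) before computing cosine similarity.

```python
from math import sqrt

def categorize(score):
    """Map a cosine similarity score to the paper's match categories."""
    if score > 0.8:
        return "High"
    if score >= 0.6:
        return "Medium"
    if score >= 0.3:
        return "Low"
    return "No Match"

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy embeddings standing in for SBERT sentence vectors of an original
# requirement and its reverse-generated counterpart.
original = [0.9, 0.1, 0.3]
reverse_generated = [0.8, 0.2, 0.35]
score = cosine_similarity(original, reverse_generated)
print(categorize(score))  # these near-identical vectors fall in "High"
```

A "High" match suggests the segment can be retained as-is, while "Medium" or "Low" scores would trigger the merge/refine recommendations for human review described above.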
Experimental results detailed in the paper show clear trends in artefact quality. For instance, Table 1 from the study reveals average scores across artefact types: original high-quality requirements scored 4.7 in clarity, 4.5 in completeness, 4.3 in testability, and 4.6 in consistency, while reverse-generated requirements from BDD artefacts scored 4.3, 4.4, 4.5, and 4.2, respectively. Low-quality original inputs scored as low as 2.1 in clarity and 1.8 in testability, but after refinement, reverse-generated versions showed significant improvements. Table 2 illustrates metric improvement across refinement cycles, with low-quality inputs increasing from an average score of 2.6 in Cycle 1 to 4.1 in Cycle 3, while high-quality inputs plateaued after two cycles. SBERT similarity analysis confirmed these improvements, with scores for low-quality inputs rising from approximately 0.35 to 0.62 across cycles, indicating enhanced semantic alignment.
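The cycle-by-cycle refinement described above can be sketched as a simple control loop. This is a hypothetical illustration, not the paper's implementation: `evaluate` and `refine` are stand-ins for the LLM-driven scoring and rewriting steps, and the toy score sequence mimics the low-quality trajectory reported in Table 2 (averages rising from roughly 2.6 to 4.1 over three cycles).

```python
RUBRIC = ("clarity", "completeness", "consistency", "testability")

def refine_until_threshold(artefact, evaluate, refine, threshold=4.0, max_cycles=3):
    """Iterate evaluate/refine until every rubric metric meets the threshold
    or the cycle budget is exhausted; return the artefact and cycles used."""
    scores = {}
    for cycle in range(1, max_cycles + 1):
        scores = evaluate(artefact)          # maps each rubric metric to a 1-5 score
        if all(scores[m] >= threshold for m in RUBRIC):
            return artefact, cycle, scores   # quality thresholds met
        artefact = refine(artefact, scores)  # metric-based feedback drives the rewrite
    return artefact, max_cycles, scores

# Toy evaluator replaying a low-quality input's improvement across cycles.
history = iter([
    dict(zip(RUBRIC, (2.5, 2.7, 2.6, 2.6))),
    dict(zip(RUBRIC, (3.4, 3.5, 3.3, 3.4))),
    dict(zip(RUBRIC, (4.1, 4.2, 4.0, 4.1))),
])
artefact, cycles, scores = refine_until_threshold(
    "draft requirement",
    evaluate=lambda a: next(history),
    refine=lambda a, s: a + " (refined)",
)
print(cycles)  # → 3
```

Under this scheme, the high-quality inputs plateauing after two cycles simply corresponds to the threshold check passing earlier, so no further LLM calls are spent.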
The implications of this technique are substantial for real-world software development, offering scalable quality assurance with reduced manual effort. By automating metric-based evaluation and semantic validation, the approach cuts manual review effort by 60-70%, making it suitable for Agile and DevOps environments where speed and reliability are paramount. It also promotes sustainable AI practices by reducing redundant LLM operations by 30%, with estimated energy savings of 21 kWh per refinement cycle and a CO2-equivalent reduction of 0.008 tons, as shown in Table 3. The framework's adaptability across artefact types and domains positions it for enterprise-scale deployment, helping teams generate high-quality QE artefacts at scale while ensuring traceability and contextual relevance.
Despite its strengths, the technique has limitations that must be considered. It remains dependent on input quality, requiring either high-quality inputs or comprehensive artefacts like test cases or BDD scenarios to achieve optimal outputs. LLMs may overlook implicit domain knowledge, producing artefacts that are superficially valid but semantically flawed, though SBERT helps mitigate this by quantifying semantic alignment. Tooling gaps exist, as the current implementation relies on Excel-based frameworks for outputs such as semantic and impact analysis, necessitating enterprise-scale integration for seamless adoption. Teams also need training in rubric design, prompt engineering, and interpreting semantic similarity thresholds, which raises the skill bar for adoption. Future work could address these limitations by developing integrated platforms, expanding rubrics to include domain-specific metrics, and validating the technique in cross-industry pilots to enhance robustness and adoption.
Original Source
Read the complete research paper
About the Author
Guilherme A.
Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.