AIResearch

AI Agents Compete to Fix Software Bugs More Reliably

A new adversarial AI framework pits test generators against code generators, achieving state-of-the-art performance in automated software repair by iteratively refining both tests and patches.

AI Research
March 27, 2026
4 min read

A new AI system called InfCode has demonstrated a novel approach to automating software bug fixes by making AI agents compete against each other. Developed by researchers from Beihang University and Beijing Tokfinity Technology, the framework addresses a persistent challenge in software engineering: large language models often generate patches that pass existing tests but fail to fully resolve the underlying defects. By introducing an adversarial loop in which one agent strengthens tests while another refines code, InfCode pushes both components toward higher reliability, achieving a 79.4% success rate on a rigorous benchmark and setting a new standard in the field.

The key finding from the research is that adversarial iteration between test and code generation significantly improves patch quality. In experiments on the SWE-bench Lite dataset, InfCode solved 121 out of 300 problems using the DeepSeek-V3 model, outperforming strong baselines like KGCompass, which solved 110 problems. This represents a 40.33% resolved rate, the highest among systems using similar models. Moreover, InfCode addressed 11 more unique problems than KGCompass, as shown in Figure 2, highlighting its ability to fix issues that other systems miss. On the more stringent SWE-bench Verified subset, InfCode achieved a 79.4% success rate with Claude 4.5 Sonnet, ranking first on the leaderboard and surpassing systems like TRAE + Doubao-Seed-Code at 78.80%.

The methodology centers on a multi-agent framework with two specialized agents operating in a containerized environment. A Test Patch Generator creates and strengthens test cases based on issue descriptions, aiming to expose incorrect behavior more effectively. A Code Patch Generator responds by producing improved code modifications to pass these tests. This adversarial interaction continues iteratively, with the Test Generator identifying weaknesses and adding stronger tests, while the Code Generator refines its patches. To prevent endless loops, the process has a hard cap on iterations and terminates early if the code passes all strengthened tests. A third agent, the Selector, evaluates all candidate patches on metrics like functional correctness and test coverage, choosing the most reliable one. The system uses tools like Bash Tool, Editor, Searcher, Submitter, and Executor within a Docker container for reproducible execution.
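The adversarial loop described above can be sketched as a small control flow. This is a minimal illustration, not the paper's implementation: the agent and selector interfaces (`generate`, `strengthen`, `refine`, `choose`) and the iteration cap are assumptions for exposition; the real system drives LLM agents and a Docker-based test executor behind these calls.

```python
MAX_ITERATIONS = 5  # hard cap on adversarial rounds (illustrative value, not from the paper)

def adversarial_repair(issue, test_agent, code_agent, selector, run_tests,
                       max_iterations=MAX_ITERATIONS):
    """Sketch of InfCode-style adversarial repair with hypothetical interfaces.

    test_agent  - generates and strengthens test cases from the issue description
    code_agent  - generates and refines candidate patches against those tests
    selector    - picks the most reliable patch from all candidates
    run_tests   - executes the test suite against a patch, returns True if all pass
    """
    tests = test_agent.generate(issue)            # initial tests from the issue
    patch = code_agent.generate(issue, tests)     # first candidate patch
    candidates = [patch]

    for _ in range(max_iterations):
        # Test Generator tries to expose remaining weaknesses in the patch.
        tests = test_agent.strengthen(issue, tests, patch)
        if run_tests(patch, tests):
            break                                 # patch survives strengthened tests
        # Code Generator responds with a refined patch.
        patch = code_agent.refine(issue, tests, patch)
        candidates.append(patch)

    # Selector evaluates all candidates (e.g. correctness, coverage) and picks one.
    return selector.choose(candidates, tests)
```

The hard iteration cap and the early exit on passing all strengthened tests mirror the termination conditions the article describes; everything else (method names, scoring inside the Selector) is placeholder structure.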

Analysis reveals the effectiveness of each component. Ablation studies in Table 2 show that removing the adversarial iteration module reduced performance to 36.33% resolved, while removing the selection module dropped it to 32.33%, indicating both are crucial, with selection having a larger impact. Tool invocation data in Figure 3 shows the Bash Tool was called most frequently (302.6 times per problem on average) but had a 10.04% failure rate, often due to attempts to run complex scripts or nonexistent commands. The Editor, used 167.01 times on average, had a 5.80% failure rate, while Searcher and Submitter had lower failure rates. Despite these errors, the overall low failure rates demonstrate system robustness. The framework's performance on SWE-bench Verified, as detailed in Table 3, confirms its state-of-the-art status, with InfCode solving 397 problems out of 500.

The implications of this research are significant for software development and AI automation. By improving the reliability of automated bug fixes, InfCode could reduce the time and effort developers spend on routine maintenance, allowing them to focus on more complex tasks. The adversarial approach mimics a quality assurance process in which rigorous testing drives better code, potentially leading to more stable software in real-world applications. As an open-source project, it offers a practical tool for integrating AI into development workflows, with potential extensions to other programming languages and ecosystems. This advancement highlights how AI can move beyond simple code generation to more sophisticated, iterative problem-solving in software engineering.

Limitations of the framework include issues with test generation accuracy and tool invocation errors. The paper notes that the Test Generator sometimes creates overly specialized tests that deviate from the issue description, misleading the Code Generator and reducing patch quality. This aligns with prior work showing that LLM-based agents can produce incorrect test patches. Additionally, the Bash Tool and Editor exhibit invocation failures, with the Bash Tool's higher failure rate attributed to the agent's unfamiliarity with the execution environment. Future improvements could focus on enhancing test faithfulness and refining the tool implementations. The researchers also acknowledge threats to validity, such as excluding test file modifications from evaluations to avoid bias, but emphasize the framework's language-agnostic design for broader applicability.

Original Source

Read the complete research paper

View on arXiv

About the Author

Guilherme A.

Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.

Connect on LinkedIn