AI Fixes Complex C++ Bugs That Stumped Other Systems

TL;DR

A new AI understands code intent and structure to solve real-world C++ bugs, beating previous tools by over 10 percentage points.

Artificial intelligence systems have become remarkably adept at fixing bugs in Python code, but they've consistently struggled with the complexities of C++ programming. Now, researchers have developed the first AI system specifically designed to navigate C++'s intricate structures, achieving what previous approaches couldn't. This breakthrough matters because C++ powers critical systems from operating systems to game engines, where bugs can have serious consequences, yet automated repair tools have lagged behind those for simpler languages.

The researchers found that their system, called InfCode-C++, resolves 25.58% of real-world C++ issues on the MultiSWEbenchCPP benchmark. This represents a substantial improvement over previous state-of-the-art systems, outperforming the strongest prior agent by 10.85 percentage points and more than doubling the performance of MSWE-agent. When tested on 129 real GitHub issues from five actively maintained C++ libraries, the system outperformed earlier tools at every difficulty level, solving 50% of easy issues, 25.42% of medium issues, and 9.52% of hard issues. These results suggest that specialized approaches are necessary for C++ repair: general-purpose AI systems designed for Python have shown drastic performance drops, with one leading configuration achieving only 14.7% resolution on C++ tasks.
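As a quick sanity check, the headline figures can be worked through in a few lines of Python. The issue count and the prior-best rate computed below are derived from the reported percentages rather than quoted from the researchers, so treat them as back-of-the-envelope estimates.

```python
# Back-of-the-envelope check of the reported headline figures.
TOTAL_ISSUES = 129     # reported benchmark size
RESOLVE_RATE = 25.58   # reported resolution rate, in percent
MARGIN_PP = 10.85      # reported lead over the strongest prior agent

# Implied number of resolved issues (derived, not stated in the article).
resolved = round(TOTAL_ISSUES * RESOLVE_RATE / 100)

# Implied resolution rate of the strongest prior agent (derived).
prior_best = round(RESOLVE_RATE - MARGIN_PP, 2)

print(f"about {resolved} of {TOTAL_ISSUES} issues resolved")
print(f"implied prior best: {prior_best}%")
```

The implied prior-best rate is in line with the roughly 14.7% figure cited for a leading general-purpose configuration.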

The system works through a novel two-stage approach that addresses C++'s unique challenges. First, it uses semantic code-intent retrieval to understand what high-level feature a bug relates to, allowing it to narrow down the search space from entire repositories to relevant functional modules. Second, it employs AST-structured querying that navigates the abstract syntax tree of C++ code, enabling precise identification of specific classes, functions, and inheritance relationships that text-based search tools like grep cannot reliably find. These two mechanisms work together through a three-agent framework: a Reproducer Agent that creates executable tests from issue descriptions, a Patch Agent that performs the retrieval and generates candidate fixes, and a Selector Agent that validates and ranks patches through behavioral testing and voting.
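The two retrieval stages can be illustrated with a toy sketch. Everything in it is hypothetical: the module descriptions, the symbol records, and the word-overlap scoring merely stand in for the real system's semantic retrieval and C++ parser, whose internals the article does not describe.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Symbol:
    """One declaration, as a C++ parser might report it (hypothetical schema)."""
    module: str
    kind: str          # "class" or "function"
    namespace: str
    name: str
    bases: tuple = ()  # names of direct base classes, if any


# Hypothetical index: module -> short description of what it is for.
MODULE_INTENTS = {
    "json/parser": "parse json text into values and report syntax errors",
    "json/writer": "serialize values to json text with indentation",
    "net/socket": "open tcp connections and send bytes",
}

# Hypothetical declarations extracted from the syntax tree.
SYMBOLS = [
    Symbol("json/parser", "class", "json", "Parser"),
    Symbol("json/parser", "class", "json::detail", "StrictParser", ("Parser",)),
    Symbol("json/parser", "function", "json", "parse"),
    Symbol("json/writer", "function", "json", "write"),
    Symbol("net/socket", "function", "net", "parse"),  # same name, unrelated module
]


def retrieve_modules(issue: str, top_k: int = 1) -> list:
    """Stage 1 stand-in: rank modules by word overlap with the issue text."""
    words = set(issue.lower().split())
    ranked = sorted(
        MODULE_INTENTS,
        key=lambda m: len(words & set(MODULE_INTENTS[m].split())),
        reverse=True,
    )
    return ranked[:top_k]


def query_symbols(modules, kind=None, name=None, derived_from=None) -> list:
    """Stage 2 stand-in: filter declarations by structure, not by raw text."""
    return [
        s for s in SYMBOLS
        if s.module in modules
        and (kind is None or s.kind == kind)
        and (name is None or s.name == name)
        and (derived_from is None or derived_from in s.bases)
    ]


issue = "Parser fails to report syntax errors in malformed json text"
modules = retrieve_modules(issue)
parse_functions = query_symbols(modules, kind="function", name="parse")
subclasses = query_symbols(modules, kind="class", derived_from="Parser")
print(modules, [s.namespace for s in parse_functions], [s.name for s in subclasses])
```

In this toy index, a lookup by the bare name parse would return both json::parse and net::parse. Narrowing by intent first and then querying by kind, name, and base class leaves only the declarations relevant to the issue.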

The data reveals why this approach succeeds where others fail. In ablation studies, removing the semantic code-intent retrieval component caused performance to drop from 25.58% to 19.37%, while disabling AST-structured querying resulted in an even larger reduction to 17.05%. These components also improve efficiency, with the Patch Agent requiring an average of 28.1 turns to synthesize patches in the full system, compared to 35.3 turns without semantic retrieval and 45.3 turns without structural querying. Behavioral analysis shows that the system successfully reproduces issues 28.81% of the time, identifies the correct file containing defects 55.10% of the time, and pinpoints the exact function 42.10% of the time, with these localization capabilities directly enabling the final resolution rate.
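To put the ablation numbers side by side, the short sketch below computes how many percentage points each removal costs and how many more turns the Patch Agent needs. The inputs are the reported figures; the derived deltas are simple arithmetic and do not come from the researchers.

```python
# Reported ablation results: (resolution rate in %, average Patch Agent turns).
ABLATIONS = {
    "full system": (25.58, 28.1),
    "without semantic retrieval": (19.37, 35.3),
    "without AST querying": (17.05, 45.3),
}

full_rate, full_turns = ABLATIONS["full system"]


def ablation_cost(variant: str) -> tuple:
    """Return (percentage points lost, relative increase in turns in %)."""
    rate, turns = ABLATIONS[variant]
    points_lost = round(full_rate - rate, 2)
    extra_turns_pct = round((turns - full_turns) / full_turns * 100, 1)
    return points_lost, extra_turns_pct


for variant in ABLATIONS:
    if variant != "full system":
        print(variant, ablation_cost(variant))
```

By this arithmetic, dropping semantic retrieval costs about 6.2 points and roughly a quarter more turns, while dropping AST querying costs about 8.5 points and roughly 60% more turns.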

This development has significant implications for software development and maintenance. C++ remains foundational to performance-critical applications across industries, from finance to automotive systems, where manual bug fixing consumes substantial developer time and resources. The system's ability to handle complex C++ features like overloaded identifiers, nested namespaces, template instantiations, and deep inheritance hierarchies means it could accelerate maintenance cycles for large codebases that have resisted previous automation attempts. By releasing InfCode-C++ as open-source software, the researchers have provided a foundation that other teams can build upon, potentially leading to more reliable systems in domains where C++ predominates.

Despite these advances, important limitations remain. The system successfully reproduces issues less than 30% of the time, indicating that many C++ bugs involve complex runtime contexts, build-system configurations, or platform-specific behaviors that aren't fully captured in issue descriptions. Localization accuracy also decreases from file-level (55.10%) to function-level (42.10%), showing that fine-grained defect identification remains challenging even with structural analysis. The evaluation is limited to five C++ libraries from a single benchmark, and results might differ in other domains like embedded systems or mixed-language repositories. Additionally, the system's performance depends on the underlying language model, with GPT-5 achieving 25.58% resolution compared to 13.20% with DeepSeek-V3, though both outperform their respective baselines.

About the Author

Guilherme A.

Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.
