A new automated system can fix serious bugs in the Linux operating system kernel for less than twenty cents each, using large language models (LLMs) in a way that rivals more complex and expensive approaches. Researchers from the University of California, Riverside, and Columbia University have developed RGym, a lightweight framework that enables AI to repair memory corruption vulnerabilities in the kernel, the core of the operating system that manages hardware and security, with surprising effectiveness and minimal cost. The work addresses a critical gap in automated program repair (APR), where previous methods either struggled with low success rates or relied on costly cloud infrastructure, and it makes kernel debugging more accessible and affordable for developers and researchers.
Key findings from the study show that the approach achieves pass rates of up to 43.36% on a dataset of 143 verified kernel bugs using GPT-5 Thinking, while keeping costs under $0.20 per bug. This performance is notable compared to prior work such as kGym, which reported only 5.38% success with oracle assistance, and CrashFixer, which achieved 65.6% but at a high cost of $21.62 per bug. The researchers demonstrated that simpler designs, combined with practical localization techniques, can yield results comparable to more complex systems without unrealistic assumptions, such as knowing in advance exactly which files to patch. For instance, their Simple Agent with feedback-driven retries reached a 37.76% pass rate using GPT-4o, and the combined success across different configurations hit 68.53%, highlighting the value of diversification in repair strategies.
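To make these percentages concrete, the short sketch below converts the reported pass rates back into bug counts on the 143-bug dataset and estimates spend per successful repair. It rests on assumptions the article does not confirm: that the quoted cost is per attempted bug rather than per fixed bug, and that the "combined" figure counts a bug as solved if any configuration fixes it. The function names are illustrative and do not come from the paper.

```python
TOTAL_BUGS = 143  # dataset size reported in the article


def solved_count(pass_rate_pct: float, total: int = TOTAL_BUGS) -> int:
    """Convert a reported pass rate (percent) back to a whole number of bugs."""
    return round(pass_rate_pct / 100 * total)


def cost_per_fix(cost_per_bug: float, pass_rate_pct: float) -> float:
    """Expected spend per successful repair.

    Assumption: cost_per_bug is charged for every attempted bug,
    whether or not the repair succeeds.
    """
    return cost_per_bug / (pass_rate_pct / 100)


def combined_pass_rate(solved_sets, total: int) -> float:
    """Pass rate (percent) when a bug counts as solved if ANY configuration fixes it."""
    return 100 * len(set().union(*solved_sets)) / total


print(solved_count(43.36))                   # GPT-5 Thinking configuration
print(solved_count(68.53))                   # all configurations combined
print(round(cost_per_fix(0.20, 43.36), 2))   # upper bound, since cost is "under $0.20"
```

Under these assumptions, 43.36% corresponds to 62 of 143 bugs and the combined 68.53% to 98 bugs, while the $0.20 ceiling works out to at most about $0.46 per successful fix. Applying the same formula to CrashFixer's reported figures gives roughly $32.96 per fix, though the article does not say whether those figures were measured on the same dataset.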
The methodology centers on RGym, a platform-agnostic evaluation framework that runs on local commodity hardware, unlike earlier systems that depend on Google Cloud Platform. RGym automates kernel compilation and testing, using Docker for dependency management and QEMU for virtual machines, which streamlines the process and reduces the need for extensive domain knowledge. The APR pipeline incorporates two main agents: a Simple Agent that uses bug-type-specific instructions and a Function Exploration Agent that lets the LLM request additional code definitions to understand root causes. A critical innovation is realistic localization: instead of relying on oracles, the pipeline uses bug-inducing commits (BICs) and call stacks from crash reports, and BICs can be obtained with tools like SymBisect, which achieves 75% accuracy. Patches are generated function-wise: the LLM lists candidate functions, receives their definitions, and returns patched versions, which reduces errors from imprecise diff generation.
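The function-wise patching flow can be illustrated with a minimal sketch. This is not RGym's implementation: the two callables stand in for LLM calls, the `definitions` index is assumed to have been extracted beforehand by some code-indexing step, and all names here are hypothetical. The point is only that replacing whole function bodies avoids asking the model to produce exact diff hunks.

```python
from typing import Callable, Dict, List


def function_wise_patch(
    source: str,
    definitions: Dict[str, str],
    list_candidates: Callable[[List[str]], List[str]],
    rewrite_function: Callable[[str, str], str],
) -> str:
    """Replace whole function definitions instead of applying a diff.

    definitions maps a function name to its exact text in source
    (assumed to come from a separate indexing step).
    list_candidates and rewrite_function are stand-ins for model calls.
    """
    patched = source
    for name in list_candidates(sorted(definitions)):
        old_def = definitions.get(name)
        if old_def is None or old_def not in patched:
            continue  # the model named a function we cannot locate; skip it
        new_def = rewrite_function(name, old_def)
        patched = patched.replace(old_def, new_def, 1)
    return patched


# Toy usage with hard-coded stand-ins for the model.
GET_DEF = "int get(int *buf, int i)\n{\n\treturn buf[i];\n}\n"
SIZE_DEF = "int size(void)\n{\n\treturn 4;\n}\n"
SOURCE = GET_DEF + "\n" + SIZE_DEF


def fake_list_candidates(known: List[str]) -> List[str]:
    # Pretend the model suspects "get" and also hallucinates a name.
    return ["get", "not_in_index"]


def fake_rewrite(name: str, old_def: str) -> str:
    # Pretend the model adds a bounds check before the array access.
    return old_def.replace(
        "\treturn buf[i];",
        "\tif (i < 0 || i >= 4)\n\t\treturn -1;\n\treturn buf[i];",
    )


PATCHED = function_wise_patch(
    SOURCE, {"get": GET_DEF, "size": SIZE_DEF}, fake_list_candidates, fake_rewrite
)
print(PATCHED)
```

Because each replacement is keyed on the original definition text, a malformed or unknown function name simply results in no change rather than a patch that fails to apply.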
The analysis, detailed in Table 1 of the paper, shows that function-wise patching alone improved kGym's success rate from 2.8% to 10.49% by cutting bad patches by 76%. Adding BIC-based localization boosted the Simple Agent's pass rate to 21.67%, and incorporating feedback for up to three retries increased it further to 37.76%, with 23.77% of bugs solved on the first attempt. The study also found that state-of-the-art LLMs made a clear difference: GPT-5 Thinking achieved the highest pass rate at a low cost, while Claude Opus 4.1 reached 32.16% but was more expensive at $0.73 per bug. An ablation study isolated the contributions of the individual components, showing that feedback and retries provided diminishing returns beyond three attempts, and that compilation failures dropped sharply with advanced models, indicating more consistent code generation.
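A feedback-driven retry loop of the kind described here might look like the sketch below. It is an illustration under stated assumptions, not the paper's code: "up to three retries" is read as one initial attempt plus at most three more, the feedback is assumed to be the build or crash log from the failed attempt, and `RunResult`, `generate_patch`, and `build_and_test` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class RunResult:
    compiled: bool  # did the patched kernel build?
    crashed: bool   # did the reproducer still trigger the crash?
    log: str        # build or crash output to feed back to the model


def repair_with_feedback(
    generate_patch: Callable[[Optional[str]], str],
    build_and_test: Callable[[str], RunResult],
    max_retries: int = 3,
) -> Tuple[Optional[str], int]:
    """Return (passing patch or None, number of attempts used)."""
    feedback: Optional[str] = None
    attempts = 1 + max_retries  # assumption: retries are in addition to the first try
    for attempt in range(1, attempts + 1):
        patch = generate_patch(feedback)
        result = build_and_test(patch)
        if result.compiled and not result.crashed:
            return patch, attempt
        reason = "compilation failed" if not result.compiled else "reproducer still crashes"
        feedback = f"{reason}:\n{result.log}"
    return None, attempts


# Toy usage: the first patch fails to compile, the second one passes.
def fake_generate(feedback: Optional[str]) -> str:
    return "patch_v1" if feedback is None else "patch_v2"


def fake_build_and_test(patch: str) -> RunResult:
    if patch == "patch_v1":
        return RunResult(compiled=False, crashed=False, log="error: undeclared identifier")
    return RunResult(compiled=True, crashed=False, log="")


print(repair_with_feedback(fake_generate, fake_build_and_test))
```

Note that a patch passing this loop only shows the crash no longer reproduces; it says nothing yet about whether the fix is semantically correct.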
This research has significant implications for software development and cybersecurity, as it makes kernel bug repair more practical and cost-effective, potentially accelerating fixes for critical vulnerabilities like out-of-bounds memory access and use-after-free errors. By operating locally and avoiding expensive cloud dependencies, RGym lowers barriers for researchers and organizations with limited budgets, promoting wider testing and innovation in automated repair. The findings suggest that complex, high-cost APR methods may not be necessary for many kernel bugs, encouraging a shift toward simpler, modular approaches that can be combined for better results. This could lead to more resilient operating systems and reduced security risks, as faster patching helps protect against exploits in core system components.
Limitations of the study include the dataset's focus on 143 KASAN bugs from Syzbot, which, while severe, may not represent all kernel bug types, and the reliance on BICs, which are not always available for unpatched bugs. The paper also notes that crash prevention alone is not sufficient to verify patch correctness: manual analysis showed that only 32.23% of patches were plausibly correct, similar to the rates reported by CrashFixer. Additionally, the evaluation did not test against CrashFixer directly because it is closed source, leaving open questions about whether its complex strategy is necessary. Future work could expand to other bug categories and improve localization accuracy, but the current framework sets a strong foundation for affordable, effective kernel repair.
About the Author
Guilherme A.
Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.