
AI Tool Finds Hundreds of Hidden Bugs in AI Systems

A new automated system uncovers critical memory safety flaws in the CUDA kernels that power large language models, preventing crashes and potential security breaches.

AI Research
March 27, 2026
4 min read
The widespread adoption of large language models (LLMs) has made GPU-accelerated inference a critical part of modern computing infrastructure, powering everything from chatbots to scientific computing. However, the CUDA kernels that execute core transformer operations on GPUs are highly susceptible to memory-safety bugs due to complex tensor layouts, intricate memory indexing, and massive thread-level parallelism. These bugs can corrupt model weights, crash entire inference services, or even enable adversarial attacks that compromise systems. Existing techniques for detecting such flaws either depend on unavailable hardware, incur high runtime overhead, or fail to handle the variable inputs typical in LLM inference, leaving a significant security gap in production AI systems.
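The kind of indexing bug described above can be illustrated with a short sketch. The snippet below (an illustrative example, not code from the paper) emulates the 32-bit integer arithmetic a CUDA kernel typically uses for flattened tensor indices, showing how a large sequence length silently wraps the index negative; the function names are hypothetical.

```python
import ctypes

def to_int32(x):
    """Emulate C's int32 wraparound, as in a CUDA kernel that uses `int` indices."""
    return ctypes.c_int32(x).value

def flat_index(row, col, row_stride):
    # Typical kernel index: idx = row * row_stride + col, computed in int32.
    return to_int32(to_int32(row * row_stride) + col)

# Small tensors behave as expected:
assert flat_index(10, 3, 4096) == 10 * 4096 + 3

# A long token sequence with a wide hidden dimension overflows int32:
row, row_stride = 600_000, 4096        # 600k tokens, hidden size 4096
idx = flat_index(row, 5, row_stride)
assert idx < 0                         # the index wrapped: an out-of-bounds access
```

On a GPU, a negative or wrapped index like this reads or writes memory far outside the tensor, which is exactly the "illegal memory access" failure mode the article describes.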

Researchers have developed Model2Kernel, the first practical system for automatically verifying the memory safety of CUDA kernels used in LLM inference. The system performs model-aware dynamic analysis to determine how each model invokes kernels and classifies kernel arguments as either fixed by the model architecture or controlled by model users. Using this information, Model2Kernel applies CUDA-specialized symbolic execution, supported by new abstractions for dynamic tensor memory and thread identifiers, to accurately pinpoint memory bugs in kernels. In evaluations, it discovered 353 previously unknown bugs while producing only nine false positives, a false-positive rate of 2.49%, demonstrating its effectiveness in real-world scenarios.

The methodology behind Model2Kernel involves two major components working in tandem. First, HFProbe acts as a dynamic model profiler that analyzes how models from platforms like Hugging Face invoke CUDA kernels. It executes each model without requiring GPU hardware, profiling kernel invocations and automatically identifying which inputs are determined by the model architecture and which are user-controlled. This component also mutates model configurations to trigger more kernels, ensuring comprehensive analysis. Second, cuKLEE is a symbolic execution engine specialized for CUDA kernels. It models tensors with dynamic shapes as distinct memory regions and introduces symbolic variables for tensor properties like base address and dimensions, enabling it to handle more than 100 distinct tensor shapes by adding appropriate constraints. Additionally, it models CUDA thread identifiers as symbolic variables, allowing unified symbolic execution that captures both thread-shared and thread-specific computations in a single pass, scaling to thousands of threads.
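The thread-identifier abstraction can be sketched in miniature. The toy function below (a simplified assumption about how such reasoning works, not cuKLEE's actual implementation) treats the thread id as a symbolic value ranging over all threads and checks the worst-case access with a single comparison, rather than executing the kernel once per thread; all names are hypothetical.

```python
def check_bounds_symbolic(num_threads, elems_per_thread, buf_len):
    """Treat the thread id `tid` as symbolic in [0, num_threads).

    For an access pattern buf[tid * elems_per_thread + k] with
    k in [0, elems_per_thread), the worst case is the last thread's
    last element, so one comparison covers every thread at once.
    Returns True if the access is in bounds for all threads.
    """
    worst_index = (num_threads - 1) * elems_per_thread + (elems_per_thread - 1)
    return worst_index < buf_len

# 65,536 threads each touching 4 elements of a 262,144-element buffer: safe.
assert check_bounds_symbolic(65_536, 4, 262_144)
# Shrink the buffer by one element and the last thread goes out of bounds.
assert not check_bounds_symbolic(65_536, 4, 262_143)
```

This is why symbolic thread ids scale: the check is O(1) in the number of threads, whereas enumerating each thread concretely would cost a pass per thread.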

For evaluation, Model2Kernel was tested on CUDA kernels and models from three sources: all text-generation models in vLLM and related kernels, all Hugging Face models with custom kernels, and models and kernels from four recent research publications. In total, it identified 353 memory bugs, including 328 integer overflows and 25 out-of-bounds accesses. For example, it detected an integer overflow in a kernel where the product of token count and hidden size exceeded the int32 limit, leading to subsequent out-of-bounds accesses. The system also compared favorably with baseline techniques like Honeycomb, GKLEE, and ESBMC-GPU on 20 known vLLM CUDA kernel bugs, successfully detecting 15 of them, far more than the alternatives. An ablation study confirmed that both HFProbe and cuKLEE are critical for achieving this level of effectiveness, as removing either component significantly reduced bug detection or increased false positives.

The implications of this research are substantial for the reliability and security of AI infrastructure. Memory-safety bugs in CUDA kernels can lead to severe consequences, such as triggering 'illegal memory access' errors that crash inference services or enabling exploits that modify arbitrary GPU memory or achieve arbitrary code execution. Since inference systems interact with external users and may receive attacker-controlled inputs, these vulnerabilities pose a serious security risk. Model2Kernel enhances safety by allowing model developers to verify that existing CUDA kernels correctly support new model architectures and expected input token counts, and by enabling CUDA developers to ensure new kernel optimizations do not introduce memory bugs while remaining compatible with deployed models. This proactive detection helps prevent system compromises and service disruptions in production environments.

Despite its effectiveness, Model2Kernel has limitations that point to future research directions. It currently focuses on Hugging Face models and CUDA kernels with tensor inputs, which cover a large portion of real-world usage but do not support components like Triton-based kernels or TensorFlow models. Additionally, while configuration mutation increases coverage, the collected contexts still represent only a small fraction of real LLM inference scenarios, suggesting a need for more systematic methods to explore broader, realistic contexts. In the experiments, cuKLEE analyzes each kernel for up to one hour, but for certain complex kernels not all paths are explored, potentially missing some memory bugs. Optimizing path exploration strategies specifically for CUDA programming patterns could improve efficiency and coverage in future iterations of the tool.

Original Source

Read the complete research paper

View on arXiv

About the Author

Guilherme A.

Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.

Connect on LinkedIn