
AI Code Assistants Struggle to Write Secure Software

A new benchmarking system reveals that large language models often generate functional code riddled with security vulnerabilities, exposing critical gaps in AI-assisted software development.

AI Research
March 26, 2026
4 min read

As artificial intelligence becomes a standard tool for writing software, a fundamental question remains unanswered: can AI-generated code be trusted to be secure? A new study introduces DUAL GAUGE, the first automated system to rigorously evaluate both the functional correctness and security of code produced by large language models (LLMs). The results reveal a troubling reality: while models like GPT-5 can generate code that works correctly about half the time, they often fail dramatically on security, with only about 12% of their outputs being both functional and secure. This gap highlights a significant risk as developers increasingly rely on AI assistants, potentially introducing vulnerabilities into production software without realizing it.

The researchers developed DUAL GAUGE-BENCH, a benchmark suite of 154 diverse coding tasks, each paired with comprehensive test suites for both functionality and security. Unlike previous benchmarks that measured these aspects separately or lacked security tests entirely, this new dataset enables joint evaluation, ensuring that code must satisfy all requirements simultaneously. The security tests are designed to check for common vulnerabilities like SQL injection, cross-site scripting, and unsafe resource handling, based on established guidelines such as OWASP practices and CERT standards. The functional tests cover normal operations, edge cases, and error conditions, with an average of about 6.5 tests per task for each category, ensuring thorough coverage of specification requirements.
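
To make the pairing of functional and security tests concrete, here is a minimal illustrative sketch, not drawn from the benchmark itself: a hypothetical task asking for a user-lookup function, with one functional check and one SQL-injection check in the spirit of the OWASP guidance the suite references. All names, tests, and data here are assumptions for illustration only.

```python
import sqlite3

# Hypothetical task: "return the user row for a given username".
# A naive solution that passes functional tests but fails security tests:
def get_user_unsafe(conn, username):
    # Vulnerable: string interpolation allows SQL injection.
    return conn.execute(
        f"SELECT id, name FROM users WHERE name = '{username}'"
    ).fetchall()

# A version that satisfies both requirements:
def get_user(conn, username):
    # Parameterized query prevents injection (OWASP-recommended practice).
    return conn.execute(
        "SELECT id, name FROM users WHERE name = ?", (username,)
    ).fetchall()

def run_checks():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")
    conn.executemany("INSERT INTO users VALUES (?, ?)", [(1, "alice"), (2, "bob")])

    # Functional test: a normal lookup returns the expected row.
    assert get_user(conn, "alice") == [(1, "alice")]

    # Security test: an injection payload must not dump every row.
    payload = "alice' OR '1'='1"
    assert get_user(conn, payload) == []  # the secure version matches nothing

if __name__ == "__main__":
    run_checks()
    print("functional and security checks passed")
```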

To automate the evaluation, the DUAL GAUGE system employs an agentic executor that runs generated code in isolated sandbox environments, resolving dependencies and runtime issues to ensure tests can be executed. This is paired with an LLM-based evaluator that assesses whether the code's behavior matches expected outputs for functional tests and secure behaviors for security tests. The system was validated for accuracy, with the executor achieving 95.08% precision and 84.67% recall, and the evaluator achieving 90.54% precision and 77.91% recall, indicating reliable alignment with human judgment. This automation allows for scalable testing across thousands of scenarios, addressing the limitations of manual inspection or static analysis tools that are often inaccurate.
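
The precision and recall figures above measure how often the automated verdicts agree with human judgment. As a rough sketch of how such agreement numbers are computed, here is a small Python example; the verdict lists are made up for illustration and are not the study's data.

```python
def precision_recall(predicted, actual):
    """Precision/recall of automated verdicts against human labels.

    predicted/actual are lists of booleans: True means a "test passed" verdict.
    """
    tp = sum(p and a for p, a in zip(predicted, actual))       # agree: pass
    fp = sum(p and not a for p, a in zip(predicted, actual))   # tool says pass, human says fail
    fn = sum(not p and a for p, a in zip(predicted, actual))   # tool says fail, human says pass
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Toy example: evaluator verdicts vs. human-annotated ground truth.
evaluator_verdicts = [True, True, False, True, False, True]
human_labels       = [True, False, False, True, True, True]
p, r = precision_recall(evaluator_verdicts, human_labels)
print(f"precision={p:.2%}, recall={r:.2%}")  # precision=75.00%, recall=75.00%
```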

Benchmarking ten leading LLMs, including GPT-5, Claude models, Gemini, and open-source options like Qwen3, revealed critical insights. GPT-5 achieved a pass@1 score of 50.65% for functional correctness but only 11.69% for secure correctness (secpass@1), showing a dramatic decline when both requirements must be met. Other models, such as Claude-Sonnet-4.5, dropped from 46.10% to 4.55%, and GPT-4 fell to just 0.65% secpass@1. The study also found that security does not scale linearly with model size; for the Qwen3 family, security performance plateaued beyond 4 billion parameters, with minimal improvements up to 32 billion. Additionally, certain quantization settings, like FP8, unexpectedly improved security over full-precision baselines, while others degraded it, and reasoning mechanisms showed non-monotonic effects: medium levels were often optimal, but excessive reasoning sometimes introduced vulnerabilities.
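
The two headline metrics differ only in what counts as a pass: pass@1 requires a single sampled solution to clear all functional tests, while secpass@1 additionally requires it to clear all security tests. A minimal sketch of that joint criterion follows; the result structure is an assumption for illustration, not the paper's actual data format.

```python
def pass_at_1(results, require_secure=False):
    """Fraction of tasks whose single sampled solution passes all functional
    tests, and additionally all security tests when require_secure is True."""
    def ok(r):
        functional = all(r["functional_tests"])
        secure = all(r["security_tests"])
        return functional and (secure or not require_secure)
    return sum(ok(r) for r in results) / len(results)

# Toy results for three tasks (one sampled solution each).
results = [
    {"functional_tests": [True, True], "security_tests": [True]},   # works, secure
    {"functional_tests": [True, True], "security_tests": [False]},  # works, insecure
    {"functional_tests": [False],      "security_tests": [True]},   # broken
]
print("pass@1    =", pass_at_1(results))                       # ~0.67
print("secpass@1 =", pass_at_1(results, require_secure=True))  # ~0.33
```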

These findings have significant implications for software development practices. They suggest that current AI coding assistants, while accelerating productivity, may inadvertently compromise security, as models prioritize functional completion over vulnerability prevention. The research indicates that standard instruction tuning, which improves task following, can reduce joint security-functionality performance, highlighting the need for security-aware training paradigms. For practitioners, the study recommends careful model selection based on security requirements, with GPT-5 offering the best security performance but smaller models like Qwen3-8B providing a cost-effective open-source alternative. The release of DUAL GAUGE as an open-source tool aims to foster progress by enabling reproducible, rigorous evaluation, helping developers and researchers better understand and mitigate the risks of AI-generated code.

Despite its advancements, the study acknowledges limitations. The LLM-based evaluator, while accurate, may misclassify about 9% of positive assessments and 22% of correct behaviors, and the agentic executor's trace fidelity, though high, is not perfect. The test suites, though comprehensive, cannot cover all possible vulnerabilities or behaviors, and execution non-determinism in programs may affect outcomes. Additionally, the benchmark focuses on code generation from natural language prompts, not code completion, and while it is language-agnostic in design, the current evaluation primarily involved Python and similar languages. These constraints underscore the need for ongoing refinement and highlight that human oversight and traditional security practices remain essential complements to AI-assisted development.

Original Source

Read the complete research paper

View on arXiv

About the Author

Guilherme A.

Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.

Connect on LinkedIn