Your AI Assistant Is Writing Buggy Code

When developers use AI assistants like GitHub Copilot or ChatGPT to write code, they might assume the biggest security risk comes from hackers trying to trick the AI. But new research reveals a more common danger: the way we ask for code in the first place. A study from Peking University shows that poorly written prompts—even when created with good intentions—dramatically increase the likelihood that AI will generate vulnerable code.

The researchers discovered a clear correlation between prompt quality and code security. They developed a framework to measure what they call "prompt normativity" across three dimensions: goal clarity, completeness, and logical consistency. When prompts were vague, incomplete, or contradictory, the AI-generated code contained significantly more security vulnerabilities across all major AI models tested.

The team created CWE-BENCH-PYTHON, a comprehensive benchmark containing 165 coding tasks across 33 common security weakness categories. Each task was paired with four levels of prompts—from highly normative (L0) that provided clear, complete specifications to highly non-normative (L3) that were vague and sometimes contradictory. They tested ten AI models including GPT-4o, Gemini, Claude, and several open-source alternatives.
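
To make the contrast between the two ends of that scale concrete, here is a minimal sketch of how one task might be paired with prompts at different levels. The task, the function name read_user_file, and both prompt texts are hypothetical illustrations written for this article, not entries from CWE-BENCH-PYTHON, and the intermediate levels L1 and L2 are omitted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PromptVariant:
    level: str  # "L0" is the most normative level, "L3" the least
    text: str


# Hypothetical task: both prompts ask for the same file-reading helper.
FILE_READ_TASK = [
    PromptVariant(
        "L0",
        "Write a Python function read_user_file(base_dir, filename) that "
        "returns the file's text. Reject any filename that resolves outside "
        "base_dir and raise ValueError in that case.",
    ),
    PromptVariant(
        "L3",
        "make something that reads whatever file the user picks, keep it "
        "simple but also check everything",
    ),
]

for variant in FILE_READ_TASK:
    print(f"{variant.level}: {variant.text}")
```

The L0 text states the goal, the inputs, and the security constraint. The L3 text leaves the interface open and asks for two things that pull against each other, which touches all three dimensions the framework measures.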

The results were striking. As prompt quality decreased from L0 to L3, vulnerability rates rose sharply. For complex security tasks like access control (CWE-284), vulnerability rates jumped from 13.59% at L0 to 49.84% at L3, a relative increase of roughly 267%. Similarly, for exception handling (CWE-697), rates rose from 23.12% to 58.59%, a relative increase of about 153%. The pattern held across all models, with larger models showing even more pronounced effects.
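
The size of those jumps can be checked with a few lines of arithmetic. The sketch below uses only the four rates quoted above; the function name relative_increase is chosen here for illustration.

```python
def relative_increase(before: float, after: float) -> float:
    """Return the percentage increase from `before` to `after`."""
    return (after - before) / before * 100


# Vulnerability rates (percent) at L0 and L3, as quoted in the article.
reported_rates = {
    "CWE-284": (13.59, 49.84),
    "CWE-697": (23.12, 58.59),
}

for cwe, (l0, l3) in reported_rates.items():
    points = l3 - l0
    print(
        f"{cwe}: +{points:.2f} percentage points, "
        f"{relative_increase(l0, l3):.1f}% relative increase"
    )
```

Percentage points and relative increase answer different questions: the first is the raw gap between the two rates, the second is how large that gap is compared with the starting rate.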

The researchers explain this phenomenon through what they call the "path of least resistance" principle. When faced with clear specifications, AI models adopt a professional engineering mindset. But with vague prompts, the AI must guess at requirements and defaults to the simplest implementation, which is often the least secure. For example, when asked to handle user input without clear sanitization instructions, an AI might build a query by simple string concatenation instead of validating the input and passing it as a parameter.
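
As an illustration of that string-concatenation shortcut, here is a small sketch using Python's standard sqlite3 module. The table, the function names, and the validation rule are assumptions made for this example; the study's own generated code is not reproduced here.

```python
import sqlite3


def make_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with two illustrative users."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
    conn.executemany("INSERT INTO users (name) VALUES (?)", [("alice",), ("bob",)])
    conn.commit()
    return conn


def find_user_unsafe(conn: sqlite3.Connection, username: str) -> list:
    # The "simplest" implementation: user input becomes part of the SQL text.
    query = "SELECT id, name FROM users WHERE name = '" + username + "'"
    return conn.execute(query).fetchall()


def find_user_safe(conn: sqlite3.Connection, username: str) -> list:
    # Assumed validation rule for this sketch: short alphanumeric names only.
    if not username.isalnum() or len(username) > 32:
        raise ValueError("invalid username")
    # The value is passed as a bound parameter, never spliced into the SQL.
    return conn.execute(
        "SELECT id, name FROM users WHERE name = ?", (username,)
    ).fetchall()


conn = make_demo_db()
payload = "x' OR '1'='1"

# The crafted input rewrites the WHERE clause and matches every row.
print(find_user_unsafe(conn, payload))

# The safe version returns only the requested user and rejects the payload.
print(find_user_safe(conn, "alice"))
try:
    find_user_safe(conn, payload)
except ValueError as exc:
    print("rejected:", exc)
```

A prompt that names the validation rule and asks for parameterized queries leaves the model far less room to fall back on the first version.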

For everyday developers, this means that writing clear, complete requirements isn't just about getting the right functionality. It is also essential for security. The way we phrase our requests to AI coding assistants directly affects the safety of the resulting code. This finding shifts the focus from the AI's internal capabilities alone to the quality of the human-AI interaction.

The study also tested mitigation strategies. Chain-of-Thought prompting, where the AI breaks down problems step by step, significantly reduced vulnerability rates, especially for complex tasks. Self-correction methods, where the AI reviews and refines its own code, also showed protective effects, though they were less consistent across scenarios.
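
Both strategies can be layered on top of an ordinary prompt. The sketch below is one possible shape, not the study's implementation: the prompt wording is invented for illustration, and generate is a hypothetical stand-in for whatever function sends a prompt to a model and returns its text.

```python
from typing import Callable

# Hypothetical instruction text; the study's exact wording is not given here.
COT_PREFIX = (
    "Before writing code, reason step by step: list the inputs, the trust "
    "boundaries, and the security checks the task requires. Then write the code."
)


def with_chain_of_thought(task_prompt: str) -> str:
    """Wrap a task prompt with a step-by-step reasoning instruction."""
    return f"{COT_PREFIX}\n\nTask:\n{task_prompt}"


def self_correct(
    generate: Callable[[str], str], task_prompt: str, rounds: int = 1
) -> str:
    """Generate code, then ask the model to review and revise it `rounds` times."""
    code = generate(task_prompt)
    for _ in range(rounds):
        review_prompt = (
            "Review the following code for security vulnerabilities and "
            "return a corrected version.\n\n"
            f"Original task:\n{task_prompt}\n\nCode:\n{code}"
        )
        code = generate(review_prompt)
    return code


def echo_model(prompt: str) -> str:
    """Stub model for demonstration; a real one would call an AI service."""
    return f"# code generated from a {len(prompt)}-character prompt"


print(with_chain_of_thought("Write a function that saves an uploaded file."))
print(self_correct(echo_model, "Write a function that saves an uploaded file."))
```

Passing the model in as a plain callable keeps the sketch independent of any particular vendor's client library.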

While the research focused on single-function Python code, the implications extend to broader software development practices. As AI coding assistants become increasingly integrated into development workflows, this study highlights that improving prompt quality represents a practical, immediate strategy for enhancing code security without requiring changes to the underlying AI models themselves.

About the Author

Guilherme A.