AI Code Models Fail to Follow Developer Instructions

TL;DR

A new benchmark shows top AI models often ignore developer preferences when adjusting code, revealing a key gap in real-world usability and trust.

As artificial intelligence becomes integral to software development, ensuring that AI-generated code aligns with human preferences is crucial for productivity and reliability. A new study from Apple researchers introduces a benchmark that evaluates how well large language models (LLMs) follow developer instructions for code adjustments, revealing significant shortcomings in current systems. The key finding is that models frequently fail to adhere to stylistic and structural changes requested by developers, even when the code is functionally correct. For example, a model might generate a list comprehension when a developer explicitly asks for a loop for better readability. This gap persists across leading models like GPT-5 and Gemini, indicating that instruction-following is a distinct challenge separate from code correctness.

The researchers developed this benchmark by first creating a catalog of 228 verified instructions sourced from real developers. They conducted a user study with 30 experienced programmers who compared code pairs and provided natural-language instructions to transform one version into their preferred one. These instructions were categorized into types such as cosmetic (e.g., improving comments), structural (e.g., using loops instead of comprehensions), and semantic (e.g., changing algorithms). The benchmark then tests models in two settings: predefined instructions, where constraints are embedded in the initial prompt, and follow-up instructions, where adjustments are requested after code generation. Models are evaluated using an automated verifier that checks if the instruction was followed, with human validation showing 87% agreement.

Results from testing 10 models across Python, Java, and JavaScript show that models perform better with follow-up instructions than predefined ones, with improvements of up to 24% in some cases. For instance, in Python, the median success rate for follow-up instructions was 0.181 higher than for predefined instructions. However, performance varies by instruction type: structural changes are handled best, while semantic and cosmetic adjustments see lower success rates. Models like GPT-5 and GPT-5 mini lead in performance but still exhibit failures, such as not adding test cases when instructed, as shown in Figure 6 where Claude Sonnet 4 failed but GPT-5 mini succeeded. Radar plots in Figures 7-9 illustrate these differences across model families, with shaded areas indicating standard error.

This research matters because it addresses a real-world need: developers rely on AI not just for correct code but for code that matches their team's standards and maintainability goals. In industries from tech to finance, inconsistent code can lead to higher debugging costs and reduced collaboration. The benchmark's modular design allows for ongoing updates, helping track progress as models improve. Limitations include the benchmark's focus on standalone code problems rather than large codebases, and the potential for bias in automated verification, though the study used multiple judges to mitigate this. Future work could expand to multi-turn interactions, enabling models to refine code iteratively based on feedback.

About the Author

Guilherme A.

Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.

Connect on LinkedIn