As large language models (LLMs) become integral to software development, the code they produce is increasingly used in real-world applications. However, this generated code often suffers from structural issues known as code smells—patterns that harm readability, maintainability, and design integrity. These problems can lead to costly fixes and reduced software longevity, making it crucial for developers and organizations to understand and address them. A new study provides a systematic approach to measuring, explaining, and mitigating these smells, offering practical solutions that could improve the reliability of AI-assisted coding.
The researchers discovered that the Propensity Smelly Score (PSC), a metric estimating the likelihood of LLMs generating specific code smells, serves as a robust indicator of structural quality. By analyzing PSC across various conditions, they found that it remains stable under semantic-preserving code transformations for 76% of smell types, such as C0116 (missing-function-docstring) and R0917, with p-values of 1 and effect sizes near zero. This stability suggests PSC captures deep generative tendencies rather than superficial token patterns. Additionally, PSC showed higher information gain than traditional metrics like BLEU and CodeBLEU, meaning it better explains the presence of severe smells—instances where over 50% of the tokens in a snippet are smelly. In these comparisons, PSC consistently provided a more discriminative signal about code quality, aligning closely with structural concerns that similarity-based measures overlook.
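To make the information-gain comparison concrete, here is a minimal sketch of how entropy reduction can be computed for a score against a binary "severe smell" label. The data and the 0.5 threshold are illustrative, not taken from the paper; in the study the same calculation would be applied to PSC, BLEU, and CodeBLEU on real snippets.

```python
import math

def entropy(labels):
    """Shannon entropy of a binary label list."""
    n = len(labels)
    return -sum((labels.count(c) / n) * math.log2(labels.count(c) / n)
                for c in set(labels))

def information_gain(scores, labels, threshold):
    """Entropy reduction from splitting the labels by score >= threshold."""
    left = [l for s, l in zip(scores, labels) if s < threshold]
    right = [l for s, l in zip(scores, labels) if s >= threshold]
    n = len(labels)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(labels) - weighted

# Toy data (hypothetical): PSC per snippet, and whether the snippet was
# "severely smelly" (over 50% smelly tokens).
psc = [0.10, 0.20, 0.30, 0.15, 0.60, 0.70, 0.80, 0.90]
severe = [0, 0, 0, 0, 1, 1, 1, 1]
print(information_gain(psc, severe, threshold=0.5))  # → 1.0 (perfect split)
```

A score with higher information gain separates severe from non-severe snippets more cleanly, which is the sense in which PSC "better explains" severe smells than similarity metrics.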
To investigate how smells emerge, the team employed a causal analysis framework, examining factors like generation strategy, model size, model architecture, and prompt formulation. They used structured causal models to estimate average treatment effects (ATEs), adjusting for confounders such as lines of code, token count, and syntactic features. For instance, in the generation strategy intervention, sampling-based strategies such as top-p sampling reduced PSC for smells like C2801 (unnecessary-dunder-call), indicating lower smell propensity, while non-greedy decoding increased it for formatting-related smells like C0303 (trailing-whitespace). Model size had minimal impact, with ATEs close to zero across variants, suggesting that simply scaling parameters does not improve code quality. In contrast, model architecture showed strong effects, with negative ATEs for smells like R1705 (no-else-return) when comparing different 7B-parameter models, highlighting how design choices influence output.
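The adjustment step can be sketched with a simple backdoor-style regression: controlling for the confounders, the coefficient on a binary treatment indicator estimates the ATE. This is a minimal stand-in for the study's structured causal models; the simulated data, confounders, and the true effect of -0.10 are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000

# Simulated confounders standing in for lines of code and token count.
loc = rng.normal(50, 10, n)
tokens = loc * 8 + rng.normal(0, 20, n)

# Binary "treatment": e.g. sampling-based vs greedy decoding, with
# assignment that depends on the confounder (so naive comparison is biased).
treat = (rng.random(n) < 1 / (1 + np.exp(-(loc - 50) / 10))).astype(float)

# PSC-like outcome with a true treatment effect of -0.10.
psc = 0.5 + 0.004 * loc - 0.10 * treat + rng.normal(0, 0.05, n)

# Backdoor adjustment via least squares: the treatment coefficient,
# holding confounders fixed, estimates the average treatment effect.
X = np.column_stack([np.ones(n), treat, loc, tokens])
beta, *_ = np.linalg.lstsq(X, psc, rcond=None)
ate = beta[1]
print(f"estimated ATE: {ate:.3f}")  # close to the true -0.10
```

A negative ATE, as here, corresponds to the paper's finding that the intervention lowers smell propensity; the refutation tests mentioned later (placebo treatments, sensitivity analyses) would then probe whether such an estimate survives perturbation.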
The findings underscore that prompt design and model architecture are the most influential factors in reducing code smells. Prompt-based interventions, such as using structured prompts that explicitly avoid smells, led to significant decreases in PSC. In a mitigation case study, the median PSC for W0611 (unused-import) dropped from 0.52 to 0.23 when switching from a minimal prompt to one emphasizing clean code practices. Similarly, for W0719 (broad-exception-raised), PSC decreased from 0.80 to 0.67. These results demonstrate that actionable changes at inference time, without retraining models, can substantially improve code quality. The causal analysis confirmed these effects are robust, passing refutation tests such as placebo checks and sensitivity analyses, ensuring the estimates are reliable rather than artifacts of random variation.
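A prompt-level intervention of this kind is easy to operationalize. The sketch below builds a "clean code" variant of a task prompt by appending explicit smell-avoidance instructions keyed to pylint message codes; the wording is illustrative, not the paper's exact template.

```python
def build_prompt(task, avoid_smells=()):
    """Append explicit smell-avoidance rules (pylint codes) to a task prompt.

    With no codes given, this degrades to the minimal prompt.
    """
    if not avoid_smells:
        return task
    rules = "\n".join(f"- avoid pylint {code} ({name})"
                      for code, name in avoid_smells)
    return f"{task}\nFollow clean-code practices:\n{rules}"

# Minimal vs structured prompt for the same task.
task = "Write a Python function that parses a CSV row into a dict."
minimal = build_prompt(task)
structured = build_prompt(task, (
    ("W0611", "unused-import: import only what you use"),
    ("W0719", "broad-exception-raised: raise specific exception types"),
))
print(structured)
```

In the study's setup, generating with the structured variant and re-scoring the outputs is what produced the reported PSC drops (e.g. 0.52 to 0.23 for W0611), so the comparison requires no model retraining.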
A user study involving developers revealed that PSC aids in practical code assessment. Participants exposed to PSC scores were more likely to identify smells as systematically introduced by models, prioritize them for fixes, and feel confident in their judgments. For example, for W0612 (unused-variable), those with PSC access had a statistically significant increase in attributing the smell to model behavior (p = .002). Thematic analysis of open-ended responses showed that developers used PSC as a heuristic to resolve uncertainties, especially for subtle issues, enhancing their ability to evaluate generated code in real-world scenarios. This suggests that integrating PSC into development tools could support better decision-making and trust in AI outputs.
Despite these advances, the study has limitations. The research focused primarily on Python and decoder-only models, so its findings may not generalize to other languages or architectures without adaptation. Additionally, the causal analysis accounted for common confounders but omitted deeper design-level signals like call-graph patterns, which could introduce residual confounding. Future work could expand PSC to diverse programming ecosystems and develop comprehensive mitigation strategies beyond prompt adjustments. Overall, this research lays a foundation for embedding quality-aware evaluations into LLM deployment, helping ensure that AI-generated code is not just functional but also maintainable and robust.
Original Source
Read the complete research paper
About the Author
Guilherme A.
Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.