Large language models (LLMs) have transformed coding with tools like GitHub Copilot, but their ability to generate complete, functional classes for real software projects remains poorly understood. A new study reveals that these models excel on synthetic benchmarks but struggle dramatically with actual code, exposing a critical gap in AI-assisted programming that affects developers relying on these tools for everyday tasks.
The researchers found that LLMs achieve 84–89% correctness on established synthetic benchmarks such as ClassEval, but performance plummets to just 25–34% when generating classes for real-world projects, a drop of 53–62 percentage points. This stark disparity shows that models handle simple, isolated tasks well but fail when faced with the complexity of real codebases. Notably, models performed nearly identically on both seen and unseen real-world data, indicating that memorization of training data does not explain the gap.
To evaluate this, the team created RealClassEval, a benchmark derived from 400 open-source GitHub classes split into pre-cutoff (likely seen during training) and post-cutoff (guaranteed unseen) partitions. They tested seven state-of-the-art LLMs, including GPT-4.1, DeepSeek-V3, and Codestral, using a standardized prompt that provided class skeletons. The models generated implementations, which were then assessed for functional correctness against test suites automatically generated with Pynguin, with pass rate as the key metric.
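To make the setup concrete, the pipeline can be approximated as below. This is a minimal sketch: the prompt wording, the `Stack` skeleton, and the helper names are illustrative assumptions, not artifacts from RealClassEval itself.

```python
# Hedged sketch of a skeleton-prompt evaluation loop. The prompt text and
# the example class are assumptions, not the study's actual materials.

SKELETON = '''class Stack:
    """A simple LIFO stack."""

    def __init__(self):
        ...

    def push(self, item):
        ...

    def pop(self):
        ...
'''

def build_prompt(skeleton: str) -> str:
    # The model sees only the class skeleton and must fill in every method body.
    return ("Complete the following Python class so that all methods are "
            "fully implemented:\n\n" + skeleton)

def pass_rate(outcomes: list) -> float:
    # Fraction of generated classes whose test suite passes in full,
    # the key correctness metric described above.
    return sum(outcomes) / len(outcomes) if outcomes else 0.0

prompt = build_prompt(SKELETON)
rate = pass_rate([True, False, False, True])  # → 0.5
```

In the real benchmark, each generated class would be executed against its Pynguin-generated tests rather than hand-labeled booleans; the sketch only shows where the pass-rate number comes from.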
Results showed that comprehensive docstrings (detailed documentation of classes and methods) yielded only modest improvements of 1–3% in correctness, and these gains were rare and not statistically significant for most models. Retrieval-augmented generation (RAG), which supplies relevant code examples, proved more effective, boosting accuracy by 4–7% specifically when docstrings were incomplete. RAG helped models produce concrete implementations where specifications were absent, though it occasionally introduced dependency conflicts.
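A minimal sketch of this conditional RAG strategy follows; the `retriever` callable and its interface are hypothetical stand-ins, since the study's actual retrieval setup is not detailed here.

```python
def augment_prompt(skeleton: str, docstring_complete: bool, retriever) -> str:
    # Hedged sketch: retrieve examples only when documentation is incomplete,
    # the situation where RAG reportedly gave its 4-7% gain. The retriever
    # signature is an assumption for illustration.
    if docstring_complete:
        return skeleton  # a thorough docstring already specifies the behavior
    examples = retriever(skeleton, k=2)
    context = "\n\n".join(f"# Retrieved example:\n{ex}" for ex in examples)
    # Prepending retrieved code gives the model concrete implementations to
    # imitate, but can also carry over incompatible imports.
    return context + "\n\n" + skeleton

# Dummy retriever for demonstration only.
dummy = lambda query, k: ["def push(self, item):\n    self._items.append(item)"][:k]
prompt = augment_prompt("class Stack: ...", docstring_complete=False, retriever=dummy)
```

The gating on `docstring_complete` reflects the study's observation that retrieval pays off mainly where the specification is missing, rather than as a blanket addition to every prompt.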
Error analysis identified AttributeError, TypeError, and AssertionError as the dominant failure modes, accounting for 84% of cases. Synthetic benchmarks overemphasized assertion issues, while real-world scenarios highlighted attribute and type mismatches, underscoring that LLMs master syntax but struggle with object-oriented semantics like inheritance and method consistency. RAG reduced some errors but sometimes added new ones, such as ImportError, by copying incompatible dependencies from retrieved examples.
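This kind of failure-mode tally can be reproduced with a small harness. The failing callables below are contrived illustrations that trigger the three dominant error classes; they are not the study's test cases.

```python
from collections import Counter

def classify_failure(exc: BaseException) -> str:
    # Bucket a failure by its exception type, mirroring the taxonomy above.
    return type(exc).__name__

# Contrived callables that raise the three dominant error classes.
cases = [
    lambda: object().missing_attr,                   # AttributeError
    lambda: len(3),                                  # TypeError
    lambda: (x for x in []).throw(AssertionError),   # AssertionError
]

tally = Counter()
for case in cases:
    try:
        case()
    except Exception as exc:
        tally[classify_failure(exc)] += 1

# tally == Counter({'AttributeError': 1, 'TypeError': 1, 'AssertionError': 1})
```

In a real evaluation harness the exceptions would come from running generated classes against their test suites; only the classification step is shown here.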
For software developers and companies using AI coding assistants, these findings emphasize setting realistic expectations: LLM-generated code requires rigorous review and testing before deployment, especially for complex class-level tasks. The study suggests that future benchmarks must better reflect real-world challenges, and LLMs need enhanced training on object-oriented concepts to improve reliability in practical applications.
Limitations include the sample size of 400 classes, which, while larger than previous benchmarks, may not capture all real-world variability. The research also focused solely on Python, and findings might differ for other programming languages. Despite these limitations, the work provides a foundational understanding of LLM capabilities, urging the community to prioritize semantic correctness over syntactic performance in AI-driven development tools.
About the Author
Guilherme A.
Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.
Connect on LinkedIn