A new study reveals that most empirical research using large language models (LLMs) in software engineering cannot be reliably reproduced, raising concerns about the validity of findings in this rapidly growing field. Researchers from multiple European institutions analyzed 86 studies presented at top conferences and found that only five were even suitable for reproduction attempts, with none fully reproducible. This issue threatens the foundation of scientific progress, as irreproducible results waste resources and hinder cumulative knowledge.
The key finding is stark: of the 86 studies analyzed, 65 used OpenAI's commercial LLM services, yet only five of those could even be attempted for reproduction, and none yielded fully matching results. Two studies were partially reproducible, while three showed significant deviations. For example, in one study on code translation, the reported success rate of 79.9% for Python translations dropped to between 45.6% and 55.3% in reproduction attempts. This gap shows how fragile LLM-based research outcomes can be.
The researchers systematically analyzed studies from the 2024 editions of the International Conference on Software Engineering (ICSE) and the Automated Software Engineering (ASE) conference. They developed a containerized framework to execute replication attempts, running each experiment multiple times (up to 15 repetitions per study) to account for LLM variability. They focused on studies using OpenAI models because of their prevalence, and applied Bayesian bootstrapping to estimate confidence intervals for the reproduced results.
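To make the statistical step concrete, here is a minimal sketch of a Bayesian bootstrap over repeated runs. It is an illustration under stated assumptions, not the authors' framework: the function names, the number of posterior draws, and the 15 run values are all hypothetical (only the 45.6 to 55.3 range and the reported 79.9% come from the study as described above).

```python
import random

def bayesian_bootstrap_mean(values, draws=4000, seed=0):
    """Return posterior draws of the mean under the Bayesian bootstrap."""
    rng = random.Random(seed)
    n = len(values)
    means = []
    for _ in range(draws):
        # Dirichlet(1, ..., 1) weights: normalize n independent Exp(1) draws.
        raw = [rng.expovariate(1.0) for _ in range(n)]
        total = sum(raw)
        means.append(sum(w * v for w, v in zip(raw, values)) / total)
    return means

def interval(samples, level=0.95):
    """Equal-tailed interval taken from the sorted posterior draws."""
    ordered = sorted(samples)
    tail = (1.0 - level) / 2.0
    lo = ordered[int(tail * (len(ordered) - 1))]
    hi = ordered[int((1.0 - tail) * (len(ordered) - 1))]
    return lo, hi

# Hypothetical success rates (%) from 15 repeated runs; NOT data from the study.
runs = [45.6, 47.2, 48.9, 50.1, 51.0, 52.3, 49.5, 46.8,
        53.4, 55.3, 50.7, 48.2, 51.9, 47.7, 52.8]
reported = 79.9  # originally reported success rate cited in the article

low, high = interval(bayesian_bootstrap_mean(runs))
print(f"95% interval for mean success rate: {low:.1f} to {high:.1f}")
print("reported value inside interval:", low <= reported <= high)
```

Unlike the classical bootstrap, which resamples runs with replacement, this variant reweights every run on each draw, which tends to behave more smoothly when there are only a handful of repetitions. A reported figure that falls well outside the resulting interval is a signal that the original result was not reproduced.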
The analysis found recurring obstacles to reproduction, including incomplete artifacts (missing code or data), dependency version conflicts, and deprecated models. For instance, 35 studies lacked necessary data, and 15 had non-executable artifacts. Even studies with ACM artifact badges, which are intended to signal reliability, often failed to meet requirements: 11 of 19 badged studies had artifacts that were non-functional or insufficiently documented. In one case, a study on prompt augmentation for code summarization could not be reproduced in any of the repeated runs, with performance metrics fluctuating significantly from run to run.
This reproducibility crisis matters because LLMs are increasingly used in critical areas like software development, where unreliable findings could lead to flawed tools or wasted investments. For everyday readers, it underscores that AI advancements, often touted as breakthroughs, may be built on shaky evidence. This affects trust in AI-driven innovations and calls for better standards to ensure research can be verified and built upon.
The paper acknowledges several limitations. It covers only OpenAI models and two conferences, so the findings may not generalize to all LLM research. Budget constraints limited the number of experiment repetitions, and manual interventions during reproduction could introduce bias. The non-deterministic nature of LLMs, influenced by factors like temperature settings and model updates, remains a fundamental challenge, leaving some variability unavoidable.
About the Author
Guilherme A.
Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.