
AI-Generated Smart Contracts Often Fail in Practice

A new study reveals that while AI can produce code that looks similar to real smart contracts, only a small fraction actually works correctly, highlighting a critical gap in automated blockchain development.

AI Research
March 26, 2026
3 min read

Large language models (LLMs) like ChatGPT and Gemini are increasingly used to generate code, but their reliability for creating smart contracts—self-executing programs on blockchains—has been uncertain. A new study from the University of Molise systematically tests this capability, revealing that while AI can produce code that appears similar to human-written contracts, functional correctness remains low. This matters because smart contracts handle high-stakes assets like cryptocurrencies and sensitive data, where errors can lead to significant financial losses or security breaches. The research underscores the need for caution in deploying AI-generated blockchain code without thorough validation.

The key finding is that LLMs generate smart contracts with high semantic similarity to real ones but often fail to function correctly. In a zero-shot setting, where models produce code from simple prompts without additional context, only 20% to 26% of the generated functions behave identically to ground-truth implementations under automated testing. For example, ChatGPT-4o achieved a functional correctness rate of 20.79%, while Gemini reached 25.64%, based on a dataset of 500 real-world Solidity functions. This discrepancy highlights that code that looks plausible on the surface may not execute as intended, posing risks for blockchain applications where precision is critical.
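The paper's notion of "behaving identically to ground-truth implementations under automated testing" can be illustrated with a small differential-testing sketch. This is my own minimal Python analog, not the study's harness: the function names and test inputs are hypothetical, and the "generated" variant omits a validation check, the failure mode the authors describe.

```python
# Sketch of differential testing: run a generated function and a reference
# implementation on the same inputs and require identical outcomes
# (same result, or failure in both). All names here are illustrative.

def reference_transfer(balance: int, amount: int) -> int:
    """Ground-truth behavior: reject transfers that exceed the balance."""
    if amount > balance:
        raise ValueError("insufficient balance")
    return balance - amount

def generated_transfer(balance: int, amount: int) -> int:
    """A plausible generated variant that omits the validation check."""
    return balance - amount

def functionally_equivalent(f, g, test_inputs) -> bool:
    """True iff f and g agree (same value or same failure) on every input."""
    for args in test_inputs:
        try:
            r1 = ("ok", f(*args))
        except Exception:
            r1 = ("error", None)
        try:
            r2 = ("ok", g(*args))
        except Exception:
            r2 = ("error", None)
        if r1 != r2:
            return False
    return True

inputs = [(100, 30), (50, 50), (10, 99)]  # the last case hits the missing check
print(functionally_equivalent(reference_transfer, generated_transfer, inputs))  # False
```

On inputs that stay within the balance the two functions agree, so a superficial review would pass both; only the out-of-range case exposes the divergence, which is why test-based correctness rates fall so far below similarity scores.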

The methodology involved benchmarking four state-of-the-art LLMs: ChatGPT-4o, Gemini 1.5 Flash, CodeLlama, and DeepSeek-Coder-v2. The researchers used two generation settings: zero-shot prompts and retrieval-augmented generation (RAG), where models were provided with similar smart contract examples. They evaluated the generated code across multiple dimensions, including code similarity (using metrics like BLEU and Tree Edit Distance), functional plausibility through automated test execution, gas consumption profiling, and complexity analysis (cognitive and cyclomatic complexity). The dataset comprised 500 Solidity functions from real blockchain projects, ensuring relevance to practical scenarios.
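To give a feel for the similarity side of the evaluation, here is a compact sketch of BLEU: the geometric mean of clipped n-gram precisions with a brevity penalty. This is a standard textbook formulation, not the paper's exact tooling, and the two tokenized Solidity snippets compared below are my own illustration.

```python
# Minimal BLEU sketch: clipped n-gram precision averaged geometrically,
# scaled by a brevity penalty for short candidates.
from collections import Counter
import math

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate, reference, max_n=4):
    precisions = []
    for n in range(1, max_n + 1):
        cand = Counter(ngrams(candidate, n))
        ref = Counter(ngrams(reference, n))
        # Clip each n-gram's count by how often it appears in the reference.
        overlap = sum(min(c, ref[g]) for g, c in cand.items())
        total = max(sum(cand.values()), 1)
        precisions.append(max(overlap, 1e-9) / total)
    # Brevity penalty: penalize candidates shorter than the reference.
    bp = 1.0 if len(candidate) > len(reference) else math.exp(
        1 - len(reference) / max(len(candidate), 1))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

gen = "function transfer ( address to , uint amount ) public".split()
ref = "function transfer ( address to , uint256 amount ) public".split()
score = bleu(gen, ref)  # high overlap despite a type mismatch
```

Note how a single differing token (`uint` vs `uint256`) barely dents the score even though, in Solidity, it can change behavior; this is precisely the gap between surface similarity and functional correctness the study measures.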

Analysis shows that while semantic similarity scores were high—with ChatGPT-4o averaging 0.6813 on a scale where 1.0 indicates perfect similarity—functional plausibility was low. In the zero-shot setting, only 7.35% to 9.09% of generated contracts passed all tests without failures. However, RAG significantly improved performance: DeepSeek-Coder-v2 with RAG boosted functional correctness to 45.19%, and Gemini with RAG increased to 37.01%. Gas consumption analysis revealed that generated code was consistently more efficient, using less gas on average than manually written contracts, but this often came at the cost of omitted validation logic, as illustrated in figures comparing ground-truth and generated functions. Complexity metrics indicated that AI-generated code was simpler, with lower cognitive and cyclomatic complexity, potentially affecting robustness.
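The cyclomatic-complexity comparison can be sketched with McCabe's rule of thumb: one plus the number of branching constructs. The token-counting heuristic and the two Solidity-like snippets below are my own illustration, not the paper's analysis pipeline, but they show why omitting validation logic mechanically lowers complexity.

```python
# Rough McCabe cyclomatic complexity: 1 + number of decision points.
# A crude token scan stands in for a real parser (illustrative only).
import re

DECISION_TOKENS = r"\b(if|for|while|do)\b|&&|\|\||\?"

def cyclomatic_complexity(source: str) -> int:
    return 1 + len(re.findall(DECISION_TOKENS, source))

ground_truth = """
function withdraw(uint amount) public {
    require(amount > 0 && balances[msg.sender] >= amount);
    if (locked[msg.sender]) { revert(); }
    balances[msg.sender] -= amount;
}
"""
generated = """
function withdraw(uint amount) public {
    balances[msg.sender] -= amount;
}
"""
print(cyclomatic_complexity(ground_truth), cyclomatic_complexity(generated))  # 3 1
```

The generated variant scores lower on both complexity and gas simply because the guard conditions are gone, which matches the paper's observation that apparent efficiency often comes at the cost of omitted checks.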

The implications for real-world applications are significant. Smart contracts are central to decentralized finance and data certification, and errors can have costly consequences. The study suggests that while RAG enhances plausibility, achieving production-ready code remains challenging, necessitating expert validation. For instance, the researchers observed that generated functions frequently lacked critical checks, such as input validation using require statements, which could lead to vulnerabilities. This calls for developers to use AI-generated smart contracts cautiously, integrating them into workflows that include rigorous testing and human oversight to ensure security and functionality.
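For readers unfamiliar with Solidity, the pattern the study found missing works like this: `require()` aborts the call when a precondition fails, before any state is modified. Below is a minimal Python analog of that guard pattern (my sketch, not code from the paper; all names are hypothetical).

```python
# Python analog of Solidity's require(): raise before any state change,
# so a failed precondition leaves the ledger untouched.

def require(condition: bool, message: str = "") -> None:
    if not condition:
        raise RuntimeError(message)

def safe_transfer(balances: dict, sender: str, to: str, amount: int) -> None:
    # Both guards run before any mutation, mirroring Solidity's
    # checks-effects-interactions ordering.
    require(amount > 0, "amount must be positive")
    require(balances.get(sender, 0) >= amount, "insufficient balance")
    balances[sender] -= amount
    balances[to] = balances.get(to, 0) + amount

ledger = {"alice": 100}
safe_transfer(ledger, "alice", "bob", 40)  # ledger: {"alice": 60, "bob": 40}
```

A generated function that drops these two lines would compile, pass superficial review, and use less gas, yet allow zero-value or overdrawn transfers, exactly the kind of silent vulnerability the authors warn about.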

Limitations of the study include potential threats to validity, such as the use of Ganache for gas profiling, which may not perfectly replicate mainnet conditions, and the dataset's focus on Solidity up to a certain period, possibly missing newer language features. The semantic similarity metric relied on SmartEmbed, a model from 2019, which might not capture all modern Solidity idioms. Additionally, the sample of 500 functions, while statistically significant, may not represent all smart contract scenarios. Future research could explore broader datasets and real-time deployment testing to further assess AI's capabilities in blockchain development.

Original Source

Read the complete research paper

View on arXiv

About the Author

Guilherme A.

Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.

Connect on LinkedIn