Software development is a complex, resource-intensive process that has traditionally relied on collaboration among experts, and recent advances in large language models (LLMs) hold out the promise of automating parts of it. However, existing AI approaches often use linear, waterfall-style pipelines that oversimplify real-world projects and struggle with large-scale complexity. To address this, researchers have developed EvoDev, a framework inspired by feature-driven development in which AI agents build software feature by feature, with the aim of producing more reliable and functionally complete applications such as Android apps.
The key finding is that EvoDev significantly outperforms existing AI-based methods. In evaluations, it achieved a build success rate of 100% for Android applications, meaning every generated app compiled without errors, and it improved function completeness by 56.8% relative to the best-performing baseline, Code. The framework also improved single-agent performance by between 16.0% and 76.6%, depending on the base LLM, highlighting its effectiveness on intricate software tasks.
Methodologically, EvoDev decomposes user requirements into a list of user-valued features and constructs a Feature Map, a directed acyclic graph that models the dependencies between features. Each node in the map stores multi-level information (business logic, design, and code), which is propagated to guide subsequent iterations. The process has three stages: constructing a structured requirements document, generating the Feature Map to organize features into cohesive sets, and iteratively developing the features with agents such as a Chief Programmer that handle design and implementation. A memory mechanism manages code changes by keeping file versions unique and reducing context clutter, which improves efficiency.
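To make the Feature Map idea concrete, here is a minimal Python sketch of a dependency graph of features. It is not EvoDev's implementation: the `FeatureNode` class, its field names, the two helper functions, and the note-taking example features are all assumptions made for illustration. The sketch also assumes that a feature receives context from both its direct and its transitive dependencies, which the summary above does not specify.

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter  # standard library, Python 3.9+


@dataclass
class FeatureNode:
    """One node of a hypothetical Feature Map (all field names are assumptions)."""
    name: str
    depends_on: list[str] = field(default_factory=list)
    # Multi-level information attached to the node.
    business_logic: str = ""
    design: str = ""
    code: str = ""


def development_order(feature_map: dict[str, FeatureNode]) -> list[str]:
    """Return feature names so that each feature follows its dependencies."""
    graph = {name: set(node.depends_on) for name, node in feature_map.items()}
    # static_order() raises graphlib.CycleError if the graph is not acyclic.
    return list(TopologicalSorter(graph).static_order())


def upstream_context(
    feature_map: dict[str, FeatureNode], name: str
) -> dict[str, dict[str, str]]:
    """Collect the stored information of every direct or transitive dependency."""
    context: dict[str, dict[str, str]] = {}
    pending = list(feature_map[name].depends_on)
    while pending:
        dep = pending.pop()
        if dep in context:
            continue  # already visited through another path
        node = feature_map[dep]
        context[dep] = {
            "business_logic": node.business_logic,
            "design": node.design,
            "code": node.code,
        }
        pending.extend(node.depends_on)
    return context


# Hypothetical features for a small note-taking app.
feature_map = {
    "storage": FeatureNode("storage", design="local database for notes"),
    "note_list": FeatureNode("note_list", depends_on=["storage"]),
    "note_editor": FeatureNode("note_editor", depends_on=["storage"]),
    "search": FeatureNode("search", depends_on=["note_list"]),
}

order = development_order(feature_map)
search_context = upstream_context(feature_map, "search")
print(order)
print(sorted(search_context))
```

In this sketch, `storage` always comes before the features that depend on it, and the context gathered for `search` includes both `note_list` and `storage`. Using a topological order is one straightforward way to ensure that a feature's prerequisites have already been developed when its turn comes.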
Results on the APPDev dataset, which includes 15 Android apps across categories such as Utility and Entertainment, show EvoDev ahead of the baselines. For instance, with Claude-4-Sonnet, it reached a function completeness score of 3.56 out of 4, compared to 2.27 for Code, while maintaining high scores on non-functional metrics such as usability and stability. The framework proved robust across tasks of varying difficulty, with the largest improvements on intermediate-complexity apps, where baseline methods often failed. Ablation studies confirmed that both overall design construction and Feature Map generation contributed to the gains, adding relative improvements in function completeness of 7.2% and 8.2%, respectively.
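The two function completeness scores quoted above are consistent with the 56.8% headline figure if that figure is read as a relative improvement over the baseline. The short Python check below works through the arithmetic. The function name is hypothetical, and treating the headline number as derived from these two scores is an assumption rather than something the summary states.

```python
def relative_improvement(new_score: float, baseline_score: float) -> float:
    """Percentage gain of new_score over baseline_score."""
    return (new_score - baseline_score) / baseline_score * 100.0


evodev_score = 3.56    # function completeness with Claude-4-Sonnet, out of 4
baseline_score = 2.27  # score of the baseline labelled "Code" in the text

gain = relative_improvement(evodev_score, baseline_score)
print(f"{gain:.1f}%")  # prints "56.8%"
```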
This matters because it lets AI agents handle real-world software projects, such as mobile apps with complex lifecycles and dependencies, more effectively. For everyday users, it could mean faster, more reliable app creation and lower development costs. The approach balances coding capability with instruction-following, and it is cost-efficient: with GPT-4.1, for example, it achieved high completeness at a cost of around $1.02 per app.
Limitations noted in the paper include the lack of human-in-the-loop interaction and automated testing integration. The framework does not yet allow for real-time human input or handle test code generation effectively, which could be areas for future improvement. Additionally, the study focused on Android and Kotlin, so generalizability to other platforms and languages remains unexplored.
About the Author
Guilherme A.
Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.