AI Agents Struggle to Simplify Complex Code

As artificial intelligence systems increasingly handle software engineering tasks, their ability to understand and manipulate entire codebases has become crucial. Researchers have introduced a new benchmark called Gistify that reveals significant limitations in current AI models when tasked with extracting and simplifying code functionality.

The core finding shows that even state-of-the-art AI frameworks struggle to reliably create minimal, self-contained code files that reproduce specific functionality from larger codebases. When given a code repository and a command to execute, the AI must generate a single file containing only the essential components needed to run that command successfully. The best-performing combination of AI model and framework achieved a success rate of only 58.7% on this task, indicating substantial room for improvement.

The researchers evaluated the task using several popular AI frameworks, including SWE-Agent, mini-SWE-Agent, and GitHub Copilot, combined with leading language models such as GPT-5, GPT-5-mini, Claude-3.7-Sonnet, and Claude-Sonnet-4. The methodology involved providing each AI system with a Docker image containing a target codebase and a specific entry point command. The AI then had to generate a condensed file that could execute independently while producing outputs identical to those of the original codebase.
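To make the success criterion concrete, the following is a minimal sketch of how one might check that a condensed file behaves like the original command. It is not the benchmark's actual evaluation harness, which the article does not describe. The function names `run_command` and `reproduces_output` are hypothetical, and comparing exit code and standard output exactly is a simplifying assumption (real test output often contains timings that would need to be normalized first).

```python
import subprocess
import sys


def run_command(cmd, cwd=None, timeout=60):
    """Run a command and return its (exit code, stdout) pair."""
    result = subprocess.run(
        cmd, cwd=cwd, capture_output=True, text=True, timeout=timeout
    )
    return result.returncode, result.stdout


def reproduces_output(original_cmd, gist_cmd, cwd=None):
    """Return True if both commands exit with the same code and print the same stdout.

    Assumption: exact equality is the criterion; this is an illustration only.
    """
    return run_command(original_cmd, cwd) == run_command(gist_cmd, cwd)


if __name__ == "__main__":
    # Stand-ins for "original entry point" and "gistified file" commands.
    original = [sys.executable, "-c", "print(2 + 2)"]
    condensed = [sys.executable, "-c", "print(4)"]
    print(reproduces_output(original, condensed))
```

In a real setting the two commands would be the original entry point run inside the repository and the same entry point pointed at the generated single file.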

The results revealed clear patterns in failure modes. Import errors accounted for 32.5% of failures with Claude-Sonnet-4, while file creation failures affected 20% of attempts. Missing function errors were particularly problematic for GPT-5 models, occurring in 76.3% of failed cases. The data also showed that AI models frequently modified test functions despite explicit instructions to preserve them, and that these modifications correlated strongly with task failure (correlation of 0.76).

This research matters because AI systems are increasingly deployed to work with large, complex codebases in real-world software development. The ability to extract and simplify code functionality has practical applications for debugging, code review, and sharing implementations without inheriting heavy dependencies. The generated 'gistified' files themselves could become valuable artifacts that help human developers understand complex systems more easily.

The study acknowledges that current AI performance drops sharply on more difficult subsets of the task, particularly those involving longer execution traces and higher code coverage. Even when provided with execution feedback tools, AI models showed only small consistent gains in performance. The research also found that simpler AI setups with restricted tool access sometimes outperformed more complex configurations, suggesting that current models struggle to effectively leverage available tools for this type of code understanding task.

About the Author

Guilherme A.