Software developers spend countless hours explaining their needs to AI assistants, only to receive responses that miss the mark. A new approach bridges this communication gap by giving AI the ability to understand developer intent and preferences over multiple interactions, achieving an 86% acceptance rate in real-world testing.
The key finding from Carnegie Mellon University researchers is that pairing a standard coding AI with a specialized "theory of mind" partner dramatically improves how well AI assistants understand what developers actually want. This dual-agent system, called ToM-SWE, allows AI to track user preferences, constraints, and interaction patterns across multiple coding sessions, rather than treating each request as an isolated command.
The architecture splits the work between two agents: a primary software engineering agent handles code generation and execution, while a dedicated partner agent focuses exclusively on modeling the user's mental state. This separation lets each component specialize. The coding agent keeps its technical performance, and the theory-of-mind agent builds persistent models of user preferences, coding styles, and interaction patterns. The system operates in two modes: during an active session, it infers the true intent behind potentially ambiguous instructions; after a session, it consolidates the interaction history to refine its understanding of the user.
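The division of labor described above can be illustrated with a small Python sketch. This is not the researchers' implementation: the class names (`UserModel`, `ToMAgent`, `SWEAgent`), the preference dictionary, and the string-based "intent" are all hypothetical stand-ins, and a real system would call language models where this sketch uses plain string handling.

```python
from dataclasses import dataclass, field


@dataclass
class UserModel:
    """Persistent preferences learned across sessions (hypothetical structure)."""
    preferences: dict[str, str] = field(default_factory=dict)


@dataclass
class ToMAgent:
    """Partner agent: models the user and never writes code itself."""
    user_model: UserModel = field(default_factory=UserModel)

    def infer_intent(self, instruction: str) -> str:
        """In-session mode: enrich an instruction with known preferences."""
        prefs = self.user_model.preferences
        if not prefs:
            return instruction
        hints = "; ".join(f"{topic}: {pref}" for topic, pref in sorted(prefs.items()))
        return f"{instruction} [user preferences: {hints}]"

    def consolidate(self, session_log: list[tuple[str, str]]) -> None:
        """After-session mode: fold observed (topic, preference) pairs into the model."""
        for topic, preference in session_log:
            self.user_model.preferences[topic] = preference


class SWEAgent:
    """Primary agent: a stub standing in for code generation and execution."""

    def run(self, task: str) -> str:
        return f"executing: {task}"


def handle_request(swe: SWEAgent, tom: ToMAgent, instruction: str) -> str:
    """Route every instruction through the partner agent before the coding agent."""
    return swe.run(tom.infer_intent(instruction))


tom, swe = ToMAgent(), SWEAgent()

# Session 1: nothing is known about the user yet, so the instruction passes through.
first = handle_request(swe, tom, "add a login endpoint")

# After session 1: consolidate what was observed about the user.
tom.consolidate([("testing", "pytest"), ("docs", "Google-style docstrings")])

# Session 2: the same kind of terse request now carries the remembered preferences.
second = handle_request(swe, tom, "add a logout endpoint")
print(first)
print(second)
```

Keeping the user model in its own agent means the coding agent's interface does not change; it simply receives a more fully specified task.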
Results from testing show substantial improvements. On the newly introduced stateful benchmark, which evaluates AI performance across multiple interactions with simulated users, ToM-SWE achieved a 59.7% success rate compared to 18.1% for the state-of-the-art OpenHands system, a 41.6 percentage point improvement. User satisfaction, scored automatically by the user simulators, rose by 41%, with ToM-SWE scoring 3.62 compared to 2.57 for baseline systems. The system was particularly effective when user instructions were ambiguous or underspecified, successfully resolving 63.4% of such cases compared to 51.9% for standard approaches.
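The headline comparisons can be recomputed directly from the reported figures. The short check below uses only the numbers quoted in this article; the variable names are illustrative.

```python
# Success rates on the stateful benchmark, in percent.
tom_swe_success = 59.7
openhands_success = 18.1
success_gap_points = round(tom_swe_success - openhands_success, 1)

# Simulated user satisfaction scores.
tom_swe_satisfaction = 3.62
baseline_satisfaction = 2.57
relative_gain_percent = round(
    (tom_swe_satisfaction - baseline_satisfaction) / baseline_satisfaction * 100
)

# Resolution rates on ambiguous or underspecified instructions, in percent.
ambiguity_gap_points = round(63.4 - 51.9, 1)

print(success_gap_points)     # 41.6 percentage points
print(relative_gain_percent)  # 41 percent relative improvement
print(ambiguity_gap_points)   # 11.5 percentage points
```

Note the difference between the two kinds of comparison: the success-rate gap is an absolute difference in percentage points, while the satisfaction figure is a relative gain over the baseline score.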
Perhaps most compelling were the real-world results from a three-week study with 17 professional developers using ToM-SWE in their daily work. Across 174 instances where the system provided suggestions, developers accepted or partially accepted the recommendations 86% of the time. The system proved most effective for understanding tasks (92% acceptance) and development work (82.5% acceptance), with developers reporting that the AI "helps me out with rules I already set in my previous conversations" and "creates an accurate profile of me."
The practical implications are significant for software development workflows. Developers typically waste substantial time re-explaining their preferences and constraints to AI assistants. This system demonstrates that AI can learn and adapt to individual working styles, remembering preferences for testing practices, documentation habits, architectural choices, and communication styles across multiple coding sessions. The approach shows particular strength in handling moderately underspecified instructions where sufficient context exists from previous interactions.
Limitations noted in the paper include the computational overhead of the additional AI inferences, though the researchers found this to be modest, adding only 16% to the average session cost. The system also struggles with extremely vague instructions that lack sufficient context, and its performance depends on having clear problem boundaries. The user simulators used for evaluation, while cost-effective, may introduce systematic biases, and the developer profiles used for testing may not represent the full diversity of global programming practices.
The research validates that effective human-AI collaboration in software engineering requires systems that can proactively adapt to user mental states rather than simply responding to surface-level commands. As AI becomes increasingly integrated into development workflows, this approach points toward more intuitive, personalized assistance that understands not just what developers say, but what they actually mean.
About the Author
Guilherme A.
Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.