AI-powered software assistants can now autonomously fix bugs in real-world codebases, but when they fail, developers are left in the dark about why. Current evaluation approaches, which simply check if a patch passes unit tests, collapse an entire execution into a single binary outcome—success or failure. This approach offers no insight into where the agent went wrong or how to improve it. To address this critical limitation, researchers have developed TRAJEVAL, a diagnostic framework that breaks down agent behavior into three interpretable stages, revealing systematic inefficiencies and distinct failure modes across different AI models.
TRAJEVAL decomposes an agent's execution trajectory into search, read, and edit stages, comparing each against a reference patch to compute precision and recall metrics. The search stage measures file-level localization—whether the agent finds the correct files in the repository. The read stage assesses function-level comprehension—whether it examines the relevant code within those files. The edit stage evaluates modification targeting—whether it makes changes in the right locations. By analyzing 16,758 trajectories across three agent architectures (SWE-Agent, OpenHands, LiveSWEAgent) and seven language models ranging from 8 billion to 480 billion parameters, the researchers uncovered universal patterns of over-exploration. All agents examined 22 times more functions than necessary, with read stage precision hovering around just 4-5%, indicating massive inefficiency in code navigation.
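The stage-level metrics described above can be sketched as simple set comparisons. This is a minimal illustration, not the paper's implementation: the function names and the item sets are invented, and each stage (search, read, edit) would compare a different granularity of items (files, functions, edit locations) against the golden context.

```python
# Hedged sketch: stage-level precision/recall against a "golden context"
# derived from the reference patch. Item names are illustrative.

def stage_metrics(agent_items: set, golden_items: set) -> tuple:
    """Precision = fraction of the agent's exploration that was necessary;
    recall = fraction of the required context the agent actually covered."""
    if not agent_items:
        return 0.0, 0.0
    hits = agent_items & golden_items
    precision = len(hits) / len(agent_items)
    recall = len(hits) / len(golden_items) if golden_items else 0.0
    return precision, recall

# Example mirroring the over-exploration finding: the agent reads far
# more functions than the golden context requires, so recall is perfect
# but precision collapses to a few percent.
golden = {"parser.parse", "lexer.next_token"}
read = {f"mod{i}.fn{i}" for i in range(42)} | golden
p, r = stage_metrics(read, golden)  # p ~ 0.045, r == 1.0
```

With 44 functions read and only 2 required, precision lands around 4.5%—the same order as the 4-5% read-stage precision the study reports across agents.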
The methodology relies on comparing agent actions against a 'golden context' derived from reference patches, which implicitly define the minimal code elements required to solve a task. For each stage, precision captures efficiency (how much of the agent's exploration was necessary), while recall captures effectiveness (how much of the required context was successfully identified). The framework extracts six features from trajectories: precision and recall for search, read, and edit. These features are then used to train lightweight logistic regression classifiers to predict task success without needing ground-truth patches at inference time. The researchers validated their approach on SWE-bench Verified (500 Python issues from 12 repositories) and PolyBench Verified (382 multilingual issues across Python, Java, JavaScript, and TypeScript from 20 repositories), ensuring robustness across different programming environments.
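A lightweight success predictor over the six trajectory features can be sketched with scikit-learn. The training data here is synthetic and purely illustrative (the label is driven by the recall columns, echoing the paper's finding that recall correlates with success); the real classifiers would be fit on features extracted from actual trajectories.

```python
# Hedged sketch: logistic regression over the six trajectory features
# [search_p, search_r, read_p, read_r, edit_p, edit_r]. Data is synthetic.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 200
X = rng.uniform(0.0, 1.0, size=(n, 6))
# Synthetic label: success depends mainly on the recall columns (1, 3, 5).
y = ((X[:, 1] + X[:, 3] + X[:, 5]) / 3 > 0.5).astype(int)

clf = LogisticRegression().fit(X, y)
# At inference time no reference patch is needed: the classifier maps
# trajectory features to a success probability.
proba = clf.predict_proba(X[:1])[0, 1]
```

The appeal of logistic regression here is interpretability: each learned coefficient directly indicates how much a stage's precision or recall moves the predicted success probability.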
The results show that trajectory features accurately predict model-level Pass@1 rates within 0.87-2.1% mean absolute error under in-distribution evaluation, with perfect ranking correlation preserved across models. More importantly, the diagnostics reveal that recall, not precision, strongly correlates with task completion—particularly at the edit stage, where higher-performing models achieve broader coverage of necessary code regions. Different models exhibit distinct failure modes: GPT-5 locates relevant code but targets edits incorrectly (low edit recall), while Qwen-32B fails at file localization entirely (low search recall). These insights are actionable; when researchers provided real-time feedback during execution—simple signals like 'You are looking at a relevant file'—they improved Pass@1 by 2.2-4.6 percentage points for GPT-5 and Qwen3-Coder-480B while reducing token usage by up to 29% and costs by 20-31%.
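The feedback intervention can be sketched as a check after each agent action. The helper below is hypothetical; as the limitations section notes, this form of feedback requires knowing the reference patch, so a deployed system would substitute a learned predictor for the golden file set.

```python
# Hedged sketch of the real-time feedback signal: emit a short message
# when the agent's current action touches a file from the reference
# patch. Function and variable names are illustrative.

def feedback(action_file: str, golden_files: set):
    """Return a feedback string if the file is relevant, else None."""
    if action_file in golden_files:
        return "You are looking at a relevant file."
    return None

golden_files = {"src/parser.py"}
msg = feedback("src/parser.py", golden_files)      # relevant -> signal
off = feedback("README.md", golden_files)          # irrelevant -> None
```

Even this minimal signal plausibly curbs over-exploration, which is consistent with the reported token and cost reductions.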
The implications extend beyond mere diagnosis. By transforming opaque agent behavior into interpretable metrics, TRAJEVAL enables mechanism-driven optimization. For instance, low search recall indicates file discovery limitations that might be addressed with improved retrieval techniques, while low edit recall suggests localization failures that could benefit from better planning. The universal precision deficit highlights a systematic inefficiency across all current agent architectures, pointing to a need for more focused exploration strategies. This framework moves evaluation beyond outcome-based benchmarking toward understanding how agents succeed or fail, which is crucial for deploying reliable AI assistants in production development workflows.
However, the approach has limitations. It assumes reference patches represent canonical solutions, though analysis of 334 instances resolved by multiple models shows 93.3% exact function-level convergence, suggesting well-defined fix locations for the benchmarks studied. The intervention study relies on knowing the reference patch for feedback, which isn't available in real deployment—practical systems would need learned predictors. Additionally, the evaluation focuses on bug-fixing tasks in open-source repositories; generalization to other domains like security or performance optimization requires further validation. Despite these constraints, TRAJEVAL provides a foundational tool for diagnosing and improving AI code agents, offering a clearer path to more efficient and effective automated software engineering.
About the Author
Guilherme A.
Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.