AI Agents Now Plan Like Humans, Boosting Accuracy

AI systems that use tools like search engines and calculators are transforming how complex tasks are solved, but they often operate inefficiently by handling one step at a time. A new method, Graph-based Agent Planning (GAP), addresses this by enabling AI to identify and execute independent tasks simultaneously, much like a human delegating work in parallel. This approach not only speeds up responses but also improves accuracy, making AI agents more practical for real-world applications such as research and data analysis.

The researchers report that GAP improves the performance of large language models (LLMs) on multi-step question-answering tasks. By modeling a task as a dependency graph, the system determines which sub-tasks can run in parallel and which must wait for others to complete. In experiments, GAP achieved an average 0.9% improvement in accuracy on multi-hop question-answering benchmarks compared to traditional methods, with notable gains on datasets like HotpotQA and MuSiQue. The efficiency gains were larger: GAP reduced the number of tool interactions by up to 33.4% and decreased response times by 21.4% to 32.3%.

The methodology involves a two-stage training process. First, the AI model undergoes supervised fine-tuning on a curated dataset of 7,000 multi-hop question-answering examples, generated using GPT-4o to ensure high-quality, diverse trajectories. This teaches the model to decompose complex queries into sub-tasks and analyze dependencies. Second, reinforcement learning is applied with a correctness-based reward function, optimizing the model to strategically invoke tools and maximize parallel execution. The system uses a graph structure where nodes represent sub-tasks and edges indicate dependencies, allowing for level-wise execution where independent tasks in the same level are processed simultaneously.
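The level-wise execution described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`execution_levels`, `run_levels`), the `call_tool` callback, and the use of a thread pool are all assumptions made for illustration. The sketch groups sub-tasks into levels so that every task in a level depends only on earlier levels, then runs each level's tasks concurrently.

```python
from concurrent.futures import ThreadPoolExecutor


def execution_levels(deps):
    """Group sub-tasks into levels.

    `deps` maps each sub-task name to the names of the sub-tasks it depends on.
    Every task in a level depends only on tasks from earlier levels.
    """
    remaining = {task: set(parents) for task, parents in deps.items()}
    done = set()
    levels = []
    while remaining:
        # A task is ready once all of its parents have finished.
        ready = sorted(t for t, parents in remaining.items() if parents <= done)
        if not ready:
            raise ValueError("cyclic or unresolved dependency")
        levels.append(ready)
        done.update(ready)
        for task in ready:
            del remaining[task]
    return levels


def run_levels(deps, call_tool):
    """Run sub-tasks level by level; tasks in the same level run concurrently.

    `call_tool(task, parent_results)` is a hypothetical callback standing in
    for a real tool call such as a search query.
    """
    results = {}
    with ThreadPoolExecutor() as pool:
        for level in execution_levels(deps):
            futures = {
                task: pool.submit(
                    call_tool, task, {p: results[p] for p in deps[task]}
                )
                for task in level
            }
            # Wait for the whole level before starting the next one.
            for task, future in futures.items():
                results[task] = future.result()
    return results


if __name__ == "__main__":
    # Hypothetical two-hop question: two independent lookups, then a comparison.
    deps = {
        "lookup_A": [],
        "lookup_B": [],
        "compare": ["lookup_A", "lookup_B"],
    }
    print(execution_levels(deps))
```

In this toy graph the two lookups share a level and can be issued together, while the comparison waits for both. That takes two rounds of tool use where a strictly sequential agent would take three.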

Analysis of the results shows that GAP outperforms baselines like Search-R1 and ZeroSearch, particularly in complex scenarios requiring multiple steps. For instance, on the HotpotQA dataset, GAP reduced the average number of turns from 2.69 to 1.78 and cut response length by 24.9%, from 554 to 416 tokens. This efficiency translates to lower computational costs and faster task completion, and the cost-of-pass metric, which measures the expected expense per successful attempt, shows substantial improvements. The method also generalizes to out-of-domain datasets, which supports its robustness and indicates that the learned strategies transfer to new contexts.
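The arithmetic behind these figures can be sketched in a few lines of Python. The formula used for cost-of-pass here (cost per attempt divided by success rate) is an assumption based on the article's description of the metric as the expected expense per successful attempt, and the study may define it differently. The helper names are hypothetical, and the accuracy value in the example is a placeholder rather than a reported result.

```python
def relative_reduction(before, after):
    """Fractional reduction from `before` to `after`."""
    return (before - after) / before


def cost_of_pass(cost_per_attempt, pass_rate):
    """Expected cost per successful attempt (assumed definition)."""
    if not 0 < pass_rate <= 1:
        raise ValueError("pass_rate must be in (0, 1]")
    return cost_per_attempt / pass_rate


if __name__ == "__main__":
    # Response lengths reported for HotpotQA: 554 tokens down to 416.
    print(f"{relative_reduction(554, 416):.1%}")  # prints 24.9%

    # Placeholder accuracy, NOT a figure from the study. Holding it fixed
    # isolates the effect of shorter responses on cost per correct answer.
    placeholder_accuracy = 0.5
    baseline = cost_of_pass(554, placeholder_accuracy)
    gap = cost_of_pass(416, placeholder_accuracy)
    print(baseline, gap)
```

With tokens as a proxy for cost, a shorter response lowers the cost per correct answer even when accuracy is unchanged. Any gain in accuracy lowers it further.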

In practical terms, GAP's advancements could make AI assistants more reliable and cost-effective for applications like customer support, educational tools, and scientific research, where quick, accurate answers are crucial. By reducing reliance on sequential processing, it addresses a key bottleneck in current AI systems, potentially accelerating adoption in industries that demand high throughput and precision. However, the study notes limitations, such as the focus on question-answering tasks and the use of synthetic data for training, which may not fully capture real-world complexities. Future work could explore multi-objective rewards and broader task domains to enhance applicability.


About the Author

Guilherme A.