AI Agents Fail at Cost-Efficient Planning, Study Finds

TL;DR

A new benchmark shows that advanced AI models struggle to adapt when costs change mid-task, with performance dropping by roughly 40%, revealing a key gap in real-world reliability.

Artificial intelligence agents, designed to handle complex tasks like travel planning, often overlook a crucial skill: adapting plans to save resources when conditions change. This gap in cost-aware planning could limit AI's effectiveness in real-world applications where budgets and environments are unpredictable. A new study introduces CostBench, a benchmark that evaluates how well AI agents minimize costs while adjusting to dynamic scenarios, revealing significant weaknesses in current models.

Researchers found that leading AI models, including GPT-5 and Gemini-2.5-Pro, perform poorly in cost-optimal planning. In static settings, GPT-5 achieved a 75% exact match rate on the most difficult tasks, meaning its plan matched the cost-optimal one only three-quarters of the time. Under dynamic conditions, where costs change mid-task, performance dropped further by roughly 40%, indicating a severe decline in adaptability. Even the best models struggled to maintain cost-efficiency when faced with environmental disruptions.

The methodology centered on CostBench, a scalable framework built around travel-planning tasks. Agents used composite tools with randomized costs to complete multi-step plans, such as selecting destinations and accommodations. The benchmark introduced dynamic events, like tool failures or cost changes, to simulate real-world unpredictability. For example, if a transportation option became more expensive, agents had to replan their route to minimize expenses. This approach tested not just completion but the ability to reason about costs adaptively.
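The replanning idea can be illustrated with a minimal sketch. It is not CostBench's implementation: the graph layout, the tool names, the costs, and the `cheapest_plan` helper are all hypothetical, and a standard shortest-path search (Dijkstra's algorithm) stands in for whatever reasoning an agent would actually do.

```python
import heapq

def cheapest_plan(graph, start, goal):
    """Return (total_cost, path) for the cheapest route, or (inf, []) if none exists.

    `graph` maps each state to {next_state: cost_of_the_tool_that_gets_there}.
    This layout is an assumption made for illustration only.
    """
    frontier = [(0, start, [start])]
    settled = {}
    while frontier:
        cost, node, path = heapq.heappop(frontier)
        if node == goal:
            return cost, path
        if node in settled and settled[node] <= cost:
            continue  # already reached this state at least as cheaply
        settled[node] = cost
        for nxt, step_cost in graph.get(node, {}).items():
            heapq.heappush(frontier, (cost + step_cost, nxt, path + [nxt]))
    return float("inf"), []

# Hypothetical travel task: get from "home" to a booked "hotel".
graph = {
    "home": {"train": 30, "flight": 80},
    "train": {"hotel": 50},
    "flight": {"hotel": 20},
    "hotel": {},
}

before = cheapest_plan(graph, "home", "hotel")

# Dynamic event: the leg after the train becomes more expensive.
graph["train"]["hotel"] = 90
after = cheapest_plan(graph, "home", "hotel")

print(before)  # (80, ['home', 'train', 'hotel'])
print(after)   # (100, ['home', 'flight', 'hotel'])
```

An agent that keeps following the original train plan after the price change would pay 120 instead of 100, which is the kind of missed adjustment the benchmark is designed to expose.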

Results showed that AI models are highly sensitive to cost variations and environmental perturbations. In experiments with ten leading models, performance metrics like average normalized edit distance and exact match ratio worsened as task complexity increased. Under cost-change conditions, exact match ratios fell sharply, with models often failing to identify the cheapest paths. The study also highlighted that models frequently made redundant or invalid tool calls, such as repeating steps or using incorrect parameters, which inflated costs and reduced efficiency.
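To make the two metrics concrete, here is a rough sketch of how they could be computed over tool-call sequences. The article does not give the study's exact definitions, so this assumes a Levenshtein edit distance divided by the length of the longer sequence; the function names and tool names are illustrative.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences of tool calls."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,             # delete x
                curr[j - 1] + 1,         # insert y
                prev[j - 1] + (x != y),  # substitute, free if equal
            ))
        prev = curr
    return prev[-1]

def normalized_edit_distance(pred, ref):
    """Edit distance scaled to [0, 1] by the longer sequence (assumed normalization)."""
    longest = max(len(pred), len(ref))
    return edit_distance(pred, ref) / longest if longest else 0.0

def exact_match_ratio(preds, refs):
    """Fraction of tasks where the predicted plan equals the reference plan."""
    if not refs:
        return 0.0
    return sum(p == r for p, r in zip(preds, refs)) / len(refs)

# Hypothetical reference plan and a prediction with one redundant call.
ref = ["search_flight", "book_hotel", "pay"]
pred = ["search_flight", "search_flight", "book_hotel", "pay"]

print(normalized_edit_distance(pred, ref))      # 0.25
print(exact_match_ratio([ref, pred], [ref, ref]))  # 0.5
```

Under this definition, a single repeated step is enough to turn an exact match into a miss, while the edit distance still records that the plan was close.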

This research matters because cost-efficient planning is essential for AI applications in areas like logistics, healthcare, and personal assistants, where resource constraints are common. Agents that cannot adapt to changing costs may waste resources and produce unreliable outcomes in critical systems. The findings underscore the need to develop more robust AI agents that can learn and optimize in dynamic environments.

Limitations of the study include its focus on the travel-planning domain, which may not capture all real-world cost trade-offs. Additionally, the simulation abstracted away factors like API latency and stochastic failures, which could influence agent behavior in practice. Future work could expand to other domains and incorporate more realistic environmental shifts to better assess AI adaptability.

About the Author

Guilherme A.

Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.
