AI Agents Struggle With Real-World Business Decisions

TL;DR

A theme park simulator shows that top AI systems reach only a small fraction of human performance on tasks that combine planning, learning, and spatial reasoning.

Artificial intelligence has excelled at narrow tasks like board games and professional exams, but real-world decision-making, such as running a business, requires a blend of skills that current systems struggle to master. A new benchmark called Mini Amusement Parks (MAPs) exposes these limitations by simulating the complex, open-ended challenges of managing an amusement park, where success depends on coordinating multiple interdependent decisions over time. The researchers found that state-of-the-art AI agents lag far behind humans: the best model reached only 7.16% of human performance in medium mode and took over two hours to complete a game that humans finish in 50 minutes. This gap underscores a pressing need for AI that can handle the integrated demands of real-world scenarios, from planning to adaptation under uncertainty.

The core finding from the MAPs evaluation is that AI systems, including models like GPT-5, Grok-4, Claude Sonnet 4.5, and Gemini 2.5 Pro, perform poorly when faced with the multi-faceted objective of maximizing park value. In easy mode, the top-performing model, GPT-5, achieved only 13.89% of human performance, and this dropped to 7.16% in medium mode, where planning demands increase. Human players scored an average park value of 835,318 in easy mode and 2,062,681 in medium mode, while GPT-5 managed 116,109 and 147,699 respectively. The analysis revealed common weaknesses: agents often adopted sub-optimal strategies, such as over-prioritizing expensive rides like roller coasters, underutilizing shops, and waiting excessively instead of taking proactive actions. These behaviors highlight a myopic approach that fails to balance short-term gains with long-term outcomes, a critical flaw in business-style decision-making.
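The percentages above can be checked against the reported park values. The sketch below assumes that "percent of human performance" is simply the agent's average park value divided by the human average; the paper may aggregate differently (for example, per run), which could explain small rounding differences. The function name `percent_of_human` is illustrative, not taken from the benchmark.

```python
# Average park values as reported in the article (human players vs. GPT-5).
HUMAN = {"easy": 835_318, "medium": 2_062_681}
GPT5 = {"easy": 116_109, "medium": 147_699}


def percent_of_human(agent_value: float, human_value: float) -> float:
    """Return the agent's park value as a percentage of the human average.

    Assumption: the metric is a plain ratio of averages.
    """
    return 100.0 * agent_value / human_value


for mode in ("easy", "medium"):
    pct = percent_of_human(GPT5[mode], HUMAN[mode])
    print(f"{mode}: {pct:.2f}% of human park value")
```

Under this assumption the ratios come out at roughly 13.9% for easy mode and 7.16% for medium mode, in line with the reported figures.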

To assess these capabilities, the researchers designed MAPs as a 20x20 grid simulator where players act as park managers, making daily decisions on building rides, hiring staff, setting research agendas, and managing inventory. The environment includes six main components: terrain, rides, shops, staff, subclasses with research mechanics, and guests with stochastic behaviors. In medium difficulty, research is introduced, requiring agents to unlock higher-tier attractions through strategic investment, which amplifies the need for long-horizon planning. The evaluation used a ReAct baseline, where agents were conditioned on past actions and observations, with access to game documentation. Additional tests included a sandbox mode for active learning, spatial reasoning heuristics, and world-modeling approaches like WALL-E to probe specific challenges such as sample efficiency and uncertainty handling.
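To make the setup concrete, here is a minimal sketch of a ReAct-style loop (reason, act, observe) on a toy grid park. Everything in it is an assumption for illustration: `ToyPark`, `stub_policy`, the costs, and the revenue numbers are hypothetical and are not the MAPs API or its rules. The stub policy stands in for the language-model call that a real agent would make.

```python
import random
from dataclasses import dataclass, field

GRID_SIZE = 20  # the article describes a 20x20 grid


@dataclass
class ToyPark:
    """Hypothetical stand-in for the simulator; not the MAPs API."""
    money: int = 1000
    day: int = 0
    rides: dict = field(default_factory=dict)  # (x, y) -> ride name
    rng: random.Random = field(default_factory=lambda: random.Random(0))

    def observe(self) -> str:
        return f"day={self.day} money={self.money} rides={len(self.rides)}"

    def step(self, action: dict) -> str:
        if action["type"] == "build":
            pos, cost = (action["x"], action["y"]), action["cost"]
            in_bounds = 0 <= pos[0] < GRID_SIZE and 0 <= pos[1] < GRID_SIZE
            if in_bounds and pos not in self.rides and cost <= self.money:
                self.rides[pos] = action["ride"]
                self.money -= cost
        # Stochastic guests: each ride earns a random amount per day.
        self.money += sum(self.rng.randint(5, 15) for _ in self.rides)
        self.day += 1
        return self.observe()


def stub_policy(history: list) -> tuple:
    """Placeholder for the LLM call: returns (thought, action)."""
    last_obs = history[-1][1]
    money = int(last_obs.split("money=")[1].split()[0])
    n_rides = int(last_obs.split("rides=")[1])
    if money >= 300:
        action = {"type": "build", "ride": "carousel", "cost": 300,
                  "x": n_rides % GRID_SIZE, "y": n_rides // GRID_SIZE}
        return ("I can afford a ride, so I build one.", action)
    return ("Not enough money, so I wait.", {"type": "wait"})


def run_episode(days: int = 10) -> ToyPark:
    park = ToyPark()
    history = [(None, park.observe())]  # (action, observation) pairs
    for _ in range(days):
        thought, action = stub_policy(history)  # reason over past steps
        obs = park.step(action)                 # act in the environment
        history.append((action, obs))           # observe and remember
    return park


park = run_episode()
print(park.observe())
```

The point of the sketch is the shape of the loop: each decision is conditioned on the accumulated history of actions and observations, and the environment's random revenue means the same plan can yield different outcomes across runs.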

Detailed results in the paper show persistent weaknesses across five key areas. First, on open-ended objectives, all AI systems underperformed humans by large margins, with GPT-5 being the best but still far from parity. Second, in long-horizon planning, relative performance dropped significantly from easy to medium mode, indicating struggles with foresight and sequencing. Third, active world-model learning in sandbox mode provided little benefit; most models failed to generate useful insights, with GPT-5 showing minor improvements but others declining. Fourth, spatial reasoning was lacking, as a simple placement heuristic outperformed the AI's own placements, improving GPT-5's score from 13.89% to 21.55% in easy mode. Fifth, stochasticity posed challenges, with revenue and money showing high variance, and world-modeling approaches like WALL-E degrading performance, though an oracle model offered a 4x improvement, suggesting potential if accurate modeling can be achieved.

These findings have significant implications for the development of AI in practical domains like business management, logistics, and resource allocation. The inability to integrate planning, learning, and spatial reasoning limits AI's applicability in real-world settings where decisions are interconnected and uncertain. For instance, the poor performance in MAPs mirrors challenges in optimizing supply chains or managing customer interactions, where long-term strategies and adaptive learning are crucial. The benchmark provides a diagnostic tool to track progress, encouraging research into more robust agents. However, the paper notes limitations, such as the high computational cost and time required for AI evaluations (GPT-5 took an average of 7,583 seconds in medium mode) and the need for better stochastic modeling. Future work could expand MAPs with harder difficulties and additional mechanics, but for now, it serves as a stark reminder of how far AI must go to match human decision-making agility.

About the Author

Guilherme A.

Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.
