AIResearch

AI Agents Struggle with Complex Restaurant Tasks

A new benchmark reveals that standard AI methods fail in high-stakes environments, while a knowledge-based approach excels—pointing to smarter, real-world AI.

AI Research
November 14, 2025
3 min read

Artificial intelligence systems often stumble when faced with real-world complexity, such as managing a busy restaurant where every decision counts. Researchers from the National University of Singapore have developed a benchmark called DinerDash to test AI in high-dimensional action spaces, revealing that common methods like behavior cloning and reinforcement learning fall short without expert knowledge. This matters because it highlights a critical gap in AI's ability to handle tasks like traffic control or logistics, where mistakes can have cascading effects.

The researchers argue that traditional AI benchmarks, such as Atari games, are too simple to evaluate performance in complex scenarios. DinerDash simulates a restaurant environment in which an AI controls a waitress serving customers, with 57 possible actions—seating groups, taking orders, cleaning tables, and more. Customers have happiness levels that decay over time; when a customer's happiness reaches zero, the customer leaves and the AI loses a life. The game ends after five such losses, making it a high-stakes test of planning and multitasking.
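A minimal Python sketch helps illustrate these rules—decaying happiness, departing customers, and a five-life budget. This is a toy model of the setup described above, not the actual DinerDash code; the class and parameter names are invented for illustration:

```python
from dataclasses import dataclass

LIVES = 5  # per the benchmark's setup, the game ends after five customers leave


@dataclass
class Customer:
    happiness: float = 1.0  # drains every tick; at zero, the customer leaves


class DinerDashSketch:
    """Toy sketch of the benchmark's core loop (not the real environment)."""

    def __init__(self, decay: float = 0.1):
        self.decay = decay
        self.customers: list[Customer] = []
        self.lives = LIVES

    def add_customer(self) -> None:
        self.customers.append(Customer())

    def tick(self) -> bool:
        """Advance one time step; return False once the game is over."""
        for c in self.customers:
            c.happiness -= self.decay
        for c in [c for c in self.customers if c.happiness <= 0]:
            self.customers.remove(c)
            self.lives -= 1  # each departed customer costs one life
        return self.lives > 0


env = DinerDashSketch(decay=0.5)
env.add_customer()
env.tick()          # happiness drops from 1.0 to 0.5
alive = env.tick()  # happiness hits 0.0, the customer leaves, one life is lost
```

Even this stripped-down loop shows why the task is hard: every step the agent spends on one sub-task, every other customer's happiness keeps draining in the background.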

To address this, the researchers introduced the Decomposed Policy Graph Modeling (DPGM) algorithm. It breaks down the complex task into smaller sub-problems, such as allocating tables or serving food, and injects domain knowledge—like prioritizing unhappy customers—into the AI's decision-making. This approach uses a graph structure to model relationships between variables, making it more data-efficient than methods that try to learn everything from scratch. For example, instead of processing all 57 actions at once, DPGM focuses on relevant factors, reducing the training burden.
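The routing idea behind this decomposition can be sketched in a few lines: first pick a sub-task, then let a small specialized policy choose among only the actions relevant to it, with domain rules (such as prioritizing the unhappiest customer) baked in. All function names, state keys, and priority rules below are illustrative assumptions, not taken from the paper:

```python
# Illustrative sketch of decomposed decision-making in the spirit of DPGM:
# route each high-level decision to a small sub-policy instead of one flat
# policy over all 57 actions.

def pick_subtask(state: dict) -> str:
    # Injected domain knowledge: serving waiting customers takes priority.
    if state.get("waiting_food"):
        return "serve"
    if state.get("waiting_groups") and state.get("free_tables"):
        return "seat"
    return "clean"


def seat_policy(state: dict) -> str:
    # Sub-problem: which free table to allocate to the waiting group.
    table = min(state["free_tables"])  # e.g. pick the lowest-numbered table
    return f"seat_at_table_{table}"


def serve_policy(state: dict) -> str:
    # Sub-problem: whose food to deliver first; prioritize the unhappiest.
    cust = min(state["waiting_food"], key=lambda c: c["happiness"])
    return f"serve_customer_{cust['id']}"


SUB_POLICIES = {
    "seat": seat_policy,
    "serve": serve_policy,
    "clean": lambda s: "clean_table",
}


def act(state: dict) -> str:
    return SUB_POLICIES[pick_subtask(state)](state)


state = {
    "free_tables": [2, 4],
    "waiting_groups": [{"size": 2}],
    "waiting_food": [{"id": 7, "happiness": 0.2}, {"id": 3, "happiness": 0.8}],
}
act(state)  # serves customer 7, the unhappiest
```

Each sub-policy only ever considers a handful of options, which is the intuition behind the data-efficiency claim: the learning problem shrinks from one 57-way choice to several small, structured ones.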

Results from experiments show that DPGM significantly outperforms baselines. In tests with 274 expert demonstrations, DPGM achieved near-optimal performance, while behavior cloning failed to converge due to insufficient data. Even state-of-the-art methods like Proximal Policy Optimization (PPO) and Generative Adversarial Imitation Learning (GAIL) struggled, with PPO requiring over 30 million steps and still producing unstable policies. GAIL, which uses a discriminator to shape behavior, performed worse than PPO in this environment, highlighting the challenge of sparse, delayed rewards in high-dimensional spaces.

The implications extend beyond gaming to real-world applications like autonomous systems and resource management. For instance, in traffic control, a single wrong decision could lead to congestion, similar to how a misplaced order in DinerDash causes customer loss. This benchmark provides a lightweight, accessible tool for developers to test AI robustness without the high computational costs of existing simulators like StarCraft.

However, the study notes limitations: DPGM relies on structured domain knowledge, which may not generalize to less-organized tasks. It also requires semantic understanding of the state, meaning the AI needs clear definitions of variables to decompose problems effectively. Future work could explore adapting this method to unstructured environments, potentially broadening its use in diverse AI applications.

Original Source

Read the complete research paper

View on arXiv

About the Author

Guilherme A.

Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.

Connect on LinkedIn