AI Agents Fail Most Real Office Tasks, Study Finds

TL;DR

Top AI models fail more than half of simulated enterprise tasks, exposing a major gap in workplace automation and underscoring the need for more reliable systems.

Large language models (LLMs) are increasingly used in businesses to automate tasks like customer support and data analysis, but their effectiveness in complex, real-world office environments remains uncertain. A new study introduces EnterpriseBench, a comprehensive simulation of enterprise settings, to evaluate how well AI agents handle everyday workplace challenges. The researchers found that even the most advanced AI models complete only 41.8% of tasks successfully, indicating significant limitations in current technology for business automation.

The study tested five state-of-the-art models (GPT-4o, Claude-3.5-Sonnet, o1-mini, Llama-3.1-8B, and Llama-3.3-70B) on 500 tasks in domains such as software engineering, human resources, and finance. These tasks mimic real office scenarios, like creating a GitHub repository and notifying a manager, which require multi-step reasoning and adherence to access controls. The o1-mini model performed best but still achieved a low success rate, with common errors including wrong tool selection and incomplete task decomposition.

To run the evaluation, the researchers built EnterpriseBench, a sandbox environment that simulates a company with fragmented data sources, role-based access controls, and cross-functional workflows. They populated the benchmark with synthetic data derived from public sources, applying rule-based techniques to keep it realistic. The tasks require AI agents to perform search, create-read-update-delete (CRUD), and other operations, and are evaluated with correctness scores from automated and human assessments.
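To make the idea of role-based access controls over CRUD operations concrete, here is a minimal Python sketch. It is not taken from EnterpriseBench: the roles, the permission table, and the EmployeeStore class are all hypothetical, chosen only to show how a sandbox might reject an operation that an agent's role does not allow.

```python
from dataclasses import dataclass, field

# Hypothetical role-to-permission table; these roles and permissions are
# illustrative assumptions, not part of EnterpriseBench.
PERMISSIONS = {
    "hr_manager": {"create", "read", "update", "delete"},
    "engineer": {"read"},
}


class AccessDenied(Exception):
    """Raised when a role attempts an operation it is not allowed to perform."""


@dataclass
class EmployeeStore:
    """Toy in-memory data source guarded by role-based access checks."""

    records: dict = field(default_factory=dict)

    def _check(self, role, operation):
        # Unknown roles get an empty permission set, so every call is denied.
        if operation not in PERMISSIONS.get(role, set()):
            raise AccessDenied(f"{role} may not {operation} employee records")

    def create(self, role, emp_id, data):
        self._check(role, "create")
        self.records[emp_id] = dict(data)

    def read(self, role, emp_id):
        self._check(role, "read")
        return self.records[emp_id]

    def update(self, role, emp_id, changes):
        self._check(role, "update")
        self.records[emp_id].update(changes)

    def delete(self, role, emp_id):
        self._check(role, "delete")
        del self.records[emp_id]


store = EmployeeStore()
store.create("hr_manager", "e1", {"name": "Ana", "team": "Finance"})
print(store.read("engineer", "e1")["name"])  # engineers are allowed to read

try:
    store.update("engineer", "e1", {"team": "HR"})  # engineers may not update
except AccessDenied as err:
    print(err)
```

An agent working in such an environment has to pick the right operation and also act under a role that is permitted to perform it, which is one reason a task can fail even when the plan looks plausible.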

The results, detailed in the paper's figures, show that models using ReAct-style reasoning outperformed those without planning, but all of them struggled with multi-step tasks. In one test, for instance, an agent had to execute a workflow with four sub-steps, and failures often stemmed from hallucinations or incorrect tool usage. Humans, by comparison, achieved 70% accuracy but took much longer, highlighting a trade-off between the speed of AI agents and the precision of human workers.
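The way a single wrong tool call sinks a four-step workflow can be sketched in a few lines of Python. This is a simplified illustration, not the paper's scoring method, which relies on correctness scores from automated and human assessments. The tool names and the example traces below are invented for the sketch, and the all-or-nothing rule is an assumption.

```python
# Hypothetical four-step workflow; the tool names are invented for illustration.
EXPECTED_STEPS = [
    "search_employee",
    "create_repository",
    "update_record",
    "send_message",
]


def first_failure(trace, expected=EXPECTED_STEPS):
    """Return the index of the first expected step the trace gets wrong.

    Returns None when the trace matches every expected step in order.
    A trace that stops early fails at the first missing step. Calls made
    after the last expected step are ignored in this simplified sketch.
    """
    for i, tool in enumerate(expected):
        if i >= len(trace) or trace[i] != tool:
            return i
    return None


def success_rate(traces, expected=EXPECTED_STEPS):
    """Fraction of traces that complete every expected step (all-or-nothing)."""
    passed = sum(1 for trace in traces if first_failure(trace, expected) is None)
    return passed / len(traces)


# Made-up agent traces: one correct, one with a wrong tool, one cut short.
traces = [
    list(EXPECTED_STEPS),
    ["search_employee", "create_repository", "send_email", "send_message"],
    ["search_employee", "create_repository"],
]

for trace in traces:
    print(first_failure(trace))
print(success_rate(traces))
```

Under an all-or-nothing rule like this one, an agent that gets three of four steps right still scores zero on the task, which helps explain why multi-step workflows are so punishing.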

These findings matter because enterprises rely on AI for efficiency gains, yet the study shows that current agents cannot reliably handle routine office duties. This could slow adoption in sectors like IT and HR, where accuracy is crucial for tasks such as managing employee records or processing support tickets. The research underscores the need for AI systems that better understand organizational hierarchies and data privacy.

Limitations of the work include its reliance on synthetic data, which may not fully capture real-world complexities, and the need for human validation during task generation, which adds cost and time. The paper notes that errors in AI performance persist and points to areas for future improvement, such as better planning algorithms and grounding mechanisms that reduce hallucinations.

About the Author

Guilherme A.

Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.
