A new benchmark reveals that even the most advanced AI language models struggle with practical, real-world tasks. The Toolathlon evaluation shows that the best-performing model, Claude-4.5-Sonnet, achieves only a 38.6% success rate when asked to complete realistic workflows like managing student assignments or handling customer service tickets. This performance gap highlights a critical limitation in current AI systems' ability to function effectively outside controlled laboratory settings.
The researchers created Toolathlon to test language agents on 108 tasks spanning 32 software applications and 604 tools, from everyday applications like Google Calendar and Notion to professional tools including Kubernetes and BigQuery. Unlike previous benchmarks that focused on narrow domains, Toolathlon requires models to coordinate across multiple applications in long, complex sequences. Each task begins from a realistic initial state, such as an email inbox that already contains messages or a database with existing records, mimicking how humans actually interact with software systems.
The researchers connected AI models to 32 software applications through the Model Context Protocol (MCP), with most tools sourced from real-world applications. To create authentic testing environments, they deployed containerized versions of services like Poste.io for email management and Canvas for course administration. Each task was evaluated using deterministic verification scripts that compared outcomes against ground-truth results, ensuring reliable measurement of success.
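To make the idea of deterministic verification concrete, here is a minimal sketch in Python. It is not the benchmark's actual evaluation code: the function names and the state fields (such as `assignment_1_grade`) are hypothetical, and it assumes the final environment state has already been read into a plain dictionary.

```python
from typing import Any, Dict, List


def find_mismatches(actual: Dict[str, Any], expected: Dict[str, Any]) -> List[str]:
    """Compare the observed final state against the ground truth.

    Only keys present in `expected` are checked, so unrelated state
    left in the environment does not affect the result.
    """
    mismatches = []
    for key, want in expected.items():
        if key not in actual:
            mismatches.append(f"missing: {key}")
        elif actual[key] != want:
            mismatches.append(f"{key}: expected {want!r}, got {actual[key]!r}")
    return mismatches


def task_passed(actual: Dict[str, Any], expected: Dict[str, Any]) -> bool:
    """A task counts as solved only if every expected value matches exactly."""
    return not find_mismatches(actual, expected)
```

Because the check is an exact comparison rather than a judgment by another model, the same final state always produces the same verdict.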
Results show a significant performance gap across all tested models. While Claude-4.5-Sonnet led with a 38.6% success rate, other commercial models like GPT-5 and Gemini-2.5-Pro achieved success rates between 20% and 30%. The best open-weights model, DeepSeek-V3.2-Exp, reached only 20.1%. Analysis revealed two major failure patterns: models frequently called non-existent tools or used incorrect tool names (occurring in 20-45% of attempts across different models), and they struggled with lengthy tool outputs that exceeded context limits. Success rates fell consistently as task complexity increased, with models doing better on easier tasks that required fewer tool calls.
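The two failure patterns can be illustrated with a short Python sketch of the kind of guards an agent harness might apply. This is an assumption for illustration, not how Toolathlon or any particular MCP client is implemented; the tool names and the character limit are hypothetical.

```python
from typing import Optional, Set


def check_tool_call(name: str, available: Set[str]) -> Optional[str]:
    """Return an error message if the requested tool does not exist, else None."""
    if name in available:
        return None
    return f"Tool {name!r} not found"


def truncate_output(text: str, max_chars: int = 2000) -> str:
    """Cut an overlong tool output and note how much was dropped.

    The limit is in characters for simplicity; a real harness would
    more likely count tokens.
    """
    if len(text) <= max_chars:
        return text
    omitted = len(text) - max_chars
    return text[:max_chars] + f"\n[... {omitted} characters omitted ...]"
```

For example, with a registry containing only the hypothetical tool `canvas_list_courses`, a call to `canvas_get_course` would be rejected with an error message the model can react to, rather than silently failing.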
This benchmark matters because it demonstrates that current AI agents cannot reliably handle the multi-step workflows that make up much of modern work. In real-world scenarios, professionals regularly switch between applications: checking email, updating databases, generating reports, and coordinating with team members. The poor performance on Toolathlon suggests that AI assistants are not yet ready to automate these complex workflows, despite their proficiency in narrower domains like coding or web browsing.
The study identifies several limitations in current AI systems. Models frequently ended tasks prematurely, claiming completion before all requirements were met. They also struggled with "fuzzy" instructions that resemble actual human requests: concise prompts that require inferring intent rather than following detailed step-by-step directions. Additionally, giving models access to more tools decreased performance, as agents had difficulty identifying relevant tools and ignoring distracting options.
Toolathlon represents a significant advancement in AI evaluation by moving beyond artificial test environments to realistic software ecosystems. As AI systems increasingly promise to automate complex workflows, this benchmark provides the necessary rigor to measure whether they can actually deliver on that promise in practical settings.
About the Author
Guilherme A.
Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.