AIResearch

AI Struggles to Follow Complex Instructions in Conversations

A new benchmark reveals that large language models often fail to adhere to detailed guidelines in task-oriented dialogues, highlighting a critical gap for real-world applications like customer service.

AI Research
March 27, 2026
3 min read

In customer service and other task-oriented settings, agents must follow precise instructions to resolve issues, such as handling returns or verifying identities. These instructions often involve complex, multi-step conditions written in natural language, like "If the customer is a gold member, allow unlimited returns; if a guest, only within 30 days." Existing AI benchmarks have simplified such scenarios, but a new study shows that even advanced language models struggle to navigate these real-world complexities, raising concerns for their deployment in critical applications.

The researchers developed TOD-ProcBench, a benchmark based on the ABCD dataset, to evaluate how well large language models (LLMs) follow complex instructions in multi-turn dialogues. They found that across six tested models—including Claude3.7-Sonnet, Qwen3-14B, and Llama3.3-70B—performance was consistently low in tasks requiring instruction retrieval, compliance evaluation, and response generation. For example, in Task 1, which involves predicting the next action and retrieving relevant instructions, the best model, Claude3.7-Sonnet, achieved only 39.48% accuracy for top-1 retrieval and action prediction with English conversations, highlighting significant gaps in understanding fine-grained constraints.

To create TOD-ProcBench, the team derived 55 instruction documents from the ABCD dataset, each corresponding to a user intent like handling mystery fees or return policies. These instructions were formatted in three ways: nested "If-Then" statements, flattened condition-action sequences, and JSON mappings. The benchmark includes 7,695 high-quality conversation-instruction pairs, verified by human annotators with 82% accuracy for instruction relevance. The dataset was expanded to seven languages—Arabic, Chinese, English, French, German, Hindi, and Spanish—by translating conversations while keeping instructions in English, to test multilingual capabilities. The complexity is quantified with an average conditional branching factor of 2.2 and up to four nested levels, simulating real-world scenarios.
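To make the three formats concrete, here is an illustrative sketch of a single return-policy rule rendered each way. The policy wording, keys, and condition names are hypothetical examples, not the benchmark's actual instruction documents:

```python
import json

# Hypothetical return-policy rule in the three instruction formats
# described above (illustrative only; not taken from TOD-ProcBench).

# 1. Nested "If-Then" statement
nested = (
    "If the customer requests a return:\n"
    "  If the customer is a gold member, Then allow the return.\n"
    "  If the customer is a guest:\n"
    "    If the purchase is within 30 days, Then allow the return.\n"
    "    Else deny the return."
)

# 2. Flattened condition-action sequence (each branch spelled out)
flattened = [
    ("return request AND gold member", "allow the return"),
    ("return request AND guest AND purchase within 30 days", "allow the return"),
    ("return request AND guest AND purchase older than 30 days", "deny the return"),
]

# 3. JSON mapping from conditions to actions
mapping = json.dumps({
    "intent": "return_policy",
    "rules": [
        {"conditions": ["gold_member"], "action": "allow_return"},
        {"conditions": ["guest", "within_30_days"], "action": "allow_return"},
        {"conditions": ["guest", "older_than_30_days"], "action": "deny_return"},
    ],
}, indent=2)

print(len(flattened))  # number of flattened branches
```

Note how flattening a two-level nested rule multiplies the branches; with the benchmark's average branching factor of 2.2 and up to four nested levels, the number of distinct condition-action paths a model must track grows quickly.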

The results reveal that LLMs perform poorly across all three tasks. In Task 1, accuracy for predicting the next action and retrieving the correct instruction ranged from 14.47% for Llama3.3-70B to 39.48% for Claude3.7-Sonnet with top-1 retrieval in English. Models showed a slight preference for the nested "If-Then" format (f1) but struggled with overlapping action conditions. In Task 2, which evaluates compliance by detecting instruction-violating responses, models like Claude3.7-Sonnet achieved 76.08% accuracy with English conversations using a direct classification approach, but performance dropped to 56.36% for Llama3.3-70B when instruction entailment checks were required. Task 3, on generating compliant responses, saw high compliance rates for larger models (e.g., 95.29% for Claude3.7-Sonnet in English) but very low rates for smaller ones like Qwen3-14B at 22.69%, indicating difficulties in conditional generation.
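The compliance judgment at the heart of Task 2 can be caricatured as a tiny rule evaluator: given the customer's state and the agent's proposed action, decide whether the action violates the policy. In the benchmark an LLM must make this call from natural-language instructions mid-conversation; this toy sketch (with a hypothetical return policy and made-up field names) just shows the underlying decision logic:

```python
# Illustrative Task 2-style compliance check (hypothetical policy).
# An LLM must infer this logic from prose instructions; here it is
# written out explicitly for clarity.

def policy_action(state: dict) -> str:
    """Return the policy-compliant action for a return request."""
    if state.get("gold_member"):
        return "allow_return"  # gold members: unlimited returns
    if state.get("days_since_purchase", float("inf")) <= 30:
        return "allow_return"  # guests: within 30 days only
    return "deny_return"

def is_compliant(state: dict, proposed_action: str) -> bool:
    """True if the agent's proposed action matches the policy."""
    return proposed_action == policy_action(state)

# A compliant and a violating response:
print(is_compliant({"gold_member": True}, "allow_return"))   # True
print(is_compliant({"gold_member": False,
                    "days_since_purchase": 45}, "allow_return"))  # False
```

Even this two-branch example requires combining a membership condition with a temporal one; the benchmark's dialogues layer many such conditions across multiple turns, which is where the tested models' accuracy fell.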

These findings have important implications for deploying AI in real-world applications such as customer support, healthcare, and banking, where strict adherence to procedures is crucial. The benchmark shows that current LLMs may not reliably follow complex guidelines, risking errors in sensitive tasks. The study also suggests that English instructions can guide multilingual conversations with minimal performance drop—for instance, Claude3.7-Sonnet maintained 75.58% accuracy in French for Task 2—potentially reducing the need for translated instructions in global settings. However, the overall low accuracy underscores the need for improved training and evaluation methods to enhance instruction-following capabilities.

Despite its insights, the study has limitations. TOD-ProcBench is based on a single dataset (ABCD), which may not capture all real-world instruction complexities. The benchmark focuses on 55 intents and 30 actions, limiting generalizability to broader domains. Human verification, while thorough, showed 15% of instructions missing important information, indicating potential gaps in data quality. Additionally, the reliance on LLM judges for compliance scoring in Task 3 may introduce bias, as seen with Claude3.7-Sonnet judging its own generations. Future work could expand to more datasets and codified instruction formats to address these constraints and further advance AI's ability to handle intricate real-world tasks.

Original Source

Read the complete research paper

View on arXiv

About the Author

Guilherme A.

Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.

Connect on LinkedIn