
AI Systems Struggle with Ambiguous Questions

New research reveals why AI often fails with natural queries about data, forcing a rethink of how we design systems that understand human language.

AI Research
November 08, 2025
3 min read
When you ask a question about data, whether to a search engine, a chatbot, or a data analysis tool, you expect a clear answer. But what happens when your question is ambiguous? New research shows that current AI systems are poorly equipped to handle the natural ambiguity in human queries about tabular data, and the problem runs deeper than previously thought. This affects anyone who uses technology to get insights from spreadsheets, databases, or online information.

The key finding is that ambiguity in natural language queries is not a flaw to be fixed but a feature of human communication that systems must learn to manage cooperatively. Researchers developed a framework that distinguishes between cooperative queries—where users intentionally leave room for the system to make reasonable choices—and uncooperative ones that lack enough detail to be resolved. For example, asking "What's the average summer temperature in Copenhagen?" is cooperative because the system can infer that "summer" means June to August and "average" means the mean temperature. But asking simply "What's the temperature?" is uncooperative, as it omits critical details like location and time frame.
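To make this concrete, here is a minimal sketch (not from the paper) of how a system might cooperatively resolve the Copenhagen query by filling in two reasonable defaults: "summer" becomes June through August, and "average" becomes the arithmetic mean. The dataset is synthetic and the function name is illustrative.

```python
import pandas as pd

# Synthetic daily temperature records standing in for a Copenhagen weather table.
dates = pd.date_range("2024-01-01", "2024-12-31", freq="D")
temps = pd.DataFrame({
    "date": dates,
    # Flat 10 C baseline, bumped to 20 C in June-August for a clear summer signal.
    "temp_c": 10.0 + 10.0 * dates.month.isin([6, 7, 8]),
})

def average_summer_temperature(df: pd.DataFrame) -> float:
    """Resolve the cooperative query by choosing sensible defaults:
    'summer' -> months 6-8, 'average' -> arithmetic mean."""
    summer = df[df["date"].dt.month.isin([6, 7, 8])]
    return summer["temp_c"].mean()

print(average_summer_temperature(temps))
```

An uncooperative query like "What's the temperature?" has no analogous resolution: there is no principled default for the missing location or time frame, so any answer the system picks is a guess rather than an interpretation.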

The methodology involved analyzing 15 popular datasets used to evaluate AI systems for tabular data tasks, such as question answering and text-to-SQL conversion. The team used large language models to classify over 500 queries from these benchmarks based on data independence and ambiguity. They assessed whether queries contained data-privileged references—like specific column names or internal IDs—that give systems an unrealistic advantage, and whether queries were unambiguous, meaning they had only one valid interpretation.
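The paper's classification was done with large language models, but the idea of a data-privileged reference can be illustrated with a simple heuristic (my sketch, not the authors' classifier): flag queries that mention raw column headers or internal record IDs, since a user without access to the table's schema could not have written them.

```python
import re

# Illustrative patterns for data-privileged references (assumed, not from the paper):
# all-caps tokens that look like column headers, and hyphenated ID strings.
COLUMN_HEADER = re.compile(r"\b[A-Z][A-Z_]{3,}\b")            # e.g. EVENTTIME
INTERNAL_ID = re.compile(r"#[A-Za-z0-9]+(?:-[A-Za-z0-9]+)+")  # e.g. #A729-T

def is_data_privileged(query: str) -> bool:
    """Return True if the query leans on schema or value references
    that would not be available in an open-domain setting."""
    return bool(COLUMN_HEADER.search(query) or INTERNAL_ID.search(query))

print(is_data_privileged("What is the latest EVENTTIME?"))
print(is_data_privileged("When was order #A729-T shipped?"))
print(is_data_privileged("What's the average summer temperature in Copenhagen?"))
```

A real classifier needs far more nuance than two regexes, which is presumably why the researchers used language models for the task; the point is only that such references are mechanically detectable in principle.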

The results, shown in Figure 2 of the paper, reveal systematic issues: many benchmarks are saturated with data-privileged queries that don't reflect real-world use. For instance, up to 60% of queries in some datasets rely on structural references like column headers (e.g., "EVENTTIME") or value references (e.g., "order #A729-T"), which aren't available in open-domain settings. Additionally, unambiguous queries make up only a small fraction of datasets—as low as 10% in some cases—meaning most benchmarks mix cooperative and uncooperative types, conflating the evaluation of system accuracy with interpretative capabilities.

This matters because it impacts how AI tools are developed and trusted in everyday contexts, from business analytics to personal data queries. If systems are trained on flawed benchmarks, they may perform well in tests but fail with real user questions, leading to frustration and errors. The research suggests that better evaluation should separate tests for unambiguous queries (to measure pure accuracy) from those for cooperative ones (to assess reasonable decision-making), and avoid uncooperative queries altogether since they can't be reliably answered.

The study has limitations: it focuses on single-shot queries without iterative refinement, even though real interactions often involve follow-up questions. The framework also assumes systems have full access to the underlying data, which isn't always practical. Future work should explore how AI can dynamically request clarification when queries are underspecified, balancing automation with user consultation.

Original Source

Read the complete research paper

View on arXiv

About the Author

Guilherme A.

Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.

Connect on LinkedIn