AI Struggles to Read Surveys Accurately

Millions of surveys are completed daily, from market polls to patient feedback, but artificial intelligence often fails to interpret them correctly. A new study reveals that how survey data is formatted can cause AI errors of up to 25%, limiting its use in critical areas like healthcare and policy. This research provides the first benchmark to systematically test AI's ability to handle common survey tasks, offering practical guidance to improve accuracy.

Researchers discovered that large language models (LLMs) perform inconsistently on basic survey analysis tasks, such as looking up answers or counting respondents, depending on how the data is presented. For example, using Turtle (TTL) format boosted lookup accuracy by 8.8% compared to suboptimal formats, while poor choices could degrade performance by 16–24%. The study also found that one-shot prompting, where the AI sees a worked example, significantly outperforms zero-shot approaches, reducing errors by up to 25% in multi-step reasoning tasks.
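To make the zero-shot versus one-shot distinction concrete, here is a minimal sketch of how the two prompts might be assembled. The function name `build_prompt`, the prompt wording, and the toy survey are illustrative assumptions, not the study's actual prompts.

```python
def build_prompt(survey_text, question, example=None):
    """Assemble a prompt; passing `example` turns it into a one-shot prompt."""
    parts = ["You are given survey data.", "", survey_text, ""]
    if example is not None:
        # One-shot: show a single worked question/answer pair first.
        ex_question, ex_answer = example
        parts += [
            f"Example question: {ex_question}",
            f"Example answer: {ex_answer}",
            "",
        ]
    parts += [f"Question: {question}", "Answer:"]
    return "\n".join(parts)


# Hypothetical toy survey used only for illustration.
survey = "respondent_id,age,q1\nr1,34,Yes\nr2,51,No"
question = "How many respondents answered Yes to q1?"

zero_shot = build_prompt(survey, question)
one_shot = build_prompt(
    survey,
    question,
    example=("What did respondent r2 answer to q1?", "No"),
)
```

The only difference between the two prompts is the worked example, which is the variable the comparison described above isolates.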

The team developed the QASU benchmark to evaluate six core skills: answer lookup, respondent counting, reverse lookup, conceptual aggregation, rule-based querying, and multi-hop relational inference. They tested six data serialization formats (JSON, HTML, XML, Markdown, Turtle, and plain text) on identical survey datasets from health, psychology, and software engineering. Experiments used models such as GPT-5-mini and Gemini-2.5-Flash. Inputs were sized to fit token limits, and the data was anonymized so that models could not rely on memorized datasets, using techniques such as rank swapping for numeric values and controlled perturbation for multiple-choice answers.
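As an illustration of what "the same data in different formats" means in practice, the sketch below serializes one toy survey into three of the formats named above. The field names, the sample responses, and the exact layout of each format are assumptions made for this example; the benchmark's own serializers may differ.

```python
import json

# Hypothetical toy survey: two respondents, two questions.
responses = [
    {"respondent": "r1", "age": 34, "smokes": "No"},
    {"respondent": "r2", "age": 51, "smokes": "Yes"},
]


def to_json(rows):
    """Serialize the rows as an indented JSON array of objects."""
    return json.dumps(rows, indent=2)


def to_markdown(rows):
    """Serialize the rows as a Markdown table, one respondent per row."""
    headers = list(rows[0])
    lines = [
        "| " + " | ".join(headers) + " |",
        "| " + " | ".join("---" for _ in headers) + " |",
    ]
    for row in rows:
        lines.append("| " + " | ".join(str(row[h]) for h in headers) + " |")
    return "\n".join(lines)


def to_plain_text(rows):
    """Serialize the rows as one 'key: value' line per respondent."""
    return "\n".join(
        "; ".join(f"{key}: {value}" for key, value in row.items())
        for row in rows
    )
```

All three outputs carry identical information, so any difference in a model's accuracy across them can be attributed to presentation rather than content.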

Results showed that no single format works best across all tasks. Plain text excelled in conceptual aggregation, for instance, while structured formats like JSON aided counting and lookup. Adding lightweight structural hints through self-augmented prompting, in which the AI first generates its own description of the data, improved accuracy by a further 3–4% on average, particularly on reverse lookups and on healthcare datasets. This method helps the model navigate complex survey elements such as mixed data types and skip logic without manual intervention.
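Self-augmented prompting can be pictured as a two-stage exchange: the model first describes the data, then answers the question with that description included. The sketch below assumes a caller-supplied `ask_model` function that sends a prompt to some LLM and returns its text reply; that function, the name `self_augmented_answer`, and the prompt wording are all hypothetical and not taken from the study.

```python
from typing import Callable


def self_augmented_answer(
    ask_model: Callable[[str], str],
    survey_text: str,
    question: str,
) -> str:
    """Two-stage prompting: generate structural notes, then answer with them."""
    # Stage 1: ask the model to describe the structure of the survey data.
    hint_prompt = (
        "Describe the structure of this survey data in a few sentences: "
        "the questions, the answer types, and any skip logic.\n\n"
        + survey_text
    )
    hints = ask_model(hint_prompt)

    # Stage 2: ask the real question, with the model's own notes attached.
    answer_prompt = (
        f"Survey data:\n{survey_text}\n\n"
        f"Notes on the data:\n{hints}\n\n"
        f"Question: {question}\nAnswer:"
    )
    return ask_model(answer_prompt)
```

Because `ask_model` is passed in, the same function works with any model client, and it can be exercised offline with a stub that returns canned replies.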

These findings matter because surveys underpin decisions in medicine, market research, and public policy, where inaccuracies can lead to flawed insights. By identifying effective formatting and prompting strategies, the study supports more reliable uses of AI, such as automating the analysis of patient feedback in hospitals or summarizing consumer trends. The research has limitations, however: it does not address missing data, it focuses on English-language Western surveys, and it assumes clean, complete datasets, which may not reflect real-world messiness. Future work should address cultural diversity and missing-data challenges to broaden applicability.

About the Author

Guilherme A.