Vision-language models struggle with everyday computer interactions that humans find simple, revealing fundamental gaps in artificial intelligence's understanding of graphical interfaces. A new study shows these AI systems can identify buttons and icons but fail to predict what happens when you click them or determine whether tasks have been completed successfully.
Researchers discovered that current AI models excel at basic interface perception but perform poorly at interaction prediction and instruction understanding. The study evaluated models across six platforms (Windows, macOS, Linux, Android, iOS, and web applications), testing their ability to understand GUI elements, predict action outcomes, and interpret task completion. While models achieved up to 87% accuracy in identifying interface components, their performance dropped to as low as 17% when predicting what would happen after user interactions.
The research team created GUI-Knowledge-Bench, a comprehensive benchmark comprising 3,483 knowledge-centric questions derived from 40,000 screenshots across 400 applications. They systematically analyzed failure patterns across three knowledge dimensions: interface perception (recognizing widgets and states), interaction prediction (anticipating action outcomes), and instruction understanding (judging task completion). The benchmark revealed that while models can identify what interface elements are, they struggle with understanding how those elements behave and change.
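To make the three-dimension breakdown concrete, here is a minimal sketch of how answers to such a benchmark could be scored separately for each knowledge dimension. It is an illustration only: the `Question` record, the dimension labels, and the `accuracy_by_dimension` function are hypothetical names chosen for this example and do not reflect the actual data format or evaluation code of GUI-Knowledge-Bench.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Question:
    """Hypothetical benchmark item; not the real GUI-Knowledge-Bench schema."""
    dimension: str  # e.g. "perception", "interaction", or "instruction"
    answer: str     # the correct choice label, e.g. "A"


def accuracy_by_dimension(questions, predictions):
    """Return {dimension: fraction of correct predictions} for paired inputs."""
    if len(questions) != len(predictions):
        raise ValueError("questions and predictions must have the same length")

    correct = defaultdict(int)
    total = defaultdict(int)
    for question, predicted in zip(questions, predictions):
        total[question.dimension] += 1
        if predicted == question.answer:
            correct[question.dimension] += 1

    # Only dimensions that actually occur are reported, so no division by zero.
    return {dim: correct[dim] / total[dim] for dim in total}


if __name__ == "__main__":
    # Made-up toy data, purely to show the call shape.
    sample_questions = [
        Question("perception", "A"),
        Question("perception", "B"),
        Question("interaction", "C"),
        Question("instruction", "D"),
    ]
    sample_predictions = ["A", "C", "C", "A"]
    print(accuracy_by_dimension(sample_questions, sample_predictions))
```

Reporting accuracy per dimension rather than as a single aggregate is what lets a gap like the one described here show up: a model can score well on perception while scoring poorly on interaction prediction, and one overall number would hide that.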
Results showed consistent patterns across all tested models, including Claude-Sonnet-4, GPT-5, Gemini-2.5-Pro, and various open-source alternatives. In interface perception tasks, models achieved strong performance, with GPT-5 reaching 87.10% accuracy in layout understanding. However, in interaction prediction, the same model dropped to 67.56% accuracy, frequently confusing single clicks with double-clicks or right-clicks. For instruction understanding, where models must determine whether a task has been completed, performance varied widely, with some models achieving only 31.18% accuracy.
The practical implications are significant for anyone relying on AI for computer automation. Models failed at tasks humans consider straightforward, such as adding notes to presentations or formatting tables correctly. In one example, when asked to add a note in PowerPoint, models repeatedly attempted to insert comments into text boxes rather than enabling the Notes pane through the View menu. In another test, models could not reliably judge whether an application had been successfully removed from a computer's dock.
These limitations matter because graphical interface automation represents a growing application area for AI, from booking flights to editing documents. The study's real-world validation using OSWorld environments confirmed that knowledge gaps directly impact task success rates. When models lacked specific application knowledge, their performance dropped significantly, even with multiple attempts.
The research identifies several key limitations in current AI approaches. Models tend to rely on superficial layout cues rather than understanding the underlying functionality of interface elements. They also struggle with application-specific knowledge that humans acquire through experience, such as knowing that comma-separated values require specific delimiter settings in spreadsheet applications.
This work provides a diagnostic tool for evaluating and improving AI systems, highlighting the need for richer knowledge representation in vision-language models. The findings suggest that current training methods, primarily supervised fine-tuning and reinforcement learning, may not sufficiently address the fundamental knowledge gaps preventing AI from achieving human-level performance in graphical interface tasks.
About the Author
Guilherme A.
Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.