AIResearch

AI Agents Struggle to Design User Interfaces Like Humans

A new benchmark reveals that vision-language models can replicate and modify mobile app screens using design tools, but they often make critical errors in layout and text, limiting their practical use in real-world design workflows.

AI Research
March 26, 2026
4 min read

User interface design is a core task in software development, where designers iteratively refine screens using tools like Figma or Sketch. Recent advances in vision-language models (VLMs) suggest these AI systems could assist by directly operating such software through tool invocations, potentially automating repetitive tasks or aiding in fine-grained edits. However, until now, there has been no standardized way to measure how well VLMs perform in this tool-based design environment. A new study introduces CANVAS, the first benchmark to evaluate VLMs on tool-based user interface design, revealing both promising capabilities and significant limitations in current models.

The researchers developed CANVAS to assess VLMs' ability to handle two common design scenarios: replication and modification. In replication tasks, models must recreate an entire UI screen from a reference image, starting from an empty canvas and using tool invocations step-by-step. In modification tasks, models edit an existing design based on specific instructions, such as adjusting a component's color or inserting a new element. The benchmark includes 598 tasks derived from 3,327 human-crafted mobile UI designs across 30 categories like onboarding and messaging, ensuring a diverse and realistic test set. This setup mirrors real-world design workflows, where AI could collaborate with designers within familiar software environments.
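The two task types can be pictured as a simple data schema. The sketch below is purely illustrative — the class name `DesignTask` and its fields are assumptions, not the paper's actual data format — but it captures the distinction: replication tasks supply only a reference image, while modification tasks add a natural-language instruction over an existing design.

```python
from dataclasses import dataclass
from typing import Literal, Optional

@dataclass
class DesignTask:
    """Hypothetical schema for a CANVAS-style benchmark task."""
    task_type: Literal["replication", "modification"]
    reference_image: str                # path to the target screenshot
    instruction: Optional[str] = None   # present only for modification tasks
    max_turns: int = 50                 # turn limit reported in the study

# Replication: recreate the whole screen from the reference image alone.
rep = DesignTask("replication", "screens/onboarding_01.png")

# Modification: edit an existing design per a specific instruction.
mod = DesignTask("modification", "screens/messaging_03.png",
                 instruction="Change the primary button color to blue")
```

Framing tasks this way makes clear why the two scenarios stress different skills: replication rewards long-horizon planning from a blank canvas, while modification rewards precise, localized edits.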

To evaluate performance, the team used a multi-turn agentic pipeline where models interact with Figma via 50 predefined tools, such as creating rectangles or setting text properties. They tested five state-of-the-art VLMs: GPT-4o, GPT-4.1, Claude-3.5-Sonnet, Gemini-2.5-Flash, and Gemini-2.5-Pro. Each model operated in cycles of thought, action, and observation, with tasks limited to 50 turns. The evaluation measured similarity between generated and ground-truth designs across four metrics: structural similarity (SSIM) for low-level features, saliency similarity for mid-level patterns, BLIP caption similarity for high-level semantics, and component-wise similarity for attributes like position and color. This hierarchical approach approximates how humans visually process designs.
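The thought–action–observation cycle described above can be sketched as a minimal loop. This is a simplified stand-in, not the paper's pipeline: the tool names and their behavior are hypothetical (the benchmark's 50 Figma tools are not reproduced here), and a fixed plan replaces the model's actual reasoning step.

```python
# Hypothetical tool registry standing in for the 50 Figma tools.
TOOLS = {
    "create_rectangle": lambda x, y, w, h: f"rect at ({x},{y}) size {w}x{h}",
    "set_text": lambda node_id, text: f"text of {node_id} set to {text!r}",
}

def run_agent(plan, max_turns=50):
    """Execute (tool, kwargs) pairs one per turn, collecting observations.
    In the real pipeline, a VLM would choose each action from the
    previous observation instead of following a fixed plan."""
    observations = []
    for turn, (tool, kwargs) in enumerate(plan, start=1):
        if turn > max_turns:          # the study capped tasks at 50 turns
            break
        observations.append(TOOLS[tool](**kwargs))
    return observations

obs = run_agent([
    ("create_rectangle", {"x": 0, "y": 0, "w": 375, "h": 64}),
    ("set_text", {"node_id": "title", "text": "Welcome"}),
])
```

The turn cap matters in practice: it is why strategic tool use, such as copying components instead of rebuilding them, correlated with higher scores.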

The results, detailed in Table 1 of the paper, show that models achieved moderate success but with notable variations. In replication tasks, Gemini-2.5-Pro scored highest in SSIM (0.774) and saliency similarity (0.630), indicating strength in replicating visual contours and compositions, while GPT-4.1 led in BLIP similarity (0.655) and component-wise similarity (0.716), suggesting better semantic understanding and attribute preservation. In modification tasks, GPT-4.1 outperformed the others across all metrics, with an SSIM of 0.890 and component-wise similarity of 0.951. Further analysis revealed that high performance in replication correlated with diverse tool use: models like GPT-4.1 and Gemini-2.5-Pro employed more varied tools strategically, such as copying components to save turns. In contrast, modification tasks required precise tool selection; even small errors, like adding a line break, could cause large shifts in similarity scores, as seen in negative Δ values for some models.
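To make the SSIM scores above concrete, here is a toy computation. This uses the global (single-window) form of the standard SSIM formula for brevity; the benchmark very likely uses the usual windowed variant, so treat this as an illustration of the metric's behavior, not the paper's implementation.

```python
import numpy as np

def global_ssim(x, y, data_range=1.0):
    """Global SSIM over whole images — a simplification of the
    windowed SSIM; constants follow the standard definition."""
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + c1) * (2 * cov + c2)) / \
           ((mx**2 + my**2 + c1) * (vx + vy + c2))

# Toy grayscale "screenshots": a card on a dark background.
reference = np.zeros((64, 64))
reference[16:48, 16:48] = 1.0
generated = reference.copy()
generated[16:32, 16:32] = 0.5        # one mis-rendered component

identical = global_ssim(reference, reference)  # perfect match
degraded = global_ssim(reference, generated)   # penalized match
```

Because SSIM compares luminance, contrast, and structure, a single mis-rendered component lowers the score without collapsing it, which is why the metric captures low-level visual fidelity rather than semantic correctness.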

Error analysis highlighted common failure patterns that limit practical application. Models frequently struggled with geometric operations, miscounting elements or creating incoherent layouts, as shown in Figure 5 of the paper. They also had difficulty with auto-layout features, where changes to parent components disrupted child elements, pushing items off-screen. Text operations posed another challenge, with models often assigning insufficient space to text components, leading to overflow and broken visual alignment, as illustrated in Figure 7. These errors suggest that while VLMs can handle basic design tasks, they lack the nuanced understanding needed for complex, real-world UI design, where small mistakes can ruin usability.

The implications of this research are significant for the future of AI-assisted design. CANVAS provides a standardized framework to track progress in tool-based UI generation, guiding improvements in model training and evaluation. For designers, it shows that current VLMs may be useful for automating simple tasks but are not yet reliable for critical design work. The benchmark's human preference study, in which GPT-4.1 was preferred over the other models, aligns with the metric results, validating the evaluation approach. However, the study also underscores the need for better training strategies, such as imitation learning, to enhance tool precision and reduce errors in modification tasks.

Despite its contributions, CANVAS has limitations. The benchmark focuses on mobile UI designs from Figma, which may not generalize to other platforms or more complex interfaces. The evaluation relies on automated metrics that, while correlated with human judgment, may not capture all aspects of design quality, such as aesthetic appeal or user experience. Additionally, the study excluded open-source models due to high failure rates in multi-turn tool invocation, limiting the scope of comparison. Future work could expand to include more diverse design types and improve metrics to better reflect real-world design standards. Overall, CANVAS offers a crucial step toward understanding how AI can collaborate in creative processes, but much remains to be done before these models can seamlessly integrate into professional design workflows.

Original Source

Read the complete research paper

View on arXiv

About the Author

Guilherme A.

Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.

Connect on LinkedIn