AI Agents Learn Software Testing on Their Own

TL;DR

A reinforcement learning system automates app testing by learning from past runs, cutting the time needed to verify software like Windows and Edge.

Software testing is a critical but slow and expensive part of development, often requiring hours of manual effort to confirm that programs work correctly. Researchers at Microsoft have developed an AI system called DRIFT that automates functional testing with reinforcement learning, so it can exercise software features without human intervention. This could accelerate software updates and improve reliability for everyday users.

DRIFT learns to interact with software through a symbolic representation of the user interface, known as a UITree, which describes elements like buttons and menus. By treating software testing as a decision-making problem, the AI agent selects actions (such as clicking icons) to achieve specific goals, like adding a website to favorites in the Edge browser. The system uses a graph neural network to understand the interface structure and predict the best steps. It trains on historical data from previous test runs, so it does not need to interact in real time with slow simulators while learning.
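To make the paragraph above concrete, here is a minimal sketch of learning from logged test runs. It rests on several assumptions that are not in the article: the `UINode` class, the state and action names, and the reward of 1 for reaching the goal are all hypothetical, and a plain lookup table with a tabular Q-learning update stands in for the graph neural network. It illustrates the general idea of offline learning over a UI tree, not DRIFT's actual implementation.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class UINode:
    """One element of a hypothetical UI tree (fields are illustrative)."""
    name: str
    kind: str  # e.g. "window", "button", "menu"
    children: list = field(default_factory=list)


def clickable(node):
    """Return the names of all buttons and menus in the tree, depth-first."""
    found = [node.name] if node.kind in {"button", "menu"} else []
    for child in node.children:
        found.extend(clickable(child))
    return found


def batch_q_learning(transitions, gamma=0.9, lr=0.5, sweeps=50):
    """Tabular Q-learning over a fixed log of
    (state, action, reward, next_state, done) tuples. No live app is needed."""
    q = defaultdict(float)
    seen = defaultdict(set)  # actions observed in each state
    for s, a, _, _, _ in transitions:
        seen[s].add(a)
    for _ in range(sweeps):
        for s, a, r, s2, done in transitions:
            best_next = 0.0 if done else max(
                (q[(s2, a2)] for a2 in seen[s2]), default=0.0
            )
            q[(s, a)] += lr * (r + gamma * best_next - q[(s, a)])
    return q, seen


def greedy_action(q, seen, state):
    """Pick the logged action with the highest learned value in this state."""
    return max(seen[state], key=lambda a: q[(state, a)])


# Hypothetical log from earlier test runs; reward 1 marks reaching the goal panel.
LOG = [
    ("home", "click System", 0.0, "system", False),
    ("home", "click Devices", 0.0, "devices", False),
    ("system", "click Notifications", 1.0, "notifications", True),
    ("system", "click Back", 0.0, "home", False),
    ("devices", "click Back", 0.0, "home", False),
]

q_table, seen_actions = batch_q_learning(LOG)
```

On this toy log the greedy policy picks "click System" from "home" and then "click Notifications", a two-step route to the goal. A real system would replace the string state names with an encoding of the full UI tree.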

In experiments on Windows 10, DRIFT consistently outperformed baseline methods. For example, in navigating to the notifications panel in System Settings, DRIFT completed the task in just two steps, whereas random agents took an average of 443 steps and a hash-based method failed entirely. The system also handles multiple testing objectives simultaneously; in one test, it managed to complete tasks like navigating panels and adding devices, though with varying efficiency depending on the setup. A key feature is the ability to balance speed and coverage: by adjusting a temperature parameter, DRIFT can prioritize quick task completion or explore more interface elements to uncover potential issues.
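One common way to implement a temperature knob like the one described above is Boltzmann (softmax) action selection. The sketch below assumes that form; the article does not say which exact rule DRIFT uses, and the action names and scores are made up for illustration.

```python
import math
import random


def boltzmann_probs(scores, temperature):
    """Turn action scores into probabilities with a softmax.

    Low temperature concentrates probability on the best action (speed);
    high temperature spreads it across actions (coverage).
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    top = max(scores.values())
    # Subtract the max before exponentiating for numerical stability.
    weights = {a: math.exp((s - top) / temperature) for a, s in scores.items()}
    total = sum(weights.values())
    return {a: w / total for a, w in weights.items()}


def sample_action(scores, temperature, rng=random):
    """Draw one action according to the softmax probabilities."""
    probs = boltzmann_probs(scores, temperature)
    actions = list(probs)
    return rng.choices(actions, weights=[probs[a] for a in actions], k=1)[0]


# Hypothetical learned scores for three clickable elements.
SCORES = {"System": 0.9, "Devices": 0.73, "Back": 0.0}

focused = boltzmann_probs(SCORES, temperature=0.01)    # nearly always "System"
exploring = boltzmann_probs(SCORES, temperature=100.0)  # close to uniform
```

With a very low temperature the agent almost always takes the highest-scoring action, which favors short task completion. With a high temperature the choice is nearly uniform, so more interface elements get exercised at the cost of longer episodes.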

This approach matters because it reduces the time and cost of software testing, which is essential for shipping rapid updates to software such as operating systems and browsers. For regular users, it means fewer bugs and more stable releases. However, the paper notes limitations, such as DRIFT's reliance on the entire interaction history rather than memory, which may hinder its ability to handle more complex, multi-step tests.

About the Author

Guilherme A.

Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.
