AIResearch
Robotics

Robots Learn Complex Tasks by Watching Videos

A new AI method enables robots to acquire skills from demonstration videos and automatically generate executable plans for complex assignments without human programming.

AI Research
November 14, 2025
3 min read

Imagine a robot that can watch a video of someone setting a dinner table and then perform the same task autonomously, even if it has never encountered that specific assignment before. This capability moves us closer to robots that can adapt to unpredictable everyday environments, learning new skills on demand rather than relying on pre-programmed instructions. The challenge lies in bridging the gap between observing human demonstrations and generating feasible, executable plans for robots.

Researchers have developed a method that allows robots to learn skills from demonstration videos and use this knowledge to accomplish high-level tasks. The key finding is that robots can automatically generate parametric specifications—formal descriptions of temporal and spatial relationships—from videos, which then serve as a domain theory for automated planning. This approach enables robots to solve vague assignments, like "set up the dinner table," by producing detailed, executable action sequences without extensive human input.

To achieve this, the method uses a graph-based spatial temporal logic (GSTL) to represent knowledge. GSTL captures both spatial information (e.g., objects' positions and relationships) and temporal information (e.g., durations and sequences of actions). The specification mining algorithm processes demonstration videos frame by frame, detecting objects and their interactions. It constructs simple GSTL terms for each frame, such as noting that a hand is holding a cup or that a fork is to the left of a plate, and then combines these into more complex formulas that describe the entire skill. For example, from a video of table setting, it might learn a formula like "the hand grabs the cup and places it on the table within a certain time interval."
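To make the per-frame step concrete, here is a minimal sketch of how simple spatial terms could be extracted from one frame. It assumes a hypothetical object detector that returns labeled 3D positions; the predicate names (`holding`, `left_of`) and distance thresholds are illustrative choices, not the paper's exact GSTL vocabulary.

```python
# Sketch: extract simple spatial predicates from one video frame,
# assuming a hypothetical detector that yields labeled 3D positions.
from dataclasses import dataclass

@dataclass
class Detection:
    label: str
    x: float  # position in a table-aligned frame (meters)
    y: float
    z: float

def frame_terms(detections):
    """Build simple spatial predicates for one frame."""
    objs = {d.label: d for d in detections}
    terms = []
    # "holding": the hand is within a small radius of an object
    hand = objs.get("hand")
    if hand:
        for name, d in objs.items():
            if name != "hand" and abs(hand.x - d.x) < 0.05 and abs(hand.z - d.z) < 0.05:
                terms.append(("holding", "hand", name))
    # "left_of": pairwise ordering of objects along the table's x-axis
    for a in objs.values():
        for b in objs.values():
            if a.label != b.label and a.x < b.x - 0.1:
                terms.append(("left_of", a.label, b.label))
    return terms

frame = [Detection("hand", 0.30, 0.20, 0.41),
         Detection("cup", 0.31, 0.20, 0.40),
         Detection("plate", 0.60, 0.20, 0.00)]
print(frame_terms(frame))
```

The mining stage would then look for terms that recur across frames and videos and combine them, with time intervals, into the parametric formulas that make up the domain theory.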

The results show that this method can generate a domain theory from multiple videos, which includes common-sense knowledge and action primitives. In evaluations using a table-setting scenario, the algorithm learned formulas such as those describing the spatial arrangement of utensils and the temporal sequence of actions like picking up and placing objects. The automated planner, consisting of a proposer and a verifier, then uses this theory to generate ordered actions. The proposer explores possible action sequences in a graph-based search, while the verifier checks feasibility using satisfiability modulo theories (SMT) and Boolean satisfiability (SAT) solvers to ensure plans meet temporal and spatial constraints.
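The proposer/verifier split can be sketched in a few lines. In this toy version the proposer runs a depth-first search over action sequences, and the verifier stands in for the SMT/SAT solvers with a direct precedence check; the action names and the single constraint are assumptions for a table-setting illustration, not the authors' actual domain theory.

```python
# Sketch: a proposer/verifier planning loop on a toy table-setting domain.
ACTIONS = ["place_plate", "place_fork", "place_cup"]

def feasible(prefix):
    """Verifier stand-in: the plate must be placed before the fork,
    since the fork's target pose is defined relative to the plate."""
    if "place_fork" in prefix:
        return "place_plate" in prefix[:prefix.index("place_fork")]
    return True

def propose(plan=()):
    """Proposer: depth-first search that prunes infeasible prefixes."""
    if len(plan) == len(ACTIONS):
        yield list(plan)
        return
    for action in ACTIONS:
        if action not in plan and feasible(plan + (action,)):
            yield from propose(plan + (action,))

for plan in propose():
    print(plan)
```

A real verifier would encode the mined GSTL formulas as solver constraints and check each candidate plan's timing and geometry; here the pruning step plays the same role on a single precedence relation.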

This innovation matters because it addresses a major hurdle in robotics: enabling autonomy in unstructured environments. Service robots in homes or hospitals could learn new tasks quickly by watching videos, reducing the need for costly reprogramming. For instance, a robot could learn to assist with chores or caregiving by observing human demonstrations, making robotics more accessible and adaptable. The method's use of formal logic also ensures that plans are explainable and verifiable, which is crucial for safety and trust in human-robot interactions.

However, the approach has limitations. It assumes reliable object detection and accurate 3D positioning from sensors like stereo cameras, which may not always be available in real-world conditions. The current evaluation focused on a controlled table-setting example, and it remains to be seen how well it scales to more complex, dynamic environments. Additionally, the method depends on pre-defined parthood relations (e.g., knowing that a hand is part of the body), which might require prior knowledge for new domains.

Original Source

Read the complete research paper

View on arXiv

About the Author

Guilherme A.

Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.

Connect on LinkedIn