Robotics

Robots Learn Faster with AI-Generated Rewards

A new method uses language and vision models to create automatic feedback, boosting robot success rates by up to 56% without manual tuning—paving the way for more adaptable household assistants.

AI Research
April 02, 2026
3 min read

Robots trained to perform complex, multi-step tasks like navigating a house to find and pick up an object often struggle with errors that accumulate over time, limiting their reliability in real-world settings. Traditional approaches rely on sparse success signals or labor-intensive reward engineering, which can hinder adaptation and efficiency. A new approach called Vision-Language-Long-horizon Reward (VLLR) addresses this by automatically generating dense, informative feedback using large language and vision models, enabling robots to learn more effectively from their interactions.

The researchers found that VLLR significantly improves robot performance on long-horizon tasks, such as those in the CHORES benchmark covering mobile manipulation and navigation. On in-distribution tasks, VLLR achieved up to 56% absolute success rate gains over the pretrained foundation policy and outperformed state-of-the-art reinforcement learning finetuning methods by up to 5%. For out-of-distribution tasks, which involve unseen compositions like finding objects based on attributes or affordances, VLLR provided up to 10% higher success rates. These improvements were measured using both task success rate and Success weighted by Episode Length (SEL), a metric that accounts for efficiency by rewarding faster completions.
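The article does not reproduce the exact SEL formula, but efficiency-weighted success metrics of this kind conventionally follow an SPL-style weighting: each successful episode is discounted by the ratio of the shortest possible episode length to the length the agent actually took. A minimal sketch under that assumption:

```python
import numpy as np

def success_weighted_by_episode_length(successes, optimal_lengths, episode_lengths):
    """SPL-style efficiency-weighted success (assumed SEL form).

    successes        -- 1 if the episode succeeded, else 0
    optimal_lengths  -- shortest achievable episode length per task
    episode_lengths  -- number of steps the agent actually took
    """
    s = np.asarray(successes, dtype=float)
    l_star = np.asarray(optimal_lengths, dtype=float)
    l = np.asarray(episode_lengths, dtype=float)
    # A failed episode contributes 0; a success is discounted by how far
    # the agent's episode length exceeded the optimal length.
    return float(np.mean(s * l_star / np.maximum(l, l_star)))
```

Under this weighting, an agent that succeeds but takes twice the optimal number of steps scores 0.5 for that episode, which is why a policy can raise SEL (as VLLR does) even when its raw success rate is similar to a baseline's.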

VLLR combines two key components: an extrinsic reward derived from large language models (LLMs) and vision-language models (VLMs) for task progress recognition, and an intrinsic reward based on policy self-certainty. First, an LLM decomposes high-level task instructions into verifiable subgoals using scene-graph representations of the environment. Then, a VLM estimates progress by evaluating visual observations against these subgoals, providing a coarse-grained signal that reflects task advancement. To avoid prohibitive inference costs, this VLM-derived reward is used only during a brief 200,000-step warm-up phase to initialize the value function. Second, policy self-certainty—a measure of how concentrated the action distribution is—serves as a dense per-step intrinsic reward throughout the Proximal Policy Optimization (PPO) finetuning, guiding local action refinement without manual engineering.
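The article does not give the exact self-certainty formula, but since self-certainty measures how concentrated the policy's action distribution is, one plausible instantiation is one minus the normalized entropy of that distribution, scaled by a small weighting coefficient (the `beta` below is illustrative, not from the paper):

```python
import numpy as np

def self_certainty_reward(action_probs, beta=0.01):
    """Hypothetical per-step intrinsic reward: 1 - normalized entropy.

    A uniform (maximally uncertain) distribution yields 0; a one-hot
    (maximally decisive) distribution yields beta. `beta` is an assumed
    weighting coefficient balancing the intrinsic term against the
    task-level reward.
    """
    p = np.asarray(action_probs, dtype=float)
    p = p / p.sum()                      # normalize defensively
    nonzero = p[p > 0]
    entropy = -np.sum(nonzero * np.log(nonzero))
    max_entropy = np.log(p.size)         # entropy of the uniform distribution
    certainty = 1.0 - entropy / max_entropy
    return beta * certainty
```

This shape matches the paper's reported failure mode: because the reward grows with decisiveness regardless of task progress, an overly large `beta` would let the policy collapse onto confident but wrong actions, which is why the authors stress balancing it against task-level supervision.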

Analysis of the results, as detailed in Tables I and II of the paper, shows that VLLR enhances both success rates and efficiency across diverse tasks. For example, on the Fetch task, which involves locating and acquiring an object, VLLR achieved a 70.7% success rate with a 63.2 SEL, compared to 65.23% and 54.7 for the sparse reward baseline. Ablation studies in Table III reveal complementary benefits: VLM-based value initialization primarily improves task completion efficiency by encoding high-level structure, while self-certainty primarily boosts success rates, especially on out-of-distribution tasks. The researchers observed that policies trained with VLLR exhibit more decisive action sequences and align better with task decomposition, reducing unnecessary exploration.

This advancement matters because it reduces the need for manual reward design, a major bottleneck in deploying robots for complex, real-world applications like household chores or industrial automation. By leveraging generalizable signals from foundation models, VLLR enables robots to adapt more quickly to new tasks and environments, potentially making them more versatile and reliable assistants. The method's reliance on zero-shot capabilities of LLMs and VLMs also scales better than approaches requiring task-specific tuning or extensive labeled data.

However, the study has limitations. Evaluation was conducted solely on the CHORES benchmark, which uses discrete action spaces and procedurally generated houses from ProcThor. Extending VLLR to continuous-action manipulation benchmarks remains an area for future work. Additionally, the VLM progress estimation can be noisy due to hallucinations, requiring correction algorithms to prevent premature saturation of rewards. The researchers also note that assigning overly large weights to the self-certainty reward can destabilize training, necessitating careful balancing with task-level supervision.

Original Source

Read the complete research paper on arXiv

About the Author

Guilherme A.

Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.

Connect on LinkedIn