Robots Learn Complex Tasks in Hours, Not Months

Training robots to perform complex manipulation tasks has traditionally required months of trial and error or extensive human demonstrations. A new method called Hi-ORS enables robots to master contact-rich tasks, such as inserting objects and packing items, in just hours of real-world training. It achieves 23.3% higher success rates than previous approaches while requiring minimal human intervention.

The researchers found that combining outcome-based filtering with intermediate-step supervision creates a stable learning framework for vision-language-action models. Unlike traditional reinforcement learning, which relies on value estimates that are often inaccurate, Hi-ORS simply discards negatively rewarded rollouts and retains only the episodes that a golden model judges successful. This rejection sampling approach eliminates the overestimation bias that typically plagues robotic learning in high-dimensional action spaces.

The method works through a two-phase process: evaluation and improvement. During evaluation, the robot generates trajectories using its current policy, then applies a reward-based filter to identify successful attempts. Only trajectories that meet or exceed a progressively increasing reward threshold are retained for training. In the improvement phase, the system uses these high-quality samples to update the robot's policy through supervised learning, specifically employing flow matching to provide dense supervision across all intermediate steps of the action generation process.
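The evaluate-then-improve loop described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `linear_threshold`, `evaluate`, and `rejection_sampling_loop` are hypothetical, the linear schedule is only one possible way to raise the threshold, and `update_fn` is a placeholder for the supervised flow-matching update, which is not implemented here.

```python
from typing import Callable, List, Tuple

Trajectory = List[float]
Sample = Tuple[Trajectory, float]  # (trajectory, reward)


def linear_threshold(iteration: int, total: int,
                     start: float = 0.2, end: float = 0.8) -> float:
    """Hypothetical schedule: the reward threshold rises linearly over training."""
    if total <= 1:
        return end
    frac = min(iteration, total - 1) / (total - 1)
    return start + frac * (end - start)


def evaluate(rollout_fn: Callable[[], Sample],
             num_rollouts: int, threshold: float) -> List[Sample]:
    """Evaluation phase: roll out the current policy, keep only samples at or above the threshold."""
    kept = []
    for _ in range(num_rollouts):
        trajectory, reward = rollout_fn()
        if reward >= threshold:  # rejection sampling: everything else is discarded
            kept.append((trajectory, reward))
    return kept


def rejection_sampling_loop(rollout_fn: Callable[[], Sample],
                            update_fn: Callable[[List[Sample]], None],
                            iterations: int, num_rollouts: int) -> List[Sample]:
    """Alternate evaluation and improvement for a fixed number of iterations."""
    buffer: List[Sample] = []
    for it in range(iterations):
        threshold = linear_threshold(it, iterations)
        buffer.extend(evaluate(rollout_fn, num_rollouts, threshold))
        if buffer:
            # Improvement phase: placeholder for the supervised policy update.
            update_fn(buffer)
    return buffer
```

In a real system, `rollout_fn` would run the robot and score the episode, and `update_fn` would perform the supervised training step on the retained trajectories.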

Results from real-world experiments demonstrate significant performance gains. On the Raise-Hand task, Hi-ORS achieved optimal performance, reaching the target pose directly. On more challenging tasks like Pack-Detergent and Insert-Moisturizer, it reached asymptotic success rates with fewer interactions. The system showed particular strength in test-time scalability: allowing the robot to retry failed attempts led to monotonic performance improvements, with the marginal benefit of each additional retry diminishing as compute increased.
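A simple probability model shows why retries produce monotonic gains with diminishing returns. The sketch below assumes, purely for illustration, that each attempt succeeds independently with the same probability p, so the chance of at least one success in k attempts is 1 - (1 - p)^k. Real retries on a robot are unlikely to be fully independent, and the value of p used here is hypothetical rather than a reported result.

```python
from typing import List


def success_within(p: float, k: int) -> float:
    """Probability of at least one success in k independent attempts."""
    return 1.0 - (1.0 - p) ** k


def marginal_gains(p: float, max_attempts: int) -> List[float]:
    """Extra success probability contributed by each additional attempt."""
    return [success_within(p, k) - success_within(p, k - 1)
            for k in range(1, max_attempts + 1)]


if __name__ == "__main__":
    p = 0.5  # hypothetical single-attempt success rate
    for k, gain in enumerate(marginal_gains(p, 4), start=1):
        print(f"attempts={k} success={success_within(p, k):.4f} gain={gain:.4f}")
```

Under this model, each extra attempt adds p(1 - p)^(k-1), which shrinks geometrically: the success curve keeps rising but flattens as more compute is spent.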

What makes this approach particularly valuable for real-world applications is its efficient use of human guidance. The system incorporates varied-frequency human corrections during data collection, where operators can intervene at critical moments to demonstrate recovery behaviors or correct failures. These interventions create rich counterfactual demonstrations that would be nearly impossible to discover through random exploration alone. The method maintains data quality by only retaining positively rewarded episodes, ensuring suboptimal human corrections don't contaminate the training buffer.
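One way to picture the data handling described above is a buffer that records which steps came from a human operator and admits only positively rewarded episodes. The sketch below is an assumption-laden illustration: the `Step`, `Episode`, and `FilteredBuffer` types and the `from_human` flag are hypothetical names, not structures taken from the Hi-ORS work.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Step:
    action: List[float]
    from_human: bool = False  # True when an operator took over at this step


@dataclass
class Episode:
    steps: List[Step]
    reward: float  # outcome reward assigned to the whole episode


@dataclass
class FilteredBuffer:
    episodes: List[Episode] = field(default_factory=list)

    def add(self, episode: Episode) -> bool:
        """Keep the episode only if its reward is positive; report whether it was kept."""
        if episode.reward > 0:
            self.episodes.append(episode)
            return True
        return False

    def human_fraction(self) -> float:
        """Share of retained steps that came from human corrections."""
        steps = [s for e in self.episodes for s in e.steps]
        if not steps:
            return 0.0
        return sum(s.from_human for s in steps) / len(steps)
```

Because the filter acts on the episode outcome, a human correction that still ends in failure is dropped along with the rest of that episode.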

The practical implications are substantial for industries relying on robotic automation. Manufacturing facilities could deploy robots that quickly adapt to new assembly tasks, while logistics centers could implement systems that learn complex packing operations without extensive programming. The method's efficiency, with 80% success rates reached within 1.5 hours of training, makes robotic learning feasible in time-sensitive environments where traditional approaches would be impractical.

However, the approach does have limitations. The current implementation focuses on single-task learning, and extending to multi-task or longer-horizon settings remains future work. The reward threshold scheduler may introduce bias toward high-variance outcomes in stochastic environments, and the method's performance depends on the quality of human interventions during the correction phases. Despite these constraints, Hi-ORS represents a significant step toward practical robotic learning that balances autonomy with strategic human guidance.

About the Author

Guilherme A.