
New AI Method Matches Top Algorithms in Robot Training

A simpler approach achieves similar performance to advanced reinforcement learning methods, potentially speeding up AI development for robotics and simulations.

AI Research
November 14, 2025

Artificial intelligence systems that train robots and virtual agents often rely on complex algorithms, but a new study shows that a simpler method can deliver comparable results. This finding could streamline the development of AI for applications like autonomous navigation and industrial automation, making it more accessible and efficient. Researchers from Ohio State University and Adobe Research have introduced the Proximal Policy Gradient (PPG) algorithm, which builds on foundational policy gradient methods to achieve performance on par with the widely used Proximal Policy Optimization (PPO).

The key discovery is that PPG produces average returns similar to PPO in various robotics environments, as detailed in the paper. For instance, in OpenAI Gym tasks like Ant and HalfCheetah, PPG achieved average rewards of 5527 and 4783, respectively, closely matching PPO's 5583 and 4310. This indicates that the new algorithm can effectively optimize AI policies without the need for more intricate surrogate objectives, relying instead on a partial variation of the vanilla policy gradient approach.
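The contrast the authors draw, PPO's clipped surrogate objective versus a variation of the vanilla policy gradient, can be sketched in a few lines of NumPy. The function names and toy numbers below are illustrative and are not taken from the paper's implementation:

```python
import numpy as np

def ppo_clip_loss(ratio, adv, eps=0.2):
    """PPO's clipped surrogate: -E[min(r*A, clip(r, 1-eps, 1+eps)*A)],
    where r is the new/old probability ratio and A the advantage."""
    clipped = np.clip(ratio, 1 - eps, 1 + eps)
    return -np.mean(np.minimum(ratio * adv, clipped * adv))

def vpg_loss(log_probs, adv):
    """Vanilla policy gradient objective: -E[log pi(a|s) * A]."""
    return -np.mean(log_probs * adv)

# Toy batch of two samples with positive advantages.
ratio = np.array([1.5, 0.5])   # new/old probability ratios
adv = np.array([1.0, 1.0])     # advantage estimates

# The clip caps the first sample's contribution at 1.2 instead of 1.5.
print(ppo_clip_loss(ratio, adv))
```

The vanilla objective weights each log-probability directly by its advantage; PPO's surrogate adds the ratio-clipping machinery that the paper argues can be replaced by a simpler rule.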

Methodologically, the team developed PPG by modifying the vanilla policy gradient (VPG) objective and incorporating a clipping mechanism in what they term the 'advantage-policy plane.' This strategy adjusts policy updates based on whether actions are advantageous or disadvantageous, using thresholds to prevent excessive changes. Specifically, when the advantage function—a measure of an action's benefit—is positive, probabilities are increased up to a set limit, and when negative, they are decreased similarly. This process ensures stable learning by focusing updates on samples that require adjustment, as illustrated in Figure 1 of the paper, which shows how policy probabilities evolve over iterations.
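As a rough illustration of that clipping rule, a sample-selection step might look like the sketch below. The function name and the threshold values are hypothetical, not the paper's actual parameters:

```python
import numpy as np

def ppg_sample_mask(probs, advantages, upper=0.8, lower=0.2):
    """Hypothetical PPG-style clipping in the advantage-policy plane.

    Keep the gradient only for samples that still need adjustment:
    - advantage > 0: raise the action's probability, unless it already
      exceeds the `upper` threshold;
    - advantage < 0: lower the action's probability, unless it is
      already below the `lower` threshold.
    """
    raise_mask = (advantages > 0) & (probs < upper)
    lower_mask = (advantages < 0) & (probs > lower)
    return raise_mask | lower_mask

# Toy batch: action probabilities under the current policy and advantages.
probs = np.array([0.9, 0.5, 0.1, 0.6])
advantages = np.array([1.0, 1.0, -1.0, -1.0])

# Samples 1 and 3 still need adjustment; 0 and 2 are clipped out.
print(ppg_sample_mask(probs, advantages))  # [False  True False  True]
```

Masking out samples whose probabilities have already moved past the thresholds is what keeps updates focused and prevents excessive policy change.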

Analysis of the results reveals that PPG maintains performance while exhibiting lower KL divergence—a metric for policy change—compared to PPO, as seen in Figure 3. For example, in Bullet environments, PPG's KL values stayed consistently lower, indicating more controlled updates. Additionally, entropy, which measures randomness in decision-making, tended to decay in PPG, suggesting it produces more deterministic policies over time. The paper's tables confirm that PPG and PPO yield comparable rewards across multiple runs, with the two methods sometimes essentially tied, as in the Walker2d environment, where PPG averaged a reward of 3943 versus PPO's 3957.
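The two diagnostics discussed here, KL divergence and entropy, have standard closed forms for discrete action distributions. A minimal sketch, using toy distributions rather than the paper's data:

```python
import numpy as np

def kl_divergence(p, q):
    """KL(p || q): how far the updated policy q has moved from the old policy p."""
    return np.sum(p * np.log(p / q))

def entropy(p):
    """Shannon entropy: higher means more random (exploratory) action choices."""
    return -np.sum(p * np.log(p))

old_policy = np.array([0.5, 0.3, 0.2])    # action probabilities before an update
new_policy = np.array([0.6, 0.25, 0.15])  # action probabilities after an update

print(kl_divergence(old_policy, new_policy))  # small value -> controlled update
print(entropy(new_policy))                    # lower entropy -> more deterministic policy
```

Tracking these two quantities over training iterations is how the paper compares the update behavior of PPG and PPO.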

In practical terms, this research matters because it simplifies AI training for real-world tasks. Reinforcement learning is crucial for developing systems that operate in dynamic environments, from self-driving cars to automated manufacturing. By demonstrating that a straightforward gradient-based method can rival more complex algorithms, the study opens doors for faster and more reliable AI deployments, reducing computational costs and implementation barriers for engineers and researchers.

However, the paper notes limitations, including that PPG's performance can decay in some cases compared to PPO, and its exploration of different clipping strategies was not exhaustive. The authors highlight that further investigations are needed to generalize the approach across diverse scenarios and optimize the clipping parameters, leaving room for future work to enhance robustness and applicability.

Original Source

Read the complete research paper on arXiv.

About the Author

Guilherme A.

Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.

Connect on LinkedIn