AIResearch

Evolution Strategies Fall Short as AI Training Boost

A new study finds that simpler AI methods can't reliably speed up complex learning tasks, challenging hopes for faster and more stable training in robotics and gaming.

AI Research
April 02, 2026
4 min read

Artificial intelligence researchers have long sought ways to make training AI systems faster and more stable, especially for tasks like robotics and video games. A new study from EPFL investigates whether a simpler, gradient-free optimization method called Evolution Strategies (ES) could serve as a shortcut to accelerate more advanced deep reinforcement learning (DRL) techniques. The findings, published on arXiv, reveal that while ES shows promise in basic environments, it struggles to scale to complex tasks, offering limited benefits for pretraining and failing to enhance robustness in demanding scenarios.

The researchers compared ES with DRL algorithms like Deep Q-Networks (DQN) and Proximal Policy Optimization (PPO) across tasks of varying difficulty, including Flappy Bird, Breakout, and MuJoCo environments. They tested two key hypotheses: whether ES trains faster than DRL, and whether it can be used as a pretraining step to improve DRL performance. The results, detailed in the paper, show that ES does not consistently train faster than DRL. In simpler environments like Flappy Bird, ES pretraining helped DQN reach higher rewards faster, but in more complex tasks like Breakout and MuJoCo, it provided minimal or no improvement in training efficiency or stability.
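The pretraining hypothesis rests on ES and the DRL agent sharing a network architecture, so parameters found by ES can directly seed the DRL network before fine-tuning. A minimal sketch of that handoff, where the layer sizes follow the paper's discrete-action setup (two hidden layers of 64 units) but the input/output dimensions, function names, and initialization scale are illustrative assumptions:

```python
import numpy as np

def init_mlp(sizes, rng):
    """Random (weights, biases) for a fully connected net with the given layer sizes."""
    return [(rng.standard_normal((m, n)) * 0.1, np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

rng = np.random.default_rng(0)

# Policy optimized by ES: two hidden layers of 64 units, as in the paper's
# discrete-action experiments. The 8 inputs / 2 actions are assumptions.
es_policy = init_mlp([8, 64, 64, 2], rng)

# ... ES training would update `es_policy` here ...

# Pretraining handoff: start the DQN Q-network from the ES solution
# instead of a fresh random initialization.
dqn_q_network = [(W.copy(), b.copy()) for W, b in es_policy]
```

As the paper's limitations section notes, this handoff is only this simple when both methods optimize the same network; PPO's separate actor and critic networks break the one-to-one correspondence.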

The methodology involved training ES and DRL agents from scratch and evaluating them under identical conditions, using shared neural network architectures to facilitate comparison. For discrete action spaces in Flappy Bird and Breakout, a multilayer perceptron with two hidden layers of 64 units was used, while for continuous control in MuJoCo, a fully connected network with four hidden layers of 32 units served as the policy. The ES algorithm, as described in the paper, works by sampling random perturbations to the policy parameters, evaluating each perturbed policy in the environment, and updating the parameters based on the resulting rewards without computing gradients. This approach is naturally parallelizable and avoids backpropagation, but it requires complete episode rollouts, which can delay updates in long tasks.
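The perturb-evaluate-update loop described above can be sketched in a few lines of NumPy. The hyperparameters, the toy quadratic objective, and the function names are illustrative assumptions, not the paper's actual configuration; in the study, `evaluate` would be a full episode rollout in the environment:

```python
import numpy as np

rng = np.random.default_rng(0)

def evolution_strategies(evaluate, dim, pop_size=50, sigma=0.1, lr=0.02, iters=200):
    """Maximize evaluate(theta) without ever computing a gradient."""
    theta = np.zeros(dim)
    for _ in range(iters):
        # Sample random perturbations of the policy parameters.
        eps = rng.standard_normal((pop_size, dim))
        # Evaluate each perturbed policy (in RL, one full episode rollout each;
        # these rollouts are independent, hence naturally parallelizable).
        returns = np.array([evaluate(theta + sigma * e) for e in eps])
        # Standardize returns so the update is invariant to reward scale.
        adv = (returns - returns.mean()) / (returns.std() + 1e-8)
        # Move parameters toward perturbations that earned higher returns:
        # a gradient-free update, no backpropagation involved.
        theta = theta + lr / (pop_size * sigma) * eps.T @ adv
    return theta

# Toy objective: -||theta - target||^2 peaks at theta == target.
target = np.array([1.0, -2.0, 0.5])
best = evolution_strategies(lambda th: -np.sum((th - target) ** 2), dim=3)
```

Note that every iteration must wait for all `pop_size` evaluations to finish; when an evaluation is a long episode rollout, this is exactly the update delay the paper points out.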

Analysis of the results, referencing figures from the paper, highlights clear performance differences. In Flappy Bird, Figure 1 shows that ES pretraining accelerated DQN's learning curve, allowing it to reach competitive rewards faster than training from scratch. However, in Breakout, Figure 2 indicates that DQN consistently outperformed ES, achieving mean rewards around 30 with a CNN policy, while ES plateaued at much lower levels, near 1.5 for image-based inputs and around 4 for RAM-based inputs. In MuJoCo environments, Figure 3 demonstrates that PPO converged 20 times faster than ES in HalfCheetah but was unstable in Walker2d and Hopper, whereas ES solved most environments reliably but slowly, failing to fully solve Walker2d. Pretraining PPO with ES did not improve training speed or robustness, as shown in Figure 3a for Hopper.

The implications of these findings are significant for AI development, particularly in fields like robotics and gaming where training efficiency and stability are critical. The study suggests that ES may be useful for exploration in low-complexity settings, such as simple games, but combining it with gradient-based methods in high-dimensional tasks remains challenging. For everyday readers, this means that hopes for a universal shortcut to faster AI training are tempered by the reality that simpler methods don't always scale, potentially slowing progress in applications like autonomous systems or advanced game AI that require robust and efficient learning.

Limitations of the research, as noted in the paper, include architectural and learning dynamics differences between ES and DRL that hinder transfer. For instance, PPO uses separate actor and critic networks, while ES optimizes a single policy, leading to incompatible representations. Additionally, ES performed poorly in complex environments like Breakout, especially with high-dimensional inputs, indicating difficulty in scaling. The paper suggests future work should explore adaptive hybrid approaches that align learned representations or modify architectures to bridge this gap, but for now, ES's role as a pretraining tool appears limited to simpler scenarios.

Original Source

Read the complete research paper

View on arXiv

About the Author

Guilherme A.

Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.
