AI Learns to Explore Without Sacrificing Rewards

Artificial intelligence systems often get stuck exploiting a few high-reward strategies, but a new method teaches them to explore diverse goals while maintaining strong performance. This breakthrough, detailed in a recent paper by researchers from Google DeepMind, addresses a common limitation in reinforcement learning where AI agents focus narrowly on maximizing returns, potentially missing out on valuable alternatives in complex environments like robotics or molecular discovery.

The researchers developed an algorithm that balances two objectives: achieving high cumulative reward and visiting a wide range of goal states uniformly. Unlike traditional approaches that prioritize either exploration or exploitation, the method optimizes a combined metric that rewards both high returns and dispersed state visits. In controlled experiments, it spread visits across goal states near-optimally without compromising reward, outperforming baselines such as Soft Actor-Critic and pseudo-count methods.

Methodologically, the team used an iterative process based on the Frank-Wolfe optimization technique. At each step, the algorithm samples trajectories from the current policy mixture, estimates how frequently states are visited, and computes a custom reward function that encourages exploration of under-visited goals. This reward decreases for states that are frequently encountered, incentivizing the AI to seek out new areas. The approach relies on offline reinforcement learning subroutines, specifically Fitted Q-Iteration, to update policies without requiring prior knowledge of all possible goal states, making it scalable to large systems.
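The iterative loop described above can be sketched on a toy problem. The Python below is a minimal illustration under stated assumptions, not the paper's implementation. Each base policy is assumed to reach exactly one goal state, so visit frequencies are known exactly rather than estimated from sampled trajectories. The objective (return plus an entropy-style bonus weighted by a hypothetical `lam`) is a stand-in for the paper's combined metric, and a plain argmax replaces the Fitted Q-Iteration subroutine. The function name and all parameters are hypothetical.

```python
import numpy as np


def frank_wolfe_goal_spreading(returns, lam=1.0, n_iters=500, eps=1e-3):
    """Toy Frank-Wolfe loop over a mixture of 'reach goal g' policies.

    returns[g] is the assumed return of the base policy that reaches goal g.
    The result is a vector of mixture weights, one per base policy.
    """
    returns = np.asarray(returns, dtype=float)
    n_goals = len(returns)

    # Start with all weight on the first base policy (overwritten at step 0,
    # because the first step size equals 1).
    mix = np.zeros(n_goals)
    mix[0] = 1.0

    for t in range(n_iters):
        # Assumption: policy g visits goal g with probability 1, so the
        # visit frequencies equal the current mixture weights.
        visits = mix

        # Custom reward: the task return plus a bonus that shrinks as a goal
        # is visited more often (eps keeps the logarithm finite).
        reward = returns + lam * (-np.log(visits + eps))

        # Stand-in for the RL subroutine: pick the best base policy directly.
        best = int(np.argmax(reward))
        vertex = np.zeros(n_goals)
        vertex[best] = 1.0

        # Standard Frank-Wolfe step size and convex mixture update.
        step = 2.0 / (t + 2.0)
        mix = (1.0 - step) * mix + step * vertex

    return mix


if __name__ == "__main__":
    print(frank_wolfe_goal_spreading([1.0] * 7).round(3))
    print(frank_wolfe_goal_spreading([1.0, 0.0]).round(3))
```

In this toy setting, equal returns drive the mixture toward an even spread over goals, while raising one goal's return shifts weight toward it without abandoning the others.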

Results from synthetic and benchmark environments, including robotic simulators like reacher, pusher, ant, and half-cheetah, demonstrate the algorithm's effectiveness. In discrete MDP experiments, it achieved a uniform distribution over goal states, whereas methods like Q-learning exploited only a subset. Metrics such as return, partial entropy, and a modified Gini criterion showed that the new method matches the best performers in reward while significantly improving diversity. For example, in one visualization, it spread probability mass evenly across seven goal states, unlike baselines that concentrated on fewer.
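To make the diversity metrics concrete, the sketch below computes two standard measures over a goal-visit distribution: Shannon entropy and the Gini coefficient. These are assumed stand-ins; the article does not define the paper's "partial entropy" or its "modified Gini criterion", so the exact formulas may differ. The function names are hypothetical.

```python
import numpy as np


def shannon_entropy(p):
    """Entropy in nats; highest when visits are spread evenly over goals."""
    p = np.asarray(p, dtype=float)
    p = p / p.sum()          # normalize counts or weights to probabilities
    nz = p[p > 0]            # skip zero entries, treating 0 * log(0) as 0
    return float(-(nz * np.log(nz)).sum())


def gini_coefficient(p):
    """Standard Gini coefficient: 0 for an even spread, (n - 1) / n when
    all mass sits on a single one of n goals."""
    p = np.sort(np.asarray(p, dtype=float))   # ascending order
    n = len(p)
    ranks = np.arange(1, n + 1)
    return float((2 * ranks - n - 1) @ p / (n * p.sum()))


if __name__ == "__main__":
    uniform = np.ones(7) / 7                   # even spread over seven goals
    peaked = np.array([0, 0, 0, 0, 0, 0, 1.0])  # one exploited goal
    print(shannon_entropy(uniform), gini_coefficient(uniform))
    print(shannon_entropy(peaked), gini_coefficient(peaked))
```

Under these definitions, an agent that spreads visits evenly over seven goals scores the maximum entropy and a Gini coefficient of zero, while an agent that exploits a single goal scores zero entropy and a Gini coefficient of 6/7.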

The implications are significant for real-world applications where exploring multiple outcomes is crucial, such as in robotics for handling unexpected failures or in drug discovery for identifying various stable molecular structures. By ensuring AI doesn't over-rely on a single strategy, this approach could lead to more robust and adaptable systems in fields like autonomous navigation and scientific research, where enumerating all goals beforehand is impractical.

Limitations noted in the paper include increased computational demands due to multiple calls to reinforcement learning subroutines, similar to other marginal matching techniques. Future work could focus on developing more efficient algorithms to reduce this overhead while maintaining performance guarantees.

About the Author

Guilherme A.