Robotics

AI Chooses Best Policies in Changing Environments

New method helps AI systems adapt to unexpected changes by selecting the most effective pre-trained policies without costly retraining

AI Research
November 14, 2025
3 min read

Artificial intelligence systems often struggle when their environments change unexpectedly, but a new approach allows them to quickly identify the best-performing policies without expensive retraining. This advance could make AI systems more reliable in real-world applications where conditions constantly shift, from autonomous vehicles facing new weather patterns to robots operating in unfamiliar settings.

The researchers developed a method that treats policy selection as a multi-armed bandit problem, where different pre-trained policies represent the "arms" of the bandit. Their modified Upper Confidence Bound algorithm switches between policies to identify the best performer based on accumulated rewards, ensuring the system maintains strong overall performance even as environmental conditions change. This approach works by executing each policy for a set number of steps, observing the rewards, and updating confidence bounds about each policy's expected performance.
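The selection loop described above can be sketched with a standard UCB1-style rule. This is a minimal illustration under stated assumptions, not the authors' implementation: `run_policy` (a callable that executes one policy for a fixed number of environment steps and returns the mean observed reward), the exploration constant `c`, and `steps_per_round` are all hypothetical names introduced here for the sketch.

```python
import math

def ucb_policy_selection(policies, run_policy, total_rounds, steps_per_round=100, c=2.0):
    """Select among pre-trained policies with a UCB1-style rule.

    `policies` is any list of policy objects; `run_policy(policy, steps)` is an
    assumed helper that rolls the policy out in the environment and returns the
    mean reward it collected.
    """
    n = len(policies)
    counts = [0] * n       # times each policy has been executed
    means = [0.0] * n      # running mean reward per policy

    for t in range(1, total_rounds + 1):
        if t <= n:
            i = t - 1      # execute each policy once to initialize estimates
        else:
            # UCB score: empirical mean plus an exploration bonus that shrinks
            # as a policy accumulates more observations
            i = max(range(n),
                    key=lambda k: means[k] + c * math.sqrt(math.log(t) / counts[k]))
        reward = run_policy(policies[i], steps_per_round)
        counts[i] += 1
        means[i] += (reward - means[i]) / counts[i]  # incremental mean update

    best = max(range(n), key=lambda k: means[k])
    return best, means, counts
```

As the rounds accumulate, the exploration bonus for frequently chosen policies decays, so the loop spends most of its budget on the policy whose observed rewards are highest while still occasionally re-checking the others in case conditions have shifted.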

In experiments, the team tested their method in two distinct environments. The first was a simple gridworld where agents needed to navigate to high-reward states while avoiding traps. The second, more complex test used the Seaquest Atari game environment, where AI agents received only raw pixel images as input and had to deal with various visual occlusions blocking parts of the screen. In both cases, the algorithm successfully identified the best-performing policies among a set of candidates.

The results show the algorithm achieves logarithmic regret growth, meaning its cumulative performance gap relative to always running the optimal policy grows only logarithmically over time. In the Seaquest experiments, the system correctly favored policies trained under specific occlusion patterns when those same occlusions appeared during testing. Even when faced with entirely new occlusion patterns not seen during training, the algorithm still selected policies that performed well, often choosing those whose training conditions most closely matched the current situation. When compared against baseline policies specifically trained to be robust to various occlusions, the selection method generally outperformed them within 1,500 iterations.
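For context, the classic UCB1 analysis (Auer et al., 2002) gives a cumulative-regret bound of exactly this logarithmic form; the paper's modified algorithm will have its own constants, so take this only as the standard reference bound, where $\mu^{*}$ is the mean reward of the best policy, $\mu_i$ that of policy $i$, and $\Delta_i = \mu^{*} - \mu_i$:

$$
\mathbb{E}\!\left[R(T)\right] \;=\; T\mu^{*} - \mathbb{E}\!\left[\sum_{t=1}^{T} \mu_{a_t}\right] \;\le\; \sum_{i:\,\Delta_i > 0} \frac{8 \ln T}{\Delta_i} \;+\; \Bigl(1 + \tfrac{\pi^2}{3}\Bigr) \sum_{i:\,\Delta_i > 0} \Delta_i .
$$

Intuitively, each suboptimal policy is executed only $O(\ln T / \Delta_i^2)$ times, so the total shortfall grows like $\ln T$ rather than linearly.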

This work matters because it addresses a fundamental challenge in deploying AI systems: what happens when the real world doesn't match training conditions? Instead of requiring costly retraining or settling for poor performance, this method allows systems to dynamically choose from existing policies to maintain good performance. This could benefit applications like autonomous systems that encounter unexpected sensor failures or environmental changes, where quick adaptation is crucial for safety and effectiveness.

The approach does have limitations. Performance depends heavily on the diversity of the available pre-trained policies, yet a large candidate set slows down the selection process. The method also assumes the policies are sufficiently different from one another to offer meaningful choices. Future work will explore the right balance between policy diversity and selection speed.

Original Source

Read the complete research paper

View on arXiv

About the Author

Guilherme A.


Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.

Connect on LinkedIn