Artificial intelligence systems that learn through self-play, like those mastering complex games, often rely on vast amounts of data. However, not all data is equally useful for training. Researchers have discovered that by manipulating the distribution of experience used during self-play, AI can learn more efficiently, achieving better performance in the early stages of training without increasing computational costs. This insight could accelerate the development of AI for strategic decision-making in real-world applications, from logistics to resource management.
The key finding is that weighting experience by episode duration, so that data from shorter games is upweighted and data from longer games is downweighted, improves AI performance. In experiments across fourteen board games, this approach, called Weighting According to Episode Durations (WED), led to stronger playing agents, particularly in the initial 51 to 101 self-play episodes. For example, agents using WED achieved higher win percentages against a diverse pool of opponents than agents trained with standard methods, with some reaching 60% to 85% wins early in training.
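To make the weighting idea concrete, here is a minimal Python sketch of duration-based sampling from a replay buffer. It is an illustration under stated assumptions, not the paper's implementation: the function names, the flat list of episode ids, and the choice to apply the weights at sampling time (rather than as a correction inside the loss) are all hypothetical.

```python
import random
from collections import defaultdict

def wed_sampling_probs(episode_ids):
    """Return one sampling probability per stored state.

    episode_ids[i] is a (hypothetical) id of the episode that produced
    state i. A state from an episode with T stored states gets weight 1/T,
    so every episode carries the same total probability mass.
    """
    counts = defaultdict(int)
    for ep in episode_ids:
        counts[ep] += 1
    weights = [1.0 / counts[ep] for ep in episode_ids]
    total = sum(weights)
    return [w / total for w in weights]

def sample_batch(buffer, episode_ids, batch_size, rng=random):
    """Draw a training batch (with replacement) using the weights above."""
    probs = wed_sampling_probs(episode_ids)
    return rng.choices(buffer, weights=probs, k=batch_size)

# Usage: four states from a long game "a" and one state from a short game "b".
probs = wed_sampling_probs(["a", "a", "a", "a", "b"])
# Each "a" state gets 0.125 and the single "b" state gets 0.5,
# so both episodes are equally likely to be represented in a batch.
```

Under uniform sampling, the long game in this example would supply 80% of the training data; with the weights above, each game supplies half.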
Methodologically, the study employed the Expert Iteration (ExIt) framework, in which an apprentice policy learns to mimic an expert tree search algorithm such as Monte-Carlo Tree Search (MCTS). Experience was generated by playing games between identical MCTS agents, with moves selected proportionally to visit counts to ensure diversity. Because every state of every game is stored, longer episodes contribute more training samples than shorter ones. The researchers introduced importance sampling corrections to adjust for this bias in the data distribution, so that the AI optimizes for a target in which each episode is equally represented, regardless of length.
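The two ExIt ingredients described above, picking moves in proportion to MCTS visit counts and training the apprentice to match the expert's search distribution, can be sketched in a few lines of Python. This is a simplified illustration, not the study's code: the dictionary representation of visit counts, the function names, and the use of a plain cross-entropy as the apprentice's loss are assumptions, and the sketch assumes at least one move with a positive visit count.

```python
import math
import random

def visit_distribution(visit_counts):
    """Normalize MCTS visit counts (move -> count) into probabilities."""
    total = sum(visit_counts.values())
    return {move: n / total for move, n in visit_counts.items()}

def select_move(visit_counts, rng=random):
    """Sample a move with probability proportional to its visit count."""
    moves = list(visit_counts)
    weights = [visit_counts[m] for m in moves]
    return rng.choices(moves, weights=weights, k=1)[0]

def cross_entropy(expert, apprentice, eps=1e-12):
    """Cross-entropy of the apprentice policy against the expert target.

    Lower values mean the apprentice is closer to the search distribution.
    eps guards against log(0) for moves the apprentice assigns no mass.
    """
    return -sum(p * math.log(apprentice.get(m, 0.0) + eps)
                for m, p in expert.items())

# Usage with hypothetical visit counts for two moves.
counts = {"e4": 30, "d4": 10}
expert = visit_distribution(counts)          # {"e4": 0.75, "d4": 0.25}
move = select_move(counts)                   # "e4" about 75% of the time
loss = cross_entropy(expert, {"e4": 0.5, "d4": 0.5})
```

Sampling moves this way, instead of always playing the most-visited move, is what keeps self-play games between identical agents from repeating the same line of play.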
The results, measured by win percentages and α-rank evaluations, showed that WED consistently enhanced performance. Across the fourteen games, WED agents secured more top ranks and higher strategy masses in the evolutionary rankings, indicating robust and stable play. For instance, WED alone accounted for 9 out of 28 possible top ranks across games, outperforming other extensions such as Prioritized Experience Replay (PER) and Cross-Entropy Exploration (CEE), which had mixed effects. PER, which preferentially samples data expected to provide valuable training signals, showed minor benefits but was less impactful, while CEE, designed to explore states with large mismatches between the apprentice and expert policies, often degraded performance when used without proper corrections.
In context, this research matters because it highlights the importance of data sampling strategies in AI training. By optimizing how experience is weighted, systems can learn faster and more effectively, which is crucial for applications beyond games, such as in autonomous systems or personalized AI assistants. The approach retains the efficiency of standard methods while correcting for overrepresentation of long episodes, making it accessible for real-time implementations.
The paper notes several limitations. The improvements are most pronounced in the early stages of training and vary across individual games. For example, after 200 self-play episodes the advantage diminishes, and the effects are small on average. Additionally, the study focused on board games with perfect information, leaving open how these methods apply to noisy or incomplete-information environments. Future work could explore targeted exploration or game-specific patterns to further enhance learning.
About the Author
Guilherme A.
Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.