A new comparative study in artificial intelligence research has uncovered distinct performance patterns between different approaches to offline reinforcement learning, particularly in how they handle varying levels of feedback. The research, conducted in the ANT environment—a challenging quadrupedal locomotion task from the MuJoCo physics engine—systematically evaluated Decision Transformers against traditional offline RL algorithms like Conservative Q-Learning and Implicit Q-Learning across both dense and sparse reward settings. This investigation addresses a critical gap in understanding how these algorithms adapt to different reward structures, which is essential for real-world applications where feedback may be infrequent or costly to obtain, such as in robotics or healthcare.
The researchers found that Decision Transformers showed less sensitivity to varying reward density compared to traditional offline RL algorithms and particularly excelled with medium-expert datasets in sparse reward scenarios. In sparse reward settings with the ant-medium-expert-v2 dataset, DT achieved a normalized score of 120.6 ± 1.1, significantly outperforming CQL's 103.38 ± 15.51 and IQL's 85.95 ± 21.18. This indicates DT's particular aptitude for leveraging mixed-quality data when rewards are infrequent. In contrast, traditional value-based methods like IQL showed improved performance in dense reward settings with high-quality data, while CQL offered balanced performance across different data qualities.
The methodology involved testing DT, CQL, and IQL models using the CORL implementation of offline RL models across four distinct dataset types from D4RL: medium, medium-replay, medium-expert, and expert. These datasets reflected different data collection strategies and policy performance levels, with the medium dataset having rewards around 4000 and the expert dataset closer to 6000 in the ANT environment. The researchers conducted experiments in both dense reward settings, where the agent received feedback at each time step using the default D4RL reward structure, and sparse reward settings, where rewards were assigned only to the top 25% of trajectories, i.e., those whose cumulative dense reward surpassed a threshold set at the 75th percentile of all returns in the dataset. Each experiment was run with four different random seeds, and results were averaged to ensure statistical significance.
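The sparse reward construction described above can be sketched as a simple relabeling pass over the dataset. This is an illustrative sketch, not the paper's code: the trajectory dict structure and the choice to place a reward of 1.0 at the final step of a qualifying trajectory are assumptions, since the paper only specifies which trajectories receive a reward, not its exact value or placement.

```python
import numpy as np

def sparsify_rewards(trajectories, top_fraction=0.25):
    """Relabel dense-reward trajectories with sparse rewards.

    Only trajectories whose cumulative dense return exceeds the
    75th-percentile return of the dataset keep any reward signal;
    all other trajectories are zeroed out.

    `trajectories` is assumed to be a list of dicts, each holding a
    1-D "rewards" array (hypothetical structure for illustration).
    """
    returns = np.array([traj["rewards"].sum() for traj in trajectories])
    # Threshold at the (1 - top_fraction) percentile, i.e. 75th for top 25%.
    threshold = np.percentile(returns, 100 * (1 - top_fraction))

    for traj, ret in zip(trajectories, returns):
        sparse = np.zeros_like(traj["rewards"])
        if ret > threshold:
            # Assumed convention: a single terminal reward of 1.0
            # marks a top-quartile trajectory.
            sparse[-1] = 1.0
        traj["rewards"] = sparse
    return trajectories
```

Under this scheme, roughly a quarter of trajectories carry any learning signal at all, which is what makes the sparse setting so much harder for value-based methods that bootstrap from per-step rewards.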
Analysis of the results, as shown in Table 1 of the paper, revealed that DT maintained lower variance across both reward structures, exhibiting relatively stable scores between sparse and dense settings. For instance, in the ant-medium-v2 dataset under sparse rewards, DT scored 87.9 ± 3.4, while CQL scored 91.55 ± 8.56 and IQL scored 84.49 ± 11.38, indicating DT's more consistent performance. Figure 3 demonstrated that DT exhibited the most stable learning trajectory, maintaining a normalized score between 80 and 90 across all four random seeds on the ANT medium dataset with minimal fluctuations. In dense reward settings, IQL demonstrated superior performance in some cases, such as in the ant-medium-expert-v2 dataset, where it scored 124.2 ± 5.8 compared to DT's 90.24 ± 3.39 and CQL's 107.0 ± 21.2. The study also found that all algorithms could exceed expert-level performance (scores above 100) in certain configurations, particularly with higher-quality datasets, as algorithms discovered optimizations beyond the original expert demonstrations.
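The scores quoted above follow the D4RL normalization convention, in which 0 corresponds to a random policy and 100 to the expert policy for the given environment, so scores above 100 indicate the learned policy outperformed the data-collecting expert. A minimal sketch of that convention and of the mean ± std aggregation over seeds (the reference returns in the test are placeholders, not ANT's actual values):

```python
import numpy as np

def d4rl_normalized_score(raw_return, random_return, expert_return):
    """D4RL convention: 0 = random policy, 100 = expert policy."""
    return 100.0 * (raw_return - random_return) / (expert_return - random_return)

def aggregate_over_seeds(scores):
    """Summarize per-seed scores as (mean, std), as in the paper's tables."""
    arr = np.asarray(scores, dtype=float)
    return arr.mean(), arr.std()
```

In practice the random and expert reference returns come from the D4RL benchmark itself rather than being supplied by hand; they are parameters here only to keep the sketch self-contained.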
The implications of these findings are significant for practical applications of offline reinforcement learning. The consistency of Decision Transformers in sparse reward settings suggests they may be more suitable for scenarios with uncertain reward structures or mixed-quality data, such as in robotics where feedback is limited or in healthcare where data collection is costly. However, the trade-off is computational: DT required an average training time of 7.5 hours over all runs, compared to 5 hours for CQL and 2 hours for IQL, as shown in Table 2. This presents an important consideration for practitioners who must balance performance with resource constraints. The study's emphasis on offline RL addresses themes of sustainable development by potentially reducing the need for costly online interactions, making these algorithms more accessible for real-world use.
Despite these insights, the paper notes limitations and generalizability concerns. The findings are specific to the ANT environment, and generalization to other continuous control tasks or different domains would require additional investigation. The performance characteristics observed may be influenced by the specific dynamics and complexity of the ANT environment, and different patterns might emerge in other contexts. The researchers also highlight that denser reward structures are not universally superior, as over-engineered reward functions can lead to unintended behaviors or brittle policies. Future research should scale these experiments across additional environments and benchmarks to verify whether the observed patterns hold beyond this specific task, and explore real-world datasets to assess how these algorithms handle practical constraints and uncertainties.
About the Author
Guilherme A.
Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.