Artificial intelligence systems often face scenarios where they must juggle multiple, sometimes conflicting, goals, such as maximizing efficiency while minimizing costs in a business or optimizing speed and safety in autonomous vehicles. A new study investigates the limitations of widely used techniques in multi-objective reinforcement learning (MORL), a branch of AI that deals with such complex decision-making. The research, conducted by Muhammad Sa’ood Shah and Asad Jeewa, compares different MORL algorithms in discrete environments, highlighting why simpler approaches may struggle in uncertain, real-world settings where the optimal trade-offs between objectives are not always clear.
Key findings from the paper show that scalarisation functions, which combine multiple objectives into a single value for decision-making, often fail to accurately approximate the full range of optimal solutions, known as the Pareto front. In experiments with environments like Deep Sea Treasure Concave and Four-Room, these functions tended to favor solutions in specific regions of the solution space, missing others and failing to retain all discovered solutions during learning. For instance, in the Deep Sea Treasure environment, linear scalarisation only found extreme solutions, while Chebyshev scalarisation found more but still left gaps, with both functions unable to match the performance of more advanced inner-loop multi-policy algorithms like Pareto Q-Learning, which nearly covered the entire Pareto front.
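The failure mode described above can be sketched numerically. The toy front below is an assumption for illustration (not data from the paper): on a concave two-objective front, a weighted sum can only ever pick points on the convex hull, whereas the weighted Chebyshev distance to a utopian point can reach the concave middle.

```python
import numpy as np

def linear_scalarise(rewards, weights):
    """Weighted sum: can only recover points on the convex hull of the front."""
    return np.dot(weights, rewards)

def chebyshev_scalarise(rewards, weights, utopia):
    """Weighted Chebyshev: negated largest weighted gap to a utopian point;
    can reach concave regions of the Pareto front that a linear sum misses."""
    return -np.max(weights * (utopia - rewards))

# Hypothetical concave two-objective front: the middle point (4, 4) is
# Pareto-optimal, but it lies inside the convex hull of the extremes.
front = np.array([[0.0, 10.0], [4.0, 4.0], [10.0, 0.0]])
utopia = front.max(axis=0) + 1e-3

linear_picks = set()
for w1 in np.linspace(0.0, 1.0, 11):
    w = np.array([w1, 1.0 - w1])
    linear_picks.add(int(np.argmax([linear_scalarise(p, w) for p in front])))
# linear_picks only ever contains the extremes {0, 2}: no weight vector
# makes the concave middle point the linear-scalarisation maximiser.

cheb_scores = [chebyshev_scalarise(p, np.array([0.5, 0.5]), utopia) for p in front]
# With equal weights, Chebyshev scalarisation does pick the middle point.
```

This mirrors the reported gap: linear scalarisation recovers only extreme solutions, while Chebyshev recovers more of the front.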
The methodology involved testing selected MORL algorithms, including MO Q-Learning with linear and Chebyshev scalarisation functions, as well as Pareto Q-Learning, across discrete action and observation spaces. The researchers used an outer-loop multi-policy approach for the scalarisation functions, running multiple configurations with different weight combinations to approximate the Pareto front, and an inner-loop approach for Pareto Q-Learning, which learns multiple optimal policies simultaneously. Hyperparameters were tuned through parameter sweeps, with metrics like hypervolume, sparsity, cardinality, and inverted generational distance (IGD) employed to evaluate performance, with results averaged over 10 trials for reliability.
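Two of the evaluation metrics above have compact definitions. The sketch below is a minimal, assumed implementation for two-objective maximisation fronts (the paper's own evaluation code and conventions may differ, e.g. in reference-point choice):

```python
import numpy as np

def hypervolume_2d(points, ref):
    """Hypervolume (dominated area) of a 2-objective maximisation front,
    relative to a reference point that every front point dominates."""
    pts = sorted(set(map(tuple, points)))
    nd, best_y = [], -np.inf
    for x, y in sorted(pts, reverse=True):   # scan in descending x
        if y > best_y:                        # keep only non-dominated points
            nd.append((x, y)); best_y = y
    nd.sort()                                 # ascending x, descending y
    hv, prev_x = 0.0, ref[0]
    for x, y in nd:                           # sum the vertical strips
        hv += (x - prev_x) * (y - ref[1])
        prev_x = x
    return hv

def igd(approx, true_front):
    """Inverted generational distance: mean Euclidean distance from each
    true-front point to its nearest approximation point (lower is better)."""
    a, t = np.asarray(approx, float), np.asarray(true_front, float)
    d = np.linalg.norm(t[:, None, :] - a[None, :, :], axis=-1)
    return float(d.min(axis=1).mean())
```

An approximation that covers the whole known front scores IGD 0, which is why Pareto Q-Learning's low IGD in the results below indicates near-complete coverage.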
The analysis, detailed in figures from the paper, revealed significant differences in algorithm performance. In the Deep Sea Treasure Concave environment, Pareto Q-Learning converged faster and found more solutions, with a hypervolume of 801.82 and cardinality of 9, compared to 777.66 and 2 for linear scalarisation and 779.281 and 3 for Chebyshev. The IGD scores were 0.137 for Pareto Q-Learning versus 4.123 and 3.611 for the scalarisation functions, indicating closer coverage of the true Pareto front. In the Four-Room environment, where the Pareto front is unknown, linear scalarisation outperformed Chebyshev in hypervolume (71.38 vs. 49) and cardinality (12 vs. 9), but both functions converged to different solution sets, suggesting unreliability and bias in their search patterns.
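The core operation that lets Pareto Q-Learning retain many solutions at once is a non-dominated filter over sets of Q-vectors, rather than a single max over scalars. A minimal sketch of that set operator (an illustrative simplification, not the paper's implementation):

```python
def dominates(a, b):
    """a Pareto-dominates b: at least as good in every objective, strictly
    better in at least one (maximisation convention)."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def nd_filter(vectors):
    """Keep only non-dominated reward vectors -- the set-valued analogue of
    the max operator that inner-loop multi-policy methods maintain per state."""
    return [v for v in vectors if not any(dominates(u, v) for u in vectors if u != v)]
```

Because every non-dominated vector survives the update, no discovered trade-off is discarded mid-learning, unlike a scalarised agent that keeps only the single policy its current weights prefer.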
The implications of these findings are substantial for real-world AI applications, as they suggest that relying on scalarisation functions could lead to suboptimal decision-making in dynamic environments where objectives conflict or change over time. For example, in areas like robotics, healthcare, or traffic management, AI systems need to adapt to varying user preferences and uncertain conditions. The study argues that inner-loop multi-policy algorithms, which learn multiple solutions in a single run, offer a more sustainable and generalizable approach, potentially enhancing intelligent systems' ability to handle complexity without wasting computational resources on redundant configurations.
Limitations of the research, as noted in the paper, include the focus on discrete environments, leaving continuous and pixel-based spaces unexplored, which could affect the generalizability of the findings. Additionally, the study primarily assessed scalarisation techniques and did not extensively evaluate newer inner-loop algorithms like MORL/D or CAPQL, suggesting areas for future work. The computational infeasibility of scaling weight configurations as objectives increase further underscores the need for more efficient approaches in multi-objective AI systems.
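The scaling problem mentioned above is easy to quantify: the number of weight vectors on an evenly spaced simplex grid grows combinatorially with the number of objectives. A small sketch (the grid construction is an assumption for illustration; the paper's weight-sampling scheme may differ):

```python
from itertools import product

def weight_grid_size(n_objectives, steps):
    """Count weight vectors (w1, ..., wn) with wi in {0, 1/steps, ..., 1}
    summing to 1, i.e. an even simplex grid. Equals C(steps + n - 1, n - 1),
    so an outer loop must train one agent per point -- combinatorial growth."""
    return sum(1 for w in product(range(steps + 1), repeat=n_objectives)
               if sum(w) == steps)
```

With 10 steps per weight, two objectives need 11 runs, three need 66, and five already need 1001, which is why the paper flags outer-loop weight sweeps as computationally infeasible at scale.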
About the Author
Guilherme A.
Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.