Artificial intelligence agents that interact with complex environments, such as virtual homes or e-commerce websites, often struggle to navigate efficiently due to a lack of awareness of their surroundings. This limitation, termed structural blindness, leads to wasted actions and poor performance in tasks requiring multiple steps. Researchers from JD.com have developed a new method called Graph-Enhanced Preference Optimization (GEPO) that addresses this by constructing dynamic maps of the environment, enabling agents to identify and prioritize key locations like hallways or product pages. This advancement could improve AI applications in customer service, robotics, and automated workflows, making them more reliable and effective.
The key finding is that GEPO significantly boosts success rates in interactive tasks by overcoming structural blindness. In experiments on benchmarks like ALFWorld, WebShop, and a proprietary Workbench dataset, GEPO achieved improvements of up to +4.1%, +5.3%, and +10.9% in success rates compared to competitive baselines. For instance, on the WebShop benchmark, which simulates online shopping, GEPO increased the success rate from 90.8% to 94.9% for a 7-billion-parameter model, demonstrating its ability to guide agents through noisy, high-dimensional environments more effectively.
Methodologically, GEPO builds a state-transition graph from the agent's experiences, using text-based observations to map out the environment's connectivity. This graph is updated in real-time as the agent explores, with nodes representing states (e.g., rooms or web pages) and edges denoting transitions between them. The researchers employed graph-theoretic measures, particularly betweenness centrality, to identify critical points—such as bottlenecks in paths—that are strategically important. These insights are integrated into the reinforcement learning process through three mechanisms: structured rewards that incentivize visiting high-impact states, topology-aware credit assignment that highlights crucial intermediate actions, and adaptive discounting that adjusts planning horizons based on a state's importance.
Results from the paper show that GEPO not only outperforms existing methods but does so with greater stability and lower variance in training. Ablation studies confirm that each component—intrinsic rewards, advantage aggregation, and dynamic discounting—contributes synergistically; removing any one leads to performance drops of 1-3%, while combined removals cause super-additive declines of up to 7.3%. Additionally, the method scales well with model size, with larger models like the 7-billion-parameter variant showing more pronounced benefits. The use of betweenness centrality proved most effective, outperforming other metrics like eigenvector or closeness centrality by up to 5.8% in success rates, as it better captures the must-pass nature of certain states in tasks like navigating multi-room environments.
In real-world contexts, this research matters because it enhances the efficiency of AI agents in scenarios where sparse feedback is common, such as virtual assistants, automated browsing, or robotic navigation. By reducing redundant exploration and improving decision-making, GEPO could lead to faster task completion and lower computational costs in applications ranging from e-commerce to home automation. However, the paper notes limitations, including increased computational overhead—GEPO is 20-30% more expensive per step due to graph maintenance—and potential issues with scalability in extremely large environments. Future work could focus on incremental graph algorithms for massive state spaces or extending the approach to multi-modal inputs like robotics with visual data.
About the Author
Guilherme A.
Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.
Connect on LinkedIn