Artificial intelligence agents powered by large language models are increasingly tasked with complex, multi-step problems, from managing calendars to navigating customer service systems. Yet, these agents often struggle to adapt to new environments, approaching each task from scratch without learning from their accumulated experiences. This limitation not only hampers efficiency but also makes AI systems less reliable in real-world applications where tasks cannot be repeated. A new framework called Experiential Reflective Learning (ERL) addresses this by enabling agents to self-improve through reflection on past successes and failures, offering a path toward more adaptable and consistent AI performance.
The researchers found that ERL significantly enhances agent success rates on challenging benchmarks. On the Gaia2 benchmark, which tests agents in a simulated mobile environment with 12 applications and 101 tools, ERL achieved an overall success rate of 56.1%. This represents a 7.8% improvement over a standard ReAct baseline and outperforms prior experiential learning methods such as ExpeL and AutoGuide. The gains were consistent across different task types, with an 8.3% increase on Execution tasks and a 7.1% increase on Search tasks. More importantly, ERL improved agent reliability, as measured by pass^3—the ability to succeed on all three runs of a task—with boosts of 8.3% in Execution and 10.6% in Search, indicating more stable performance across repeated attempts.
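To make the reliability metric concrete, here is a minimal sketch of how a pass^k score can be computed from per-task run outcomes. The function name and data layout are illustrative, not taken from the paper; the paper only defines pass^3 as succeeding on all three runs of a task.

```python
from typing import Dict, List

def pass_k(runs: Dict[str, List[bool]], k: int = 3) -> float:
    """Fraction of tasks solved on ALL k independent runs.

    `runs` maps a task id to that task's per-run success flags.
    Tasks with fewer than k recorded runs are skipped.
    """
    eligible = {t: r[:k] for t, r in runs.items() if len(r) >= k}
    if not eligible:
        return 0.0
    solved = sum(all(r) for r in eligible.values())
    return solved / len(eligible)

results = {
    "task_a": [True, True, True],    # reliably solved
    "task_b": [True, False, True],   # flaky: counts against pass^3
    "task_c": [False, False, False],
}
print(round(pass_k(results, k=3), 3))  # -> 0.333
```

Note how a flaky task that succeeds two out of three times still counts as a failure under pass^3, which is why the metric rewards stability rather than peak performance.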
The methodology behind ERL involves two key components: heuristic generation and retrieval-augmented execution. After each task, the agent reflects on its experience—including the task description, execution trajectory, and outcome—to generate a structured heuristic. This heuristic contains an analysis of what led to success or failure and a learned guideline with specific trigger conditions and recommended actions. For example, a heuristic might advise: "When sending emails to calendar attendees, first resolve names to email addresses via the Contacts tool before calling the email API." These heuristics are stored in a persistent pool. When facing a new task, an LLM scores stored heuristics for relevance based on task similarity, diversity, and informativeness, and injects the top-k (typically 20) into the agent's context to guide execution, as illustrated in Figure 1 of the paper.
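The storage-and-retrieval loop described above can be sketched as follows. The class and field names are assumptions for illustration, and the LLM judge is replaced by an injected scoring callable (here a toy keyword-overlap function) so the sketch stays self-contained and runnable.

```python
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class Heuristic:
    # Structured reflection output, per the paper's description:
    # an analysis plus a guideline with trigger condition and action.
    analysis: str
    trigger: str
    action: str
    source: str  # "success" or "failure"

@dataclass
class HeuristicPool:
    heuristics: List[Heuristic] = field(default_factory=list)

    def add(self, h: Heuristic) -> None:
        self.heuristics.append(h)

    def retrieve(self, task: str,
                 score: Callable[[str, Heuristic], float],
                 k: int = 20) -> List[Heuristic]:
        # In ERL the scorer is an LLM judging relevance, diversity,
        # and informativeness; here it is any injected callable.
        ranked = sorted(self.heuristics,
                        key=lambda h: score(task, h), reverse=True)
        return ranked[:k]

# Toy scorer standing in for the LLM judge: shared-word count.
def overlap_score(task: str, h: Heuristic) -> float:
    return len(set(task.lower().split()) & set(h.trigger.lower().split()))

pool = HeuristicPool()
pool.add(Heuristic(
    analysis="Email to an attendee failed: a name is not an address.",
    trigger="sending emails to calendar attendees",
    action="resolve names via the Contacts tool before the email API",
    source="failure",
))
top = pool.retrieve("send an email to all calendar attendees",
                    score=overlap_score, k=20)
print(top[0].action)
```

The retrieved heuristics would then be injected into the agent's prompt before execution; the decoupling of pool storage from the scoring function mirrors how the framework can swap retrieval strategies without changing the pool.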
The authors' analysis reveals several critical insights. Heuristics proved more transferable than raw trajectories; few-shot prompting with raw trajectories actually decreased performance by 1.9% compared to the baseline, while heuristics provided distilled strategic principles that generalized better. Retrieval quality was more important than quantity: LLM-based retrieval with 20 heuristics achieved a 56.1% success rate, outperforming both embedding-based retrieval (53.3%) and random selection, which peaked around 40–60 heuristics before degrading, as shown in Figure 4. Additionally, the source of heuristics mattered: failure-derived heuristics excelled on Search tasks by providing negative constraints, while success-derived heuristics worked best on Execution tasks by reinforcing proven actions, though retrieving from both sources offered the most reliable compromise.
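For contrast with the LLM judge, here is a minimal sketch of the weaker embedding-based retrieval baseline, using bag-of-words cosine similarity as a stand-in for a real embedding model (the actual model and guideline texts are assumptions, not from the paper).

```python
from collections import Counter
from math import sqrt

def cosine(a: str, b: str) -> float:
    """Bag-of-words cosine similarity: a toy stand-in for a
    learned text-embedding model."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    na = sqrt(sum(c * c for c in va.values()))
    nb = sqrt(sum(c * c for c in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve_by_embedding(task: str, guidelines: list, k: int = 20):
    # Pure similarity ranking: no diversity or informativeness signal,
    # which helps explain why this baseline trailed LLM-based scoring.
    return sorted(guidelines, key=lambda g: cosine(task, g),
                  reverse=True)[:k]

guidelines = [
    "resolve names before you email calendar attendees",
    "confirm cancellation policy before refunding an order",
]
print(retrieve_by_embedding("email all calendar attendees",
                            guidelines, k=1))
```

Because similarity alone can return near-duplicate guidelines, the LLM judge's extra diversity and informativeness criteria plausibly account for its edge (56.1% vs 53.3%).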
The implications of ERL extend to real-world applications where AI agents must operate in dynamic environments without constant retraining. By enabling parameter-free improvement through experiential learning, ERL could enhance AI systems in areas like customer service, data management, and automated workflows, making them more adaptable and less error-prone. The framework's ability to improve reliability, as evidenced by pass^3 gains, suggests it could reduce the need for human oversight in repetitive tasks. However, the paper notes limitations, such as a 40% increase in computational costs due to added token usage and challenges in scaling heuristic pools, including potential conflicts between guidelines. Future work may explore more compact heuristic representations or address coordination in dual-control domains, where user interaction adds unpredictability, as seen in the Telecom domain of τ²-bench where ERL's performance dropped slightly.
Original Source
Read the complete research paper
About the Author
Guilherme A.
Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.
Connect on LinkedIn