
AI Agents Evolve Without Human Data

A new framework enables AI to teach itself complex reasoning by pitting two agents against each other, eliminating the need for costly human-curated datasets and unlocking scalable learning.

AI Research
March 27, 2026
4 min read

Artificial intelligence systems have long been constrained by their reliance on massive amounts of human-curated data, creating a bottleneck that limits scalability and ties AI capabilities to human knowledge and annotation speed. This dependency is particularly acute for large language model agents trained with reinforcement learning, which require extensive interaction with environments for tasks like deep research and agentic coding. Now, researchers have developed a fully autonomous framework called Agent0 that allows AI agents to evolve from scratch without any external data, using a novel co-evolutionary approach that integrates tool use to break free from these limitations. This breakthrough could pave the way for more scalable and independent AI systems that learn complex reasoning without human intervention.

The key finding from this research is that Agent0 significantly enhances the reasoning capabilities of base language models by establishing a symbiotic competition between two agents initialized from the same base model. Specifically, the framework improves the Qwen3-8B-Base model by 18% on mathematical reasoning benchmarks and 24% on general reasoning benchmarks, as shown in Tables 1 and 2 of the paper. These gains are achieved entirely without human-annotated data, demonstrating that AI can autonomously generate its own training curriculum and solve increasingly complex problems through iterative self-improvement. The system outperforms existing self-evolving frameworks such as R-Zero and Absolute Zero, highlighting its effectiveness in driving capability gains beyond what previous approaches could achieve.

The methodology behind Agent0 involves a co-evolutionary loop in which two functionally distinct agents, a curriculum agent and an executor agent, are initialized from the same base large language model. The curriculum agent is trained with reinforcement learning to generate frontier tasks that challenge the executor agent, with rewards based on the executor's uncertainty and frequency of tool use. Concurrently, the executor agent is trained via reinforcement learning to solve these tasks, using pseudo-labels derived from its own majority voting. This process is enhanced by integrating a code interpreter tool, which allows the executor to execute Python code for problem-solving, creating a virtuous cycle where improved tool use drives the curriculum agent to generate more complex, tool-aware tasks. The framework also supports multi-turn interactions, enabling the generation of context-rich, conversational tasks that better reflect real-world problem-solving dynamics.
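To make the loop concrete, here is a minimal sketch of its two reward signals as described above: the executor's pseudo-label comes from majority voting over its own sampled answers, and the curriculum agent is rewarded for tasks that leave the executor maximally uncertain (vote agreement near 50/50) and that require tool calls. The function names and the specific weights are illustrative assumptions, not the paper's exact formulation.

```python
from collections import Counter

def majority_vote_pseudo_label(answers):
    """Pseudo-label = most common answer among the executor's samples;
    the agreement ratio doubles as a confidence signal."""
    counts = Counter(answers)
    label, votes = counts.most_common(1)[0]
    return label, votes / len(answers)

def curriculum_reward(answers, tool_calls, uncertainty_weight=1.0, tool_weight=0.5):
    """Reward for a generated task (simplified): highest when the executor
    is maximally uncertain (agreement near 0.5) and when solving the task
    required tool use."""
    _, agreement = majority_vote_pseudo_label(answers)
    uncertainty = 1.0 - abs(2 * agreement - 1.0)  # peaks at a 50/50 split
    return uncertainty_weight * uncertainty + tool_weight * tool_calls
```

Under this sketch, a task on which four executor samples split 3-to-1 (agreement 0.75) and which took two tool calls scores higher than one the executor answers unanimously with no tools, which is exactly the pressure that keeps the curriculum at the executor's frontier.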

The analysis, detailed across ten benchmarks, shows consistent improvement through multiple iterations of the co-evolutionary process. For instance, on Qwen3-8B-Base, the average math score improved from 55.1 in iteration 1 to 58.2 in iteration 3, as illustrated in Figure 4. Table 5 further reveals that the pass rate of the executor agent on tasks generated by the curriculum agent decreases over iterations, indicating increasing task difficulty, while the average number of tool calls per task rises from 1.65 to 2.60. Ablation studies in Table 3 confirm the importance of each component: removing the curriculum agent's training leads to a 9.3% performance drop, while excluding the tool reward results in a 7.2% decline. These data points underscore that the co-evolutionary loop and tool integration are critical drivers of the observed performance gains.

The implications of this research are profound for the future of AI development, as it offers a scalable pathway to train high-performing agents without the time-consuming and costly need for human-curated datasets. By enabling AI to autonomously generate and solve complex problems, Agent0 could accelerate progress in fields requiring advanced reasoning, such as scientific research, coding, and data analysis. The framework's ability to generalize from mathematical reasoning to general-domain tasks, as shown in Table 2, suggests that the skills cultivated through this self-evolution process are transferable, potentially leading to more versatile AI systems. This approach also reduces the dependency on human knowledge, allowing AI to explore problem spaces beyond current human expertise and fostering innovation in autonomous learning.

Despite its successes, the research acknowledges limitations, including the potential for curriculum stagnation if the agents' capabilities are capped by their inherent knowledge, a risk that tool integration helps mitigate but may not fully resolve. The framework relies on pseudo-labels from majority voting, which can introduce label noise, particularly for high-ambiguity tasks, though this is addressed through ambiguity-aware advantage scaling in the Ambiguity-Dynamic Policy Optimization algorithm. Additionally, while Agent0 demonstrates strong performance on benchmarks, its real-world applicability in diverse, unstructured environments remains to be tested, and the computational resources required for the iterative co-evolution process may pose scalability challenges for some applications. Future work could explore extending the framework to more domains and improving efficiency to broaden its impact.
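The ambiguity-aware idea can be sketched simply: when the executor's self-voted agreement on a task is low, its pseudo-label is least trustworthy, so the policy-update advantage for that sample is scaled down or dropped. This is an illustrative simplification, not the paper's exact Ambiguity-Dynamic Policy Optimization update; the linear schedule and the 0.5 cutoff below are assumptions.

```python
def scale_advantage(advantage, agreement, threshold=0.5):
    """Ambiguity-aware advantage scaling (simplified): shrink the advantage
    on high-ambiguity tasks so that noisy majority-vote pseudo-labels
    contribute less to the reinforcement-learning update."""
    if agreement <= threshold:
        return 0.0  # too ambiguous: exclude the sample from the update
    # Linearly map agreement in (threshold, 1] to a factor in (0, 1].
    factor = (agreement - threshold) / (1.0 - threshold)
    return advantage * factor
```

A unanimous vote passes the advantage through unchanged, a 50/50 split zeroes it out, and intermediate agreement interpolates between the two, which is the trade-off the paper describes between learning from self-generated labels and limiting the damage from noisy ones.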

Original Source

Read the complete research paper

View on arXiv

About the Author

Guilherme A.

Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.

Connect on LinkedIn