A breakthrough in artificial intelligence research demonstrates that large multimodal models (LMMs) can significantly improve their own reasoning capabilities without any human supervision. Researchers have developed a self-evolving framework called EvoLMM that enables AI systems to learn entirely from internal consistency, using only raw images as input. This approach addresses a fundamental limitation in current AI training pipelines, which typically depend on human-curated data or externally verified reward models, restricting autonomy and scalability. The ability of AI to self-improve without external guidance represents a major step toward more autonomous and adaptable intelligent systems.
The core finding is that LMMs can enhance their mathematical and visual reasoning through a continuous self-rewarding process. The researchers found that by splitting a single backbone model into two cooperative agents—a Proposer that generates image-grounded questions and a Solver that answers them—the system creates its own training signal. This closed-loop framework operates without any annotated data, metadata, or external reward models, relying solely on the degree of agreement among the Solver's multiple answer samples. The continuous reward mechanism provides smooth gradients that enable stable optimization, allowing the model to gradually refine both question generation and reasoning abilities.
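The closed loop described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the callable names and the use of modal-answer frequency as the agreement measure are assumptions for the sketch.

```python
from collections import Counter

def self_rewarding_step(generate_question, sample_answer, image, n_samples=8):
    """One closed-loop step: the Proposer writes an image-grounded
    question, the Solver answers it several times, and the agreement
    among those samples becomes the training signal. The callables
    and the agreement measure here are illustrative assumptions."""
    # Proposer: generate a question grounded in the raw image.
    question = generate_question(image)

    # Solver: draw several independent answer samples.
    answers = [sample_answer(image, question) for _ in range(n_samples)]

    # Empirical answer distribution: count each distinct answer, then
    # take the modal answer's frequency as a continuous agreement score
    # in [1/n_samples, 1] -- smoother than a discrete majority vote.
    counts = Counter(answers)
    agreement = max(counts.values()) / n_samples
    return question, answers, agreement
```

Because the agreement score varies continuously with how many samples coincide, it provides a gradient signal even when the Solver is far from consensus, which the article credits with stabilizing early training.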
The methodology involves a purely unsupervised approach in which the Proposer generates diverse questions from unlabeled images and the Solver produces multiple independent answers. The researchers developed a continuous self-consistency reward based on the empirical answer distribution, which scales with the agreement among samples. This replaces discrete majority-vote rewards that proved unstable in early training stages. The Proposer receives an entropy-based reward that encourages questions of moderate difficulty—neither too trivial nor unsolvable—creating an automatic curriculum. Both agents are trained jointly using REINFORCE policy gradients with token-level KL regularization to prevent deviation from the pretrained base model.
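The two reward signals can be illustrated as follows. This is a sketch under stated assumptions: agreement is measured as the modal answer's frequency, and the Proposer's reward peaks at a tunable entropy target; the paper's exact functional forms may differ.

```python
import math
from collections import Counter

def solver_reward(answers):
    """Continuous self-consistency reward (illustrative form): the
    fraction of sampled answers matching the most common answer."""
    counts = Counter(answers)
    return max(counts.values()) / len(answers)

def proposer_reward(answers, target_entropy=0.7):
    """Entropy-based Proposer reward (illustrative form): highest when
    the Solver's answer distribution has moderate entropy, i.e. the
    question is neither trivial (all answers agree, entropy 0) nor
    unsolvable (answers spread uniformly, entropy near maximum).
    The target value 0.7 is an assumed tuning constant."""
    n = len(answers)
    probs = [c / n for c in Counter(answers).values()]
    entropy = -sum(p * math.log(p) for p in probs)
    # Normalize by the maximum possible entropy for n samples.
    max_entropy = math.log(n)
    normalized = entropy / max_entropy if max_entropy > 0 else 0.0
    # Reward peaks when normalized entropy is near the target.
    return 1.0 - abs(normalized - target_entropy)
```

With these shapes, a question whose answers split evenly between two candidates scores higher for the Proposer than one answered identically every time or one answered at random, which is the automatic-curriculum effect the article describes. The REINFORCE updates with token-level KL regularization would then use these scalars as per-episode returns.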
Experimental results show consistent performance gains across multiple multimodal math-reasoning benchmarks. When applied to the Qwen2.5-VL-7B base model, EvoLMM achieved improvements of approximately 2-3% on datasets including ChartQA (84.00% to 86.70%), MathVista (68.46% to 70.52%), and ScienceQA (88.30% to 89.50%). The framework proved robust across different model backbones, with similar gains observed when applied to InternVL3-8B, Gemma3-12B-It, and Llama-3.2-11B-Vision-Instruct models. Analysis revealed that the Proposer gradually learns to generate more complex visual problems over time, while the Solver develops more structured reasoning chains, demonstrating emergent self-evolution.
The implications of this research extend to domains where human annotations are unavailable or expensive to obtain. By enabling AI systems to learn directly from raw visual data, this approach could accelerate development in scientific research, medical imaging analysis, and educational applications where labeled datasets are scarce. The continuous reward design provides a more stable learning signal than previous discrete rewards, potentially preventing model collapse and enabling longer-term self-improvement. This work represents progress toward open-ended, fully autonomous multimodal intelligence that can adapt to new domains without human intervention.
Despite these advances, the framework has limitations acknowledged in the paper. The current implementation focuses specifically on mathematical and visually grounded reasoning tasks, and its effectiveness on other multimodal domains remains untested. The training requires careful parameter tuning, including entropy targets and KL regularization, to maintain stability without external supervision. Additionally, while the framework scales across model sizes—with the 72B variant showing stronger absolute gains—it still depends on the quality of the pretrained base model's multimodal alignment. Future research directions include exploring curriculum emergence, self-generated data scaling, and long-horizon reasoning without supervision.
About the Author
Guilherme A.
Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.