AIResearch
Science

AI Training Recipe Boosts Multimodal Reasoning

A new open-source method improves AI's ability to reason across images and text by 11.6%, offering a transparent blueprint for building smarter models.

AI Research
March 26, 2026
3 min read

Artificial intelligence systems that can reason about both images and text are becoming increasingly important, but building them has often been a black-box process. Researchers have now developed a fully transparent training recipe that significantly enhances these multimodal reasoning capabilities, achieving an 11.6% improvement over existing approaches. This breakthrough provides a clear, reproducible blueprint that could accelerate progress in AI systems that need to understand complex visual and textual information together, from scientific analysis to everyday problem-solving.

The key finding from this research is that a carefully designed two-stage training process can substantially boost AI performance on multimodal reasoning tasks. The method, called OpenMMReasoner, combines supervised fine-tuning (SFT) with reinforcement learning (RL) to create models that outperform existing state-of-the-art systems across nine different benchmarks. As shown in Figure 1, OpenMMReasoner consistently beats competing approaches, with particularly strong gains on mathematical reasoning tasks involving both images and text. The researchers achieved this by focusing on data quality and diversity rather than simply scaling up dataset size, demonstrating that smarter curation strategies can yield better results with less data.
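
To make the two-stage recipe concrete, here is a minimal, purely illustrative Python sketch. The function names and the dictionary standing in for a model are our own stand-ins, not the OpenMMReasoner codebase's actual API; only the stage order and rough sample counts come from the article.

```python
# Illustrative sketch of the two-stage recipe; all names are stand-ins,
# not the OpenMMReasoner project's real API.

def supervised_finetune(model, traces):
    """Stage 1 stub: fit the model on teacher-verified reasoning traces."""
    return {**model, "sft_done": True, "sft_samples": len(traces)}

def reinforce(model, prompts):
    """Stage 2 stub: refine with a GRPO-family RL algorithm (GSPO in the paper)."""
    return {**model, "rl_done": True, "rl_samples": len(prompts)}

base = {"name": "Qwen2.5-VL-7B-Instruct"}
model = supervised_finetune(base, traces=range(583_000))  # ~583K SFT samples
model = reinforce(model, prompts=range(74_000))           # ~74K RL samples
print(model)
```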

The methodology begins with constructing a high-quality dataset for the supervised fine-tuning stage. Researchers collected approximately 103,000 raw question-answer pairs from public sources covering general visual question answering and reasoning tasks. They then used a strong teacher model, Qwen3-VL-235B-Instruct, to generate multiple verified reasoning traces for each question, expanding the dataset to 583,000 samples. This approach, detailed in Table 2, shows that using a powerful teacher model significantly improves data quality. The researchers also discovered that increasing answer diversity (generating multiple correct reasoning paths for the same question) boosted performance, as shown in Table 3, where moving from one to eight answers per question improved average benchmark scores from 50.5 to 55.2.
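
The expansion step can be pictured with a short sketch: sample several candidate reasoning traces per question from a teacher model and keep only those whose final answer matches the reference. The teacher call below is a random stub standing in for Qwen3-VL-235B-Instruct, and every name here is hypothetical.

```python
import random

def sample_teacher_trace(qa):
    """Stub for a teacher-model call; returns (reasoning_text, final_answer)."""
    final = random.choice([qa["answer"], "wrong"])  # fake teacher, right ~half the time
    return f"step-by-step reasoning for question {qa['id']}", final

def expand_with_verified_traces(raw_pairs, traces_per_question=8):
    """Keep only traces whose final answer matches the reference (verification)."""
    expanded = []
    for qa in raw_pairs:
        for _ in range(traces_per_question):
            reasoning, final = sample_teacher_trace(qa)
            if final == qa["answer"]:  # verification filter
                expanded.append({**qa, "trace": reasoning})
    return expanded

raw = [{"id": i, "question": f"q{i}", "answer": "42"} for i in range(3)]
print(len(expand_with_verified_traces(raw)))  # paper: ~103K pairs -> ~583K samples
```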

The results demonstrate clear advantages across multiple evaluation metrics. After the supervised fine-tuning stage, the model showed strong performance on mathematical reasoning benchmarks, achieving scores of 36.6 on MathVision and 57.7 on MathVerse. The subsequent reinforcement learning phase further enhanced these capabilities, with the final model reaching 79.5 on MathVista and 63.8 on MathVerse, as detailed in Table 6. The reinforcement learning stage used a dataset of 74,000 samples across diverse domains and employed the GSPO algorithm, which proved more stable and efficient than alternatives like DAPO or GRPO, as illustrated in Figure 4. The complete training pipeline, shown in Figure 2, includes both stages with full transparency about data sources and processing steps.
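
For intuition on the RL stage, here is a hedged NumPy sketch of a GSPO-style update: where GRPO clips token-level importance ratios, GSPO clips a length-normalized, sequence-level ratio against group-normalized rewards. The clip range, shapes, and toy inputs below are illustrative assumptions, not the paper's settings.

```python
import numpy as np

def gspo_loss(logp_new, logp_old, rewards, lengths, clip_eps=0.2):
    # logp_new/logp_old: summed log-probs of each sampled response, shape (G,)
    # rewards: scalar reward per response in the group, shape (G,)
    # lengths: token count of each response, shape (G,)
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)  # group-normalized advantage
    ratio = np.exp((logp_new - logp_old) / lengths)            # sequence-level, length-normalized
    clipped = np.clip(ratio, 1 - clip_eps, 1 + clip_eps)
    return -np.mean(np.minimum(ratio * adv, clipped * adv))    # clipped surrogate objective

# Toy usage with a group of G = 4 sampled responses
G = 4
rng = np.random.default_rng(0)
print(gspo_loss(rng.normal(-50, 5, G), rng.normal(-50, 5, G),
                rng.random(G), rng.integers(20, 40, G).astype(float)))
```

Operating at the sequence level avoids the high-variance per-token ratios that can destabilize long reasoning rollouts, which is consistent with the stability advantage the authors report in Figure 4.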

The implications of this work extend beyond immediate performance improvements. By open-sourcing all components, including data pipelines, datasets, and model weights, the researchers have created a reproducible foundation for future multimodal reasoning research. This transparency addresses a critical gap in the field, where many advanced models are developed without clear documentation of their training processes. The model also demonstrates efficient reasoning: as shown in Figure 6, OpenMMReasoner achieves better accuracy than competing approaches while using significantly fewer computational resources, making it more practical for real-world applications where efficiency matters.

Despite these advances, the research has limitations that point to future directions. The work primarily focuses on the Qwen2.5-VL-7B-Instruct model family and evaluates performance mainly within the image domain, leaving open questions about extension to other modalities like video or audio. Additionally, while the study explores scaling strategies, it hasn't identified the upper bound of model performance under further scaling. The researchers note that broadening evaluation to more model architectures and exploring generation capabilities across multiple modalities simultaneously would help validate the generality of their approach across different AI systems and applications.

Original Source

Read the complete research paper

View on arXiv

About the Author

Guilherme A.

Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.

Connect on LinkedIn