
AI Models Improve by Thinking Before They Act

A new method makes AI systems generate explanations before creating data representations, boosting accuracy in image and text retrieval by nearly 5%.

AI Research
March 26, 2026
4 min read

Artificial intelligence systems that can process both images and text, known as multimodal large language models (MLLMs), are becoming essential for tasks like searching for pictures based on descriptions or answering questions about visual content. Traditionally, these models create data representations, called embeddings, by directly encoding inputs without any internal reasoning process. However, researchers have discovered that by making these models 'think' through a problem first—generating a step-by-step explanation or rationale—they can produce significantly better embeddings. This approach, detailed in a new study, improves retrieval performance by 4.9% on a standard benchmark, showing that explicit reasoning isn't just for generating answers but can enhance the very foundations of how AI understands and represents information.
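Before looking at the method itself, it may help to see what embedding-based retrieval looks like in its simplest form: encode queries and candidates into vectors, then rank candidates by cosine similarity. The sketch below is a toy illustration with made-up vectors, not output from any real model:

```python
import numpy as np

def cosine_similarity(a, b):
    # Normalize both vectors and take the dot product.
    a = a / np.linalg.norm(a)
    b = b / np.linalg.norm(b)
    return float(np.dot(a, b))

def retrieve(query_emb, candidate_embs):
    # Rank candidates by cosine similarity to the query;
    # return the index of the best match.
    scores = [cosine_similarity(query_emb, c) for c in candidate_embs]
    return int(np.argmax(scores))

# Toy example: the query vector is closest to candidate 1.
query = np.array([1.0, 0.2, 0.0])
candidates = [
    np.array([0.0, 1.0, 0.0]),   # dissimilar
    np.array([0.9, 0.3, 0.1]),   # similar
    np.array([-1.0, 0.0, 0.5]),  # roughly opposite
]
best = retrieve(query, candidates)
print(best)  # 1
```

Better embeddings, in this framing, simply means vectors whose cosine similarities more reliably put the correct candidate on top.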

The key finding from the research is that MLLMs, when trained to generate rationales before extracting embeddings, create more informative and discriminative representations. In the study, the researchers developed a method called Reasoning Guided Embeddings (RGE), which forces the model to produce a structured explanation based on the input query before forming the final embedding. For example, in a task where the query asks to find an image of 'a human and an animal from a different species' based on a picture of manta rays, a baseline model without reasoning might incorrectly retrieve an image of just animals. In contrast, the RGE model generates a rationale like 'look for an underwater scene featuring a diver interacting with marine life,' leading to a correct match. This demonstrates that reasoning helps the model capture deeper semantics beyond surface-level features, resulting in more accurate retrievals across diverse tasks like classification, visual question answering, and visual grounding.
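The two-stage flow described above can be sketched as: first generate a rationale conditioned on the query, then extract the embedding from the query together with its rationale. In the sketch below, `generate_rationale` is a stub standing in for a real MLLM generation call, and `<EMB>` is a hypothetical placeholder name for the special embedding token, not necessarily the token the paper uses:

```python
def generate_rationale(query: str) -> str:
    # Stand-in for an autoregressive MLLM pass that produces a
    # structured explanation of what the query is asking for.
    return f"Look for content matching the key entities in: {query}"

def build_embedding_input(query: str, rationale: str,
                          emb_token: str = "<EMB>") -> str:
    # The final embedding is read from the hidden state of a special
    # token appended after the rationale; "<EMB>" is a placeholder.
    return f"{query}\n{rationale}\n{emb_token}"

def reasoning_guided_input(query: str) -> str:
    # Stage 1: reason first. Stage 2: prepare the embedding pass.
    rationale = generate_rationale(query)
    return build_embedding_input(query, rationale)

prompt = reasoning_guided_input(
    "a human and an animal from a different species")
print(prompt)
```

The point of the design is that the embedding pass sees the model's own reasoning, so the representation can encode inferred semantics rather than only surface features of the query.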

To implement this, the researchers designed a joint training framework that combines two objectives: language modeling for rationale generation and contrastive learning for embedding quality. They first curated a dataset of oracle rationales by using a powerful MLLM, InternVL-3.5-38B, to generate explanations for why a query matches its target in the MMEB benchmark. These rationales served as supervised targets during training. However, a critical issue emerged: if the model used these oracle rationales directly in contrastive learning, it could 'cheat' by aligning the rationale with the target without genuinely understanding the query, a problem termed information leakage. To avoid this, RGE uses self-generated rationales on-the-fly during contrastive training, ensuring the embeddings rely on the model's own reasoning rather than pre-existing answers. The training involves a special token whose hidden state is used as the embedding, and the model is fine-tuned with a balanced loss function that weights the language modeling and contrastive objectives.
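The joint objective can be sketched as a weighted sum of a language-modeling loss over rationale tokens and an InfoNCE-style contrastive loss over embeddings. The numpy sketch below is a numerical illustration of how the two losses combine, not the paper's training code; the 1:10 weighting follows the ablation reported in the study, and the `lm_loss` value fed in is a placeholder:

```python
import numpy as np

def info_nce_loss(query_embs, target_embs, temperature=0.05):
    # Contrastive loss: each query should match its own target
    # (the diagonal) against all other targets in the batch.
    q = query_embs / np.linalg.norm(query_embs, axis=1, keepdims=True)
    t = target_embs / np.linalg.norm(target_embs, axis=1, keepdims=True)
    logits = q @ t.T / temperature
    # Row-wise log-softmax; the positive pair sits on the diagonal.
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))

def joint_loss(lm_loss, contrastive_loss, w_lm=1.0, w_con=10.0):
    # 1:10 weighting of language-modeling vs. contrastive objectives,
    # following the ablation described in the paper.
    return w_lm * lm_loss + w_con * contrastive_loss

rng = np.random.default_rng(0)
queries = rng.normal(size=(4, 8))
targets = queries + 0.01 * rng.normal(size=(4, 8))  # aligned pairs
total = joint_loss(lm_loss=2.3,  # placeholder LM loss value
                   contrastive_loss=info_nce_loss(queries, targets))
print(round(total, 3))
```

Note how the contrastive term drops sharply when each query's embedding sits closest to its own target, which is exactly the property the self-generated (rather than oracle) rationales are meant to preserve without leakage.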

The results, as shown in Table 2 of the paper, are compelling. On the MMEB benchmark, which includes 36 datasets across four task types, RGE achieved an overall score of 70.1, outperforming a non-reasoning baseline at 65.2 and many other models at similar scales. For instance, in classification tasks, RGE scored 64.4 compared to the baseline's 59.5; in visual question answering, it reached 67.8 versus 59.9; and in visual grounding, it achieved 90.1 against 88.0. Ablation studies confirmed the importance of reasoning: disabling reasoning at inference time caused a performance drop, though the model still beat the baseline, indicating that training with rationales improves embeddings generally. Additionally, the researchers tested different loss weightings and found that a ratio of 1:10 for language modeling to contrastive loss yielded optimal results, as shown in Table 3, with overall performance peaking at 67.4.

The implications of this research extend beyond academic benchmarks to real-world applications where accurate multimodal retrieval is crucial. For everyday users, this means AI systems in search engines, virtual assistants, or content recommendation platforms could become more precise in understanding complex queries that involve both images and text. For example, when searching for a specific scene in a video or identifying objects in medical images, reasoning-guided embeddings could reduce errors and improve efficiency. The study also highlights a broader trend in AI: moving beyond treating models as black boxes to incorporating explicit, interpretable steps like rationale generation, which not only boosts performance but may enhance transparency. However, the paper notes limitations, such as the potential for reasoning to introduce hallucinations or irrelevant details, especially when applied to text candidates in large retrieval pools, as observed in experiments on MSCOCO. Future work could explore how to mitigate these issues and extend the approach to other modalities or larger-scale models.

Original Source

Read the complete research paper

View on arXiv

About the Author

Guilherme A.

Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.

Connect on LinkedIn