AIResearch
Science

AI Achieves Gold Medal Chemistry Skills

A new AI system solves complex chemistry problems at a level matching top human contestants, using a team of specialized agents and enhanced visual understanding to overcome a key bottleneck in scientific AI.

AI Research
March 26, 2026
4 min read

Artificial intelligence has reached a new milestone in scientific reasoning, with a system that can tackle chemistry problems at the level of the International Chemistry Olympiad, a competition for the world's brightest high school students. Researchers have developed an AI framework that scores 93.6 out of 100 on a benchmark derived from the 2025 IChO theoretical exam, surpassing an estimated human gold medal threshold based on past performance data. This achievement marks a significant step forward in AI's ability to handle the dense, symbolic visual language of chemistry, which integrates molecular structures, reaction schemes, and quantitative calculations in ways that have previously stumped even advanced models.

The key finding from this research is that combining structured visual guidance with a multi-agent system enables AI to perform at a gold-medal level in chemistry problem-solving. The system, called ChemLabs, uses a hierarchical team of specialized agents that mimic human expert collaboration, breaking down complex problems into manageable sub-tasks. When augmented with Structured Visual Enhancement (SVE), which provides machine-readable descriptions of visual elements like molecular diagrams, the AI's performance jumps dramatically. For example, with the Gemini-2.5 Pro model, the score increased from 70.6 in a baseline configuration to 93.6 with both SVE and the multi-agent system, as shown in Table 2 of the paper. This score exceeds the estimated gold medal cutoff of about 75 points derived from 2021 IChO statistics, indicating capabilities comparable to the top percentile of human contestants.

The methodology behind this breakthrough involves two main innovations: the ChemO benchmark and the ChemLabs framework. ChemO reformulates IChO 2025 problems using Assessment-Equivalent Reformulation (AER), which converts tasks requiring visual outputs, such as drawing molecules, into formats like SMILES strings that are easier for AI to process and evaluate. This allows models to leverage their text-generation strengths without needing to produce precise drawings. Additionally, SVE provides structured textual descriptions of visual content, helping to isolate whether performance limitations stem from poor visual perception or weaker chemical reasoning. ChemLabs then employs a manager agent to decompose problems, dispatching sub-tasks to specialized modules for perception, solving, and auditing, with iterative refinement to ensure accuracy.
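The decompose-solve-audit loop described above can be sketched in a few lines of Python. This is purely illustrative: the function names (`decompose`, `solve`, `audit`) and the stubbed responses are hypothetical stand-ins for what would, in the actual system, be calls to specialized LLM agents with their own prompts.

```python
def decompose(problem: str) -> list[str]:
    # Manager agent (stubbed): split a compound problem into sub-tasks.
    # The real system would prompt an LLM to produce this decomposition.
    return [part.strip() for part in problem.split(";") if part.strip()]

def solve(subtask: str) -> str:
    # Specialized solver agent (stubbed): return a candidate answer.
    return f"answer({subtask})"

def audit(subtask: str, answer: str) -> bool:
    # Auditor agent (stubbed): verify the candidate answer.
    # A trivial format check stands in for an LLM-based verification step.
    return answer.startswith("answer(")

def pipeline(problem: str, max_rounds: int = 3) -> dict[str, str]:
    """Decompose a problem, solve each sub-task, and refine until the audit passes."""
    results = {}
    for subtask in decompose(problem):
        answer = solve(subtask)
        for _ in range(max_rounds):
            if audit(subtask, answer):
                break
            answer = solve(subtask)  # iterative refinement on audit failure
        results[subtask] = answer
    return results
```

The design point the paper emphasizes is the separation of roles: the solver never grades its own work, and the audit-and-retry loop catches errors before a sub-answer is committed.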

Results from the experiments demonstrate clear performance gains across multiple state-of-the-art models. As detailed in Table 2, Gemini-2.5 Pro achieved the highest score of 93.6 with SVE and the multi-agent system, followed closely by Claude-3.7 Sonnet at 93.2, GPT-o3 at 89.2, and Qwen3-VL at 78.3. The data shows that SVE alone provided substantial improvements, particularly on visually intensive problems like P3, where Gemini-2.5 Pro's normalized score rose from 13.3 to 19.0. The multi-agent system further enhanced performance, especially on complex, multi-step problems such as P3 and P7, by reducing errors through specialized solvers and verification steps. The LLM-as-a-Judge similarity scores, which measure semantic alignment with reference solutions, generally correlated with rubric-based scores, confirming that the improvements reflect genuine gains in correctness.
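To make the idea of a similarity score concrete, here is a minimal sketch of scoring semantic alignment on a 0-to-1 scale. The paper's actual metric prompts an LLM to judge a solution against the reference; the token-level Jaccard overlap below is only a simple illustrative stand-in, not the method used in the study.

```python
def token_jaccard(candidate: str, reference: str) -> float:
    """Toy similarity score: overlap of word sets divided by their union."""
    a = set(candidate.lower().split())
    b = set(reference.lower().split())
    if not a and not b:
        return 1.0  # two empty answers are trivially identical
    return len(a & b) / len(a | b)
```

An LLM judge, unlike this word-overlap proxy, can credit a correct answer phrased in entirely different words, which is why the paper uses it to cross-check the rubric-based scores.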

The implications of this research extend beyond academic benchmarks, offering a pathway for AI to assist in real-world chemistry tasks that require deep multimodal reasoning. By addressing the visual perception bottleneck, the approach could enable AI to help scientists analyze complex diagrams, interpret spectral data, or design new molecules more efficiently. The multi-agent framework also provides a blueprint for tackling other scientific domains where collaboration and verification are crucial, such as physics or biology. However, the authors caution that direct claims of medal achievement are limited by variations in exam difficulty and context, and the system's performance relies on structured visual inputs that may not always be available in practical scenarios.

Limitations of the study include the reliance on the 2025 IChO exam, which, while chosen to minimize data contamination, still requires assumptions about human performance baselines from earlier years. The paper notes that estimated medal thresholds are derived from 2021 statistics and should be interpreted with caution due to annual differences in exam difficulty. Additionally, the SVE component depends on tool-generated encodings of visual content, which may not capture all nuances of original diagrams, and the multi-agent system's effectiveness varies with the underlying model's capabilities, as seen with weaker models like Qwen3-VL achieving lower absolute scores despite relative improvements. Future work could explore adapting these methods to more diverse or real-time chemistry applications without structured guidance.

Original Source

Read the complete research paper

View on arXiv

About the Author

Guilherme A.

Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.

Connect on LinkedIn