AI Models Fail to Predict Smells From Molecular Structure

TL;DR

A new benchmark shows LLMs rely on chemical names, not molecular data, to predict odors, exposing a key gap in AI sensory understanding.

Large language models, the AI systems behind chatbots and coding assistants, are now being tested on a surprising new frontier: the sense of smell. Researchers have introduced the Olfactory Perception (OP) benchmark, an evaluation of whether these models can reason about odors from chemical information. The benchmark includes 1,010 questions across eight categories, such as identifying whether a molecule has a smell, predicting odor descriptors like 'floral' or 'fruity,' judging intensity and pleasantness, comparing mixtures, and even determining which olfactory receptors a molecule activates. This work, detailed in a paper from Yale University and other institutions, aims to bridge the gap between AI's text-based knowledge and the complex world of sensory perception, with implications for fields like fragrance design and environmental monitoring.

The key finding from the study is that current AI models show emerging but limited capabilities in olfactory reasoning. The best-performing model, Claude Opus 4.6 with maximum reasoning settings, achieved an overall accuracy of 64.4% when using compound names as prompts. However, performance varied widely across tasks: simple tasks like odor classification reached up to 92% accuracy, while harder tasks like predicting odor similarity of mixtures or olfactory receptor activation scored as low as 35% and 52.8%, respectively. A striking pattern emerged when comparing different input formats: models consistently performed better with compound names than with isomeric SMILES, a standard notation for molecular structure. The improvement ranged from 2.4 to 18.9 percentage points, with an average gain of about 7 points, suggesting that AI relies more on lexical associations from training data than on genuine structural reasoning about molecules.
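
To make the name-versus-SMILES comparison concrete, here is a minimal Python sketch of how such a gap could be computed for one model. The function names and the per-question outcomes are hypothetical and are not taken from the paper; the only idea borrowed from the study is scoring the same questions under both input formats and reporting the difference in percentage points.

```python
def accuracy(results):
    """Fraction of correct answers in a list of booleans."""
    return sum(results) / len(results)


def name_vs_smiles_gap(name_results, smiles_results):
    """Accuracy gain, in percentage points, of name prompts over SMILES prompts."""
    return 100.0 * (accuracy(name_results) - accuracy(smiles_results))


# Hypothetical outcomes for the same ten questions (True = correct answer).
# These are illustrative values, not data from the benchmark.
name_results = [True, True, True, False, True, True, False, True, True, True]
smiles_results = [True, False, True, False, True, True, False, True, False, True]

gap = name_vs_smiles_gap(name_results, smiles_results)
print(f"Name prompts beat SMILES prompts by {gap:.1f} percentage points")
```

A positive gap means the model did better when it saw the compound's name than when it saw only its structure, which is the pattern the study reports.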

The methodology involved evaluating 21 model configurations from major providers like OpenAI, Google, Anthropic, Meta, xAI, and DeepSeek. Each question in the benchmark was presented in two formats: one using isomeric SMILES strings and another using common compound names, allowing researchers to isolate the effect of molecular representation. The tasks were grounded in established olfactory science datasets, ensuring objective ground-truth answers. For example, odor classification data came from a curated set of molecules with molecular weight constraints, while odor descriptor tasks used resources like the IFRA fragrance ingredient glossary. The researchers also conducted a multilingual evaluation, translating a subset of tasks into 21 languages to test whether olfactory knowledge is language-specific. They found that aggregating predictions across languages improved performance, with an AUROC of 0.86 for the best ensemble model.
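
The cross-lingual aggregation can be illustrated with a short Python sketch. The article does not say how predictions were combined, so the simple per-question mean below is an assumption, and the labels and scores are made up for illustration. AUROC is computed directly from its definition: the probability that a randomly chosen positive example scores higher than a randomly chosen negative one.

```python
from statistics import mean


def auroc(labels, scores):
    """AUROC as the chance a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def ensemble_scores(per_language_scores):
    """Average each question's score across languages (simple mean; an assumption)."""
    return [mean(column) for column in zip(*per_language_scores.values())]


# Hypothetical ground truth (1 = has an odor) and per-language model scores.
labels = [1, 0, 1, 0, 1, 0]
per_language_scores = {
    "en": [0.9, 0.4, 0.3, 0.2, 0.8, 0.6],
    "fr": [0.7, 0.5, 0.6, 0.1, 0.4, 0.3],
    "ko": [0.6, 0.2, 0.7, 0.5, 0.5, 0.4],
}

single = {lang: auroc(labels, s) for lang, s in per_language_scores.items()}
combined = auroc(labels, ensemble_scores(per_language_scores))
print(single, combined)
```

In this toy data each language makes different mistakes, so averaging ranks every positive above every negative even though no single language does. That is one way an ensemble can outperform its members when the knowledge is complementary.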

Analysis of the results reveals significant gaps in AI's olfactory understanding. Models excelled at tasks involving well-known compounds or food sources, such as the smell identification test where they reached up to 80% accuracy in identifying odors like melon from molecular mixtures. However, they struggled with tasks requiring deeper molecular reasoning. For instance, in predicting odor similarity of mixtures, models often used molecular overlap as a proxy for perceptual similarity, leading to near-zero accuracy when similar mixtures shared few molecules, despite human perception recognizing them as strongly similar. In multilingual tests, English and related languages like French and Spanish performed best, while non-Indo-European languages like Korean and Chinese showed lower scores, though cross-lingual ensembles boosted performance, indicating complementary knowledge across languages.
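
The overlap shortcut described above can be sketched in a few lines of Python. Jaccard overlap is used here as one plausible way to measure shared molecules; the article does not specify how the models weighed overlap, and the mixture contents are placeholder names rather than benchmark data.

```python
def jaccard_overlap(mixture_a, mixture_b):
    """Shared molecules divided by all distinct molecules across both mixtures."""
    a, b = set(mixture_a), set(mixture_b)
    if not a and not b:
        return 1.0  # two empty mixtures are treated as identical
    return len(a & b) / len(a | b)


# Placeholder molecule names (hypothetical, for illustration only).
mixture_1 = ["mol_A", "mol_B", "mol_C", "mol_D"]
mixture_2 = ["mol_A", "mol_B", "mol_C", "mol_E"]  # shares three molecules with mixture_1
mixture_3 = ["mol_F", "mol_G", "mol_H"]           # shares none with mixture_1

high_overlap = jaccard_overlap(mixture_1, mixture_2)
no_overlap = jaccard_overlap(mixture_1, mixture_3)
print(high_overlap, no_overlap)
```

A predictor that relies on this score alone would rate mixture_1 and mixture_3 as completely dissimilar. If people nonetheless perceive the two as smelling alike, the proxy fails, which matches the failure mode the study describes.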

The implications of this research are both scientific and practical. From a scientific perspective, it highlights how AI models encode chemical knowledge: primarily through text-based associations rather than structural understanding. This has real-world applications: AI capable of accurate olfactory reasoning could accelerate fragrance and flavor development, help identify malodorous contaminants, and improve environmental monitoring. However, the study also identifies limitations. The benchmark uses discrete response formats and standardized descriptors, which may not capture the full richness of human odor perception, including cultural and individual variations. Additionally, safety alignment in some models, like Claude Opus 4.6, led to refusals on questions about hazardous compounds, illustrating a tension between safety and scientific evaluation. Future work could expand the benchmark to include more mixtures, concentration effects, and multimodal inputs, while developing methods to encourage true structural reasoning in AI systems.

About the Author

Guilherme A.

Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.
