In the relentless pursuit of artificial intelligence that can truly understand the world, the computer vision community has long relied on a single, dominant yardstick: performance on ImageNet-1K. This benchmark, built from roughly a million web-scraped photos, has for over a decade dictated which models are considered state-of-the-art, guiding billions in research and development. However, a groundbreaking new study reveals a critical and widening chasm between success on this general-purpose leaderboard and a model's actual utility in real-world scientific applications. The research, led by Samuel Stevens at The Ohio State University and presented at NeurIPS 2025, systematically demonstrates that ImageNet accuracy has become a dangerously unreliable proxy for performance in fields like ecology, where AI is increasingly tasked with monitoring biodiversity and understanding complex ecosystems. This upends a foundational assumption in machine learning and signals an urgent need for a new generation of evaluation frameworks grounded in the messy, nuanced reality of scientific inquiry.
The study's methodology is both comprehensive and revealing. The researchers evaluated 46 modern vision transformer checkpoints, spanning supervised, self-supervised, and image-text pre-training objectives, across three publicly released ecology tasks: long-tail plant species identification from herbarium specimens, animal species classification from camera-trap imagery, and individual re-identification of beluga whales. By calculating the Spearman rank correlation coefficient between each model's ImageNet-1K top-1 accuracy and its performance on these scientific tasks, they uncovered a stark 'ranking cliff.' For models that have surpassed the now-common 75% accuracy threshold on ImageNet, the correlation between ImageNet ranking and ecological task ranking plummets below 0.25. In practical terms, a rank correlation that weak means that among high-performing models on the general benchmark, a model's ImageNet standing carries almost no information about how it will rank on critical scientific work, rendering the classic leaderboard nearly useless for guiding model selection in these domains.
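The rank-correlation analysis behind this 'ranking cliff' is easy to reproduce in outline. The sketch below uses synthetic scores (not the paper's actual checkpoint results) to compute Spearman's ρ over all models and again over only the above-75% frontier, illustrating how restricting to top models can collapse a correlation:

```python
import random

def ranks(xs):
    """Rank positions (0 = lowest). Assumes no ties, which holds
    for continuous accuracy scores."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0] * len(xs)
    for rank, i in enumerate(order):
        r[i] = rank
    return r

def spearman(xs, ys):
    """Spearman's rho: the Pearson correlation of the rank vectors."""
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5

random.seed(0)
# Synthetic stand-ins for 46 checkpoints: ImageNet top-1 accuracy and
# an ecology-task score that tracks it only loosely.
imagenet = [random.uniform(0.60, 0.88) for _ in range(46)]
ecology = [0.5 * a + random.gauss(0, 0.05) for a in imagenet]

rho_all = spearman(imagenet, ecology)
frontier = [i for i, a in enumerate(imagenet) if a > 0.75]
rho_frontier = spearman([imagenet[i] for i in frontier],
                        [ecology[i] for i in frontier])
print(f"rho (all): {rho_all:.2f}, rho (>75% frontier): {rho_frontier:.2f}")
```

Restricting attention to a narrow band of top scorers typically shrinks correlation (the classic range-restriction effect), which is one mechanism behind the cliff the study observes.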
The core finding is quantified with striking clarity. Across all 46 checkpoints, ImageNet-1K top-1 accuracy explains only 34% of the variance in performance on the new, consolidated ecological benchmark introduced by the paper, called BioBench. The rank concordance between ImageNet and BioBench is a modest ρ = 0.55 overall, meaning the model preferred by ImageNet is actually worse on the scientific benchmark roughly 22% of the time. This mis-ranking problem intensifies at the frontier of model development: among models above 75% on ImageNet, the rank concordance drops further to ρ = 0.42, indicating the supposed 'best' general model is mis-ranked a staggering 30% of the time when evaluated on ecological tasks. The paper attributes this failure to two intertwined causes: a fundamental distribution mismatch between curated web photos and scientific imagery from drones, microscopes, and camera traps, and the fine-grained, long-tailed nature of scientific classification, which involves distinguishing thousands of species with subtle visual differences, a challenge absent from ImageNet's 1,000 broad, common object classes.
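The 'worse roughly 22% of the time' figure corresponds to the fraction of discordant model pairs: pairs where the model ranked higher by the proxy benchmark is strictly worse on the target benchmark. A minimal sketch of that calculation, using a tiny hypothetical score list rather than the paper's data:

```python
from itertools import combinations

def misrank_rate(proxy, target):
    """Fraction of model pairs where the model the proxy benchmark
    prefers is strictly worse on the target benchmark."""
    pairs = list(combinations(range(len(proxy)), 2))
    discordant = sum(
        1 for i, j in pairs
        if (proxy[i] - proxy[j]) * (target[i] - target[j]) < 0
    )
    return discordant / len(pairs)

# Four hypothetical models: the proxy orders them 1 < 2 < 3 < 4, but
# the target swaps the top two, so exactly 1 of the 6 pairs is
# mis-ranked.
print(misrank_rate([1, 2, 3, 4], [1, 2, 4, 3]))  # 1/6 ≈ 0.167
```

When the two benchmarks agree perfectly the rate is 0; when they are exact reverses of each other it is 1, so values near 0.22 or 0.30 signal substantial practical disagreement.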
To address this evaluative crisis, the researchers built and open-sourced BioBench, a unified, application-driven vision benchmark specifically for ecology. BioBench consolidates nine public tasks spanning four taxonomic kingdoms (animals, plants, fungi, protists) and six distinct image acquisition modalities, including drone RGB video frames, web video, micrographs, in-situ photos, specimen photos, and camera-trap frames, totaling 3.1 million images. The benchmark is designed for minimal integration overhead, requiring models to implement only a simple frozen embedding function. It uses a uniform linear probing protocol to report class-balanced macro-F1 scores, with domain-specific metrics for tasks like FishNet and FungiCLEF. Remarkably, evaluating a ViT-L model across the entire suite takes only about six hours on a single NVIDIA A6000 GPU. In the paper's evaluations, models like CLIP, SigLIP, and SigLIP 2 set new state-of-the-art scores on BioBench, despite not always leading on ImageNet, further underscoring the benchmark's ability to reveal different, more application-relevant capabilities.
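The choice of class-balanced macro-F1 matters for long-tailed data: every class contributes equally to the score, so a model that collapses onto common species is penalized. A plain-Python sketch of the metric (in practice a standard implementation such as scikit-learn's `f1_score(average='macro')` would be used):

```python
from collections import defaultdict

def macro_f1(y_true, y_pred):
    """Per-class F1, averaged with equal weight per class, so rare
    (long-tail) classes count as much as common ones."""
    tp, fp, fn = defaultdict(int), defaultdict(int), defaultdict(int)
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1
            fn[t] += 1
    classes = set(y_true) | set(y_pred)
    f1s = []
    for c in classes:
        denom = 2 * tp[c] + fp[c] + fn[c]
        f1s.append(2 * tp[c] / denom if denom else 0.0)
    return sum(f1s) / len(f1s)

# A classifier that always predicts the common class reaches 75%
# accuracy here, but macro-F1 exposes its total failure on the rare
# class: F1 is 6/7 for "common" and 0 for "rare".
y_true = ["common", "common", "common", "rare"]
y_pred = ["common", "common", "common", "common"]
print(macro_f1(y_true, y_pred))  # (6/7 + 0) / 2 ≈ 0.429
```

This is why a leaderboard built on plain accuracy can look healthy while a model ignores exactly the rare species an ecologist most needs to find.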
The implications of this work are profound, extending far beyond computational ecology. It provides a concrete, reproducible template for building 'grounded' benchmarks in any data-rich scientific domain, from medicine to manufacturing, where distribution shift and long-tail class structures are the norm. The study forcefully argues that the era of relying on web-photo leaderboards as a universal proxy for AI progress in science is over. For conservationists, ecologists, and developers building AI for environmental monitoring, BioBench offers the first systematic tool to select models based on their actual performance across the multifaceted demands of real-world workflows. It represents a pivotal shift from proxy-driven evaluation to mission-driven assessment, ensuring that advances in AI architecture translate directly into utility for the pressing scientific and environmental questions they are meant to address.
Despite its significance, the authors acknowledge several limitations. BioBench's scope is currently limited to ecology; domains like medical imaging or industrial inspection emphasize different tasks such as detection, segmentation, or calibration, which would require tailored benchmarks. The evaluation protocol relies on frozen features and linear probing, which, while excellent for isolating representation quality, may underestimate the performance gains possible with task-specific fine-tuning. Furthermore, the primary metric of macro-F1, chosen to reward performance on long-tail classes, may not align with all application needs, where metrics like precision-at-a-specific-recall could be more operationally relevant. Nonetheless, BioBench stands as a powerful proof-of-concept, demonstrating unequivocally that ImageNet-driven model choice is unreliable for scientific imagery and providing a clear, minimal recipe for evaluating AI where it truly matters.
About the Author
Guilherme A.
Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.