AI Evaluates Itself Without Costly Testing

Evaluating large language models (LLMs) like GPT-4 is notoriously expensive, requiring massive computational resources and time. This cost slows innovation and limits access for researchers with constrained budgets. A new study introduces a method that drastically cuts it by selecting small, representative subsets of test data, so that a model can be assessed accurately on a fraction of the full benchmark.

The approach, called Scales++, is item-centric: it selects test items by their intrinsic cognitive demands, such as logical reasoning or knowledge in specific domains, rather than by how past models performed on them. This allows the selection of a tiny subset that preserves the predictive fidelity of the full benchmark. For example, on the Open LLM Leaderboard, which includes over 28,000 test items, Scales++ achieved a mean absolute error of just 2.9% using only 0.5% of the data, closely approximating the scores obtained from full-scale evaluation.
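To make the error metric concrete, here is a minimal Python sketch of how a mean absolute error between subset-based estimates and full-benchmark scores could be computed. The function name and the three model scores are hypothetical illustrations, not figures from the study; the only numbers taken from the text are the 0.5% sampling rate and the rough count of 28,000 items.

```python
def mean_absolute_error(estimated, actual):
    """Average absolute gap between estimated and full-benchmark scores."""
    if len(estimated) != len(actual) or not estimated:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(e - a) for e, a in zip(estimated, actual)) / len(estimated)


# Hypothetical accuracy scores (in percent) for three models.
full_benchmark_scores = [62.0, 55.5, 71.0]  # scored on every item
subset_estimates = [64.5, 53.0, 74.0]       # scored on the small subset only

mae = mean_absolute_error(subset_estimates, full_benchmark_scores)

# Subset size implied by a 0.5% sampling rate on about 28,000 items.
subset_size = round(28_000 * 0.005)
```

Under these assumptions, a 0.5% sample of roughly 28,000 items works out to about 140 items per evaluated model.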

The team annotated each test item against a pre-defined rubric that scores 16 cognitive dimensions, such as attention scan and quantitative reasoning, based on frameworks from prior work. The resulting scores serve as an embedding for each item, and the embeddings are then clustered to identify a diverse subset. To further reduce costs, the researchers developed Scales++ Lite, a graph neural network (GNN) predictor that estimates cognitive demands without expensive API calls, cutting annotation time to under 20 minutes for large datasets.
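The cluster-then-select idea can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' implementation: the article does not say which clustering algorithm the study uses, so plain k-means stands in for it here, and the function names, the 0-5 score range, and the randomly generated items are all hypothetical.

```python
import random

NUM_DIMENSIONS = 16  # the article reports 16 cognitive dimensions per item


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def select_representative_items(vectors, k, iterations=20, seed=0):
    """Run plain k-means and return one item index per non-empty cluster."""
    rng = random.Random(seed)
    dim = len(vectors[0])
    # Start from k distinct items chosen at random.
    centroids = [list(vectors[i]) for i in rng.sample(range(len(vectors)), k)]
    clusters = []
    for _ in range(iterations):
        # Assignment step: attach every item to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for idx, vec in enumerate(vectors):
            nearest = min(range(k), key=lambda c: squared_distance(vec, centroids[c]))
            clusters[nearest].append(idx)
        # Update step: move each centroid to the mean of its members.
        for c, members in enumerate(clusters):
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [
                    sum(vectors[i][d] for i in members) / len(members)
                    for d in range(dim)
                ]
    # Representative = the member closest to its cluster centroid.
    return sorted(
        min(members, key=lambda i: squared_distance(vectors[i], centroids[c]))
        for c, members in enumerate(clusters)
        if members
    )


# Hypothetical data: 200 items, each scored 0-5 on every dimension.
data_rng = random.Random(42)
item_vectors = [
    [data_rng.randint(0, 5) for _ in range(NUM_DIMENSIONS)] for _ in range(200)
]
subset = select_representative_items(item_vectors, k=10)
```

Picking the item nearest each centroid keeps the subset spread across distinct demand profiles, which is the property the article attributes to the clustering step. A model would then be scored only on the items in `subset`.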

The results show that Scales++ outperforms traditional model-centric methods, which require extensive historical data and upfront computation. In experiments, it reduced initial evaluation costs by a factor of more than 18 while maintaining low error rates across different model architectures and sizes, including large models with 65 billion parameters. The method's robustness was validated on tasks like mathematical reasoning and commonsense QA, where it consistently provided reliable performance estimates even with minimal data.

In practical terms, this innovation makes AI evaluation more accessible, allowing smaller teams and organizations to benchmark models efficiently. It could accelerate AI development by enabling rapid testing during model training and deployment, without sacrificing accuracy. For instance, evaluating a 70-billion-parameter model could take hours instead of days, democratizing high-quality assessment in fields like education and research.

Limitations include potential dataset-specific effects: the paper's analysis notes that performance varied slightly across task categories such as mathematics. Additionally, the method assumes that cognitive annotations capture all relevant item properties, which may not hold for highly specialized or novel benchmarks. Future work could explore adapting the approach to emerging AI tasks without extensive recalibration.

About the Author

Guilherme A.