In the rapidly evolving landscape of artificial intelligence, Large Language Models (LLMs) like GPT-4o and Claude 3.5 have become indispensable tools for tasks ranging from coding to creative writing. Yet a groundbreaking study from East China Normal University reveals a critical flaw in their reasoning abilities that could hinder their deployment in high-stakes domains. The research, detailed in the paper "Improving the Derivation Capability of Large Language Models," introduces the concept of Derivation Capability (DC): the ability to recognize and apply abstract rules governing how outputs should change when inputs are systematically altered. This capability, which humans employ effortlessly in scenarios like adjusting age calculations or reversing paths in graphs, remains underexplored in AI systems, posing risks for applications in data analysis, legal reasoning, and scientific research, where consistency under transformation is paramount.
The study's methodology centers on the DEVAL framework, a rigorous system designed to evaluate and enhance DC in LLMs. DEVAL formalizes Derivation Relations (DRs) as pairs of transformations: one on the input domain (T) and another on the output domain (R), requiring that when inputs change according to T, outputs must correspondingly change per R. For instance, in a path-finding task, swapping the start and end nodes should reverse the path sequence. To quantify performance, the researchers defined the Derivation Capability Score (DCS), calculated by sampling input pairs and measuring how often LLMs adhere to the expected output relations. They applied this to five popular LLMs (GPT-3.5, GPT-4o, Claude 3.5, Qwen, and Kimi) along with the reasoning-focused O1-mini, across seven tasks including logic puzzles, mathematical integrals, and algorithmic problems, using datasets like AQA-Bench to ensure diversity and realism in the testing scenarios.
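To make the evaluation procedure concrete, here is a minimal sketch of how a DCS estimate could be computed for the path-finding example. The transformation helpers and the `solve` callback (which stands in for a call to an LLM) are illustrative assumptions, not the paper's actual implementation.

```python
import random

def swap_endpoints(problem):
    # Input transformation T: swap the start and end nodes of a path query.
    return {**problem, "start": problem["end"], "end": problem["start"]}

def reverse_path(path):
    # Output transformation R: the expected answer is the original path, reversed.
    return list(reversed(path))

def derivation_capability_score(problems, solve, n_samples=100):
    """Estimate the DCS: the fraction of sampled inputs where the model's
    answer on the transformed input T(x) matches R applied to its answer on x."""
    samples = random.sample(problems, min(n_samples, len(problems)))
    hits = 0
    for problem in samples:
        answer = solve(problem)                   # model output on x
        derived = solve(swap_endpoints(problem))  # model output on T(x)
        if derived == reverse_path(answer):       # compare against R(y)
            hits += 1
    return hits / len(samples)
```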
Results from the evaluation paint a sobering picture of current AI limitations. Mainstream LLMs exhibited only moderate DC, with O1-mini leading at an average DCS of 69.8%, followed by Claude 3.5 at 59.6% and GPT-4o at 52.5%, while GPT-3.5 lagged at 28.9%. Performance varied significantly by DR type: identity transformations (where outputs remain unchanged) saw higher scores, but general and task-specific rules, such as symmetry in graphs or logical equivalences, dropped to averages of 49.1% and 34.3%, respectively. Error analysis categorized failures into three types: DR-Unaware (the LLM ignores the input change), DR-Mislocalized (it detects the change but misjudges the required output adjustment), and DR-Misapplied (it correctly identifies the rule but errs in execution), with DR-Misapplied accounting for over 54% of errors. For example, in math tasks, models often integrated functions correctly at first but failed to adjust their outputs when the inputs were modified, suggesting a reliance on surface-level patterns rather than deep, abstract reasoning.
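One way to picture this taxonomy is as a triage over a failed case, comparing the model's new answer with its old answer, the expected answer, and the rule it claims to be applying. The study performs this attribution by analyzing LLM-generated reasoning chains; the heuristic below is only a hypothetical simplification of that process.

```python
def classify_derivation_error(original_answer, derived_answer,
                              expected_answer, stated_rule, true_rule):
    """Heuristic triage mirroring the paper's three error types."""
    if derived_answer == expected_answer:
        return "correct"
    if derived_answer == original_answer:
        return "DR-Unaware"        # the input change was ignored outright
    if stated_rule != true_rule:
        return "DR-Mislocalized"   # change noticed, wrong output adjustment chosen
    return "DR-Misapplied"         # right rule identified, but executed incorrectly
```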
The implications of these findings extend far beyond academic curiosity, touching on real-world applications in fields like software development, where code must adapt to refactoring, and healthcare, where diagnostic models need to handle variations in patient data. To address these shortcomings, the researchers proposed Derivation Prompting (DP), a novel prompt engineering technique that explicitly guides LLMs through three steps: explaining the input changes, describing the corresponding output transformations, and applying those changes to previous answers. DP achieved an average DCS improvement of 15.2% across all models, outperforming alternatives like Chain-of-Thought and Few-Shot prompting, and corrected many DR-Misapplied errors by reinforcing the connection between related problems. This approach not only boosts performance but also suggests pathways for making AI systems more reliable and interpretable in dynamic environments, potentially reducing errors in autonomous systems and educational tools.
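The three DP steps translate naturally into a prompt template. The wording below is an illustrative reconstruction based on the steps described above, not the paper's exact prompt.

```python
def derivation_prompt(original_problem, original_answer, transformed_problem):
    """Build a Derivation Prompting query that walks the model through the
    three steps: explain the input change, describe the corresponding output
    change, and apply it to the previous answer."""
    return (
        f"You previously solved this problem:\n{original_problem}\n"
        f"Your answer was:\n{original_answer}\n\n"
        f"Now consider this modified problem:\n{transformed_problem}\n\n"
        "Step 1: Explain how the new input differs from the original input.\n"
        "Step 2: Describe how that input change should change the output.\n"
        "Step 3: Apply that change to your previous answer and give the new answer."
    )
```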
Despite these advances, the study acknowledges several limitations. The DEVAL framework's datasets, while covering diverse tasks, do not exhaust all real-world scenarios, and error attribution relies on LLM-generated reasoning chains, which may lack the precision of human expert analysis. Additionally, attempts to automate DR generation with LLMs showed that only 38.6% of the generated rules were formally correct, underscoring the need for human oversight. Comparisons with supervised fine-tuning revealed that while it improved robustness on identity transformations, it fell short of DP in enhancing abstract reasoning. Future work aims to expand dataset coverage using AI-generated examples and integrate symbolic methods for better interpretability, ultimately striving to bridge the gap between human-like reasoning and machine intelligence in an era where AI's role in critical decision-making continues to grow.
Original Source
Read the complete research paper
About the Author
Guilherme A.
Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.
Connect on LinkedIn