Large language models (LLMs) like GPT-4o and Llama-3.3 struggle with fundamental logical reasoning about obligations and permissions, displaying inconsistencies and human-like biases that challenge their reliability in real-world applications. A new study from Keio University and the University of Tokyo systematically evaluated these AI systems on normative reasoning—the ability to handle concepts like 'must,' 'may,' and 'not allowed'—revealing that even top-performing models frequently misinterpret basic logical patterns.
The researchers found that LLMs often fail to infer simple relationships between obligations and permissions. For example, given the premise 'You must take care of your health,' models like GPT-4o incorrectly concluded that 'You can choose to take care of your health' does not follow, misinterpreting 'can' as indicating optionality rather than permission. This inconsistency was particularly evident in the Mu-Mi pattern (obligation implies permission), where most models performed poorly and only Llama-3.3-70B-Instruct achieved high accuracy. The study also highlighted issues with controversial patterns like Ross's paradox, where models tended to accept the inference from 'You must post the letter' to 'You must post the letter or burn it,' a step that standard deontic logic licenses but that many people find intuitively unacceptable.
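The two patterns above can be made concrete with a small model checker. The sketch below assumes the possible-worlds semantics of standard deontic logic, where 'must' means true in every ideal alternative and 'may' means true in at least one. The article does not say which formal system the study used to define expected answers, so treat this as an illustration, not the researchers' method. The tuple encoding of formulas and the names `holds`, `serial_relations`, and `find_countermodel` are invented for the example, and the search covers only two-world models, so a result of `None` means no small countermodel exists and is not a general proof.

```python
from itertools import product

WORLDS = (0, 1)        # assumption: two worlds are enough for these toy checks
ATOMS = ("p", "q")

def holds(formula, world, relation, valuation):
    """Evaluate a formula (nested tuples) at one world of a Kripke model."""
    op = formula[0]
    if op == "atom":
        return valuation[(formula[1], world)]
    if op == "not":
        return not holds(formula[1], world, relation, valuation)
    if op == "or":
        return (holds(formula[1], world, relation, valuation)
                or holds(formula[2], world, relation, valuation))
    successors = [v for (u, v) in relation if u == world]
    if op == "O":  # obligatory: true in every ideal alternative
        return all(holds(formula[1], v, relation, valuation) for v in successors)
    if op == "P":  # permitted: true in at least one ideal alternative
        return any(holds(formula[1], v, relation, valuation) for v in successors)
    raise ValueError(f"unknown operator: {op}")

def serial_relations():
    """Yield every relation in which each world has at least one successor."""
    pairs = [(u, v) for u in WORLDS for v in WORLDS]
    for bits in product([False, True], repeat=len(pairs)):
        relation = {pair for pair, keep in zip(pairs, bits) if keep}
        if all(any(u == w for (u, _) in relation) for w in WORLDS):
            yield relation

def find_countermodel(premise, conclusion):
    """Return (world, relation, valuation) refuting the inference, or None."""
    keys = [(atom, w) for atom in ATOMS for w in WORLDS]
    for relation in serial_relations():
        for values in product([False, True], repeat=len(keys)):
            valuation = dict(zip(keys, values))
            for w in WORLDS:
                if (holds(premise, w, relation, valuation)
                        and not holds(conclusion, w, relation, valuation)):
                    return w, relation, valuation
    return None

p, q = ("atom", "p"), ("atom", "q")

# Obligation implies permission: 'must p' therefore 'may p'.
print(find_countermodel(("O", p), ("P", p)))             # None
# Ross's paradox: 'must p' therefore 'must (p or q)'.
print(find_countermodel(("O", p), ("O", ("or", p, q))))  # None
# The converse, 'may p' therefore 'must p', does have a countermodel.
print(find_countermodel(("P", p), ("O", p)) is not None) # True
```

Under these assumptions neither inference has a countermodel, which is why Ross's paradox counts as controversial: the formal semantics admits the step even though everyday intuition rejects it.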
To assess these capabilities, the team created a new dataset covering 11 basic logic patterns and 8 syllogistic inferences in both normative (deontic) and epistemic (knowledge-based) domains. They tested five models (GPT-4o, GPT-4o-mini, Llama-3.1-8B-Instruct, Llama-3.3-70B-Instruct, and Phi-4) using zero-shot, few-shot, and chain-of-thought prompting. The experiments involved 640 logic problems and 480 syllogistic problems, and each model was scored on accuracy against the expected answers. Performance varied significantly by pattern, with negation-heavy inferences such as modus tollens and denying the antecedent proving especially challenging.
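The two negation-heavy patterns named above differ in their expected answers: modus tollens is a valid form, while denying the antecedent is a classic fallacy. A truth-table check makes the difference visible. This is a minimal sketch of the propositional skeletons only. The study's items are written in natural language with deontic or epistemic wording, and the helper names `implies` and `is_valid` are invented for the example.

```python
from itertools import product

def implies(a, b):
    """Material conditional: 'if a then b'."""
    return (not a) or b

def is_valid(premises, conclusion, n_vars=2):
    """Valid means no assignment makes every premise true and the conclusion false."""
    for values in product([False, True], repeat=n_vars):
        if all(prem(*values) for prem in premises) and not conclusion(*values):
            return False
    return True

# Modus tollens: if p then q; not q; therefore not p.
modus_tollens = is_valid(
    premises=[lambda p, q: implies(p, q), lambda p, q: not q],
    conclusion=lambda p, q: not p,
)

# Denying the antecedent: if p then q; not p; therefore not q.
denying_antecedent = is_valid(
    premises=[lambda p, q: implies(p, q), lambda p, q: not p],
    conclusion=lambda p, q: not q,
)

print(modus_tollens, denying_antecedent)  # True False
```

A model that answers 'follows' for both forms, or 'does not follow' for both, gets exactly one of them wrong, which is one reason accuracy is most informative when reported per pattern.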
Analysis revealed that LLMs exhibit content effects, performing better on problems whose conclusions align with common sense. For instance, models achieved higher accuracy on congruent content (e.g., 'It is not permissible to not eat breakfast') than on incongruent content (e.g., 'It is not permissible to not care for children') or nonsense sentences. This mirrors the belief bias seen in human reasoning, where believable conclusions are accepted more readily regardless of logical validity. Domain specificity also played a role: in syllogistic tasks, models generally performed better on normative reasoning than on epistemic reasoning under zero-shot conditions, but this advantage reversed in logic tasks, indicating that relative difficulty depends on the specific inference type.
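A content effect of this kind shows up when accuracy is broken down by content type instead of being reported as a single number. The sketch below shows one way to do that grouping. The records are placeholders made up for the example and are not results from the study, and the field names `content`, `expected`, and `predicted` are assumptions about how such an evaluation log might be laid out.

```python
from collections import defaultdict

# Hypothetical placeholder records, NOT data from the study.
records = [
    {"content": "congruent",   "expected": "valid",   "predicted": "valid"},
    {"content": "congruent",   "expected": "invalid", "predicted": "invalid"},
    {"content": "incongruent", "expected": "valid",   "predicted": "invalid"},
    {"content": "incongruent", "expected": "invalid", "predicted": "invalid"},
    {"content": "nonsense",    "expected": "valid",   "predicted": "invalid"},
]

def accuracy_by_group(rows, key):
    """Return the fraction of rows answered correctly, per value of `key`."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for row in rows:
        group = row[key]
        totals[group] += 1
        if row["predicted"] == row["expected"]:
            correct[group] += 1
    return {group: correct[group] / totals[group] for group in totals}

scores = accuracy_by_group(records, "content")
print(scores)
```

The same function works for the domain comparison if each record also carries a field such as `domain` with values like `normative` and `epistemic`.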
These findings matter because normative reasoning is crucial for AI systems deployed in social, ethical, or legal contexts, where understanding obligations and permissions is essential. Inconsistencies could lead to unreliable decisions in applications like automated compliance checks or ethical AI assistants. The study's limitations include its focus on controlled tasks rather than real-world deliberation and the rapid evolution of LLMs, which may alter these results over time. However, the work underscores the need for improved logical consistency in AI, highlighting that current models, despite their advances, still struggle with reasoning patterns that humans find intuitive.
About the Author
Guilherme A.
Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.