
AI Essay Graders Fail Basic Comprehension Tests

Automated scoring systems award high marks to nonsensical 'word soups' while ignoring factual errors, raising concerns about their use in high-stakes educational decisions.

AI Research
November 11, 2025
2 min read

Automated essay scoring systems are increasingly used to evaluate student writing in high-stakes assessments, from college admissions to visa applications. These AI tools promise to reduce teacher workload, but new research reveals that they often grade on superficial word patterns rather than genuine comprehension of content, a flaw that could affect life-changing decisions for millions of students.

The key finding is that state-of-the-art automated essay scoring (AES) systems treat essays as 'word soups': collections of important words whose order and context have little effect on the final score. The researchers found that removing up to 51% of an essay's least important words barely changes its score, while essays reduced to only their most important words, grammatically incoherent text by construction, still receive 85% of their original scores. Both deletion conditions are sketched below.
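
To make the 'word soup' experiment concrete, both deletion conditions can be expressed in a few lines of Python. This is a minimal sketch, not the paper's code: `score_essay` and `attributions` are hypothetical stand-ins for the AES model and the per-word importance scores (computed via integrated gradients, described next).

```python
def delete_least_important(words, attributions, fraction):
    """Drop the `fraction` of words with the lowest attribution scores,
    preserving the order of the words that remain."""
    n_drop = int(len(words) * fraction)
    drop = set(sorted(range(len(words)), key=lambda i: attributions[i])[:n_drop])
    return [w for i, w in enumerate(words) if i not in drop]


def keep_most_important(words, attributions, fraction):
    """Keep only the top-attributed words: the 'word soup' condition."""
    n_keep = int(len(words) * fraction)
    keep = set(sorted(range(len(words)), key=lambda i: -attributions[i])[:n_keep])
    return [w for i, w in enumerate(words) if i in keep]


# Hypothetical usage: compare the model's score before and after pruning.
# original = score_essay(words)
# pruned   = score_essay(delete_least_important(words, attributions, 0.51))
# soup     = score_essay(keep_most_important(words, attributions, 0.20))
```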

Using integrated gradients, a technique that identifies which words contribute most to an essay's score, the researchers analyzed two leading AES models: SkipFlow and the Memory Augmented Neural Network (MANN). They tested these systems on the widely used ASAP-AES dataset, which contains over 12,000 student essays across eight different prompts. The methodology involved systematically modifying essays to test how scoring responds to changes in coherence, factuality, and content relevance.
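
Integrated gradients attributes a model's score to each input token by accumulating gradients along a straight path from a neutral baseline (typically all-zero embeddings) to the real input. The sketch below is an illustrative PyTorch implementation under that assumption; `score_fn` is a hypothetical wrapper that maps a (seq_len, dim) embedding tensor to a scalar score, not the paper's exact setup.

```python
import torch

def integrated_gradients(score_fn, embeddings, baseline=None, steps=50):
    """Approximate the path integral of gradients from `baseline` to
    `embeddings`, returning one attribution value per token."""
    if baseline is None:
        baseline = torch.zeros_like(embeddings)  # assumed zero baseline
    grad_sum = torch.zeros_like(embeddings)
    # Riemann-sum approximation with `steps` points along the path.
    for alpha in torch.linspace(1.0 / steps, 1.0, steps):
        point = (baseline + alpha * (embeddings - baseline)).detach()
        point.requires_grad_(True)
        score_fn(point).backward()
        grad_sum += point.grad
    # Scale averaged gradients by the input-baseline difference, then
    # collapse the embedding dimension to get per-word scores.
    return ((embeddings - baseline) * grad_sum / steps).sum(dim=-1)
```

In practice, a library such as Captum provides a ready-made `IntegratedGradients` class; either way, the resulting per-word attributions are what allow words to be ranked by their influence on the score.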

The results show concerning patterns. When the researchers shuffled sentences randomly, completely destroying essay coherence, scores changed by less than 7% for SkipFlow and 1% for MANN. Even more troubling, when they inserted false statements such as 'the world is flat' into essays, scores actually increased in 70% of cases when the falsehood appeared at the beginning. The systems also showed little sensitivity to vocabulary quality: uncommon words like 'legerdemain' and 'propinquity' received negative attributions, while common words contributed positively.
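
Both perturbations are straightforward to reproduce in outline. Again a hedged sketch: `score_essay` is a hypothetical stand-in for the grader, and the naive split on periods is an illustrative simplification.

```python
import random

def shuffle_sentences(essay, seed=0):
    """Randomly reorder sentences, destroying coherence while keeping
    the essay's bag of words intact (naive split on '.')."""
    sentences = [s.strip() for s in essay.split('.') if s.strip()]
    random.Random(seed).shuffle(sentences)
    return '. '.join(sentences) + '.'


def insert_false_claim(essay, claim="The world is flat.", at_start=True):
    """Prepend or append a factually false sentence."""
    return f"{claim} {essay}" if at_start else f"{essay} {claim}"


# A robust grader should score the shuffled essay much lower and should
# never reward the added falsehood:
# assert score_essay(shuffle_sentences(essay)) < score_essay(essay)
# assert score_essay(insert_false_claim(essay)) <= score_essay(essay)
```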

These findings matter because automated scoring systems are increasingly deployed for high-stakes decisions. The research reveals that current systems focus disproportionately on specific keywords while ignoring context, coherence, and factual accuracy. As a result, students could game the system by sprinkling in favored words regardless of relevance or by repeating content, while well-reasoned essays might score lower simply for lacking the system's preferred vocabulary.

The study acknowledges its limitations: it examines word-level attribution only, rather than phrase- or paragraph-level analysis, and it tests just two models, though both represent current state-of-the-art approaches. Future work should explore whether these issues persist across other AES architectures and develop more robust validation methods for automated scoring systems used in educational assessment.

Original Source

Read the complete research paper on arXiv.

About the Author

Guilherme A.

Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.

Connect on LinkedIn