Grading programming assignments in large introductory computer science courses is a time-consuming task that often leads to inconsistencies, as human tutors struggle to apply rubrics uniformly across hundreds of submissions. While automated unit tests offer a partial solution, they typically provide only pass/fail outcomes without the nuance needed for partial credit. A new study from the University of British Columbia Okanagan explores how generative AI can step in to provide scalable, objective grading that mirrors human judgment, comparing two distinct approaches to see which better aligns with tutor evaluations.
The researchers found that AI grading, when guided by clear prompts, can produce scores close to those assigned by human teaching assistants. They tested two methods: a Direct approach, where the AI applies a rubric directly to student code, and a novel Reverse approach, where the AI first fixes errors in the code and then deduces a grade based on the nature and number of corrections. According to the paper, both methods showed promise, with the Reverse technique often providing more fine-grained assessments, especially for logic errors. For example, in a dataset of synthetic code submissions categorized as Poor, Moderate, and Good, the Reverse method with a 100-point rubric scaled to 10 points yielded average scores of 5.74 for Poor, 7.37 for Moderate, and 9.05 for Good, compared with human tutor averages of 2.27, 5.68, and 7.83, respectively, tracking the tutors most closely on higher-quality submissions.
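The two prompting strategies can be sketched as simple prompt builders. This is a minimal illustration of the Direct versus Reverse contrast described above; the function names and prompt wording are my own assumptions, not taken from the paper.

```python
# Illustrative sketch of the two prompting strategies (Direct vs. Reverse).
# Prompt wording and function names are assumptions, not the paper's prompts.

def direct_prompt(rubric: str, code: str) -> str:
    """Direct method: ask the model to apply the rubric to the code as-is,
    scoring each rubric category independently."""
    return (
        "You are a grader. Apply each rubric category to the student code "
        "independently and return a numeric score per category.\n"
        f"Rubric:\n{rubric}\n\nStudent code:\n{code}"
    )

def reverse_prompt(rubric: str, code: str) -> str:
    """Reverse method: ask the model to debug the code first, classify each
    fix as minor or major, then grade based on the correction effort."""
    return (
        "You are a grader. First correct all errors in the student code, "
        "classifying each fix as minor or major. Then assign a score under "
        "the rubric based on the correction effort required.\n"
        f"Rubric:\n{rubric}\n\nStudent code:\n{code}"
    )
```

Either prompt would then be sent to the grading model (GPT-4 in the study) and the returned scores parsed from its response.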
The methodology involved creating synthetic programming questions in Java, designed to simulate typical CS1-level assignments, with student-like solutions generated using Gemini Flash 2.0 to control for quality and error types. The AI grading was performed using GPT-4, selected for its reliability in code-related tasks, with prompts iteratively refined for both Direct and Reverse techniques. The Direct method instructed the AI to score each rubric category independently, while the Reverse method asked it to debug the code, classify fixes as minor or major, and then assign a score based on correction effort. To assess accuracy, AI-assigned scores were compared against evaluations from experienced human tutors, using both a 10-point scale and a 100-point scale for finer granularity.
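The Reverse method's final step, turning counted corrections into a grade, could look like the sketch below. The penalty weights are illustrative assumptions; the paper does not publish its exact deduction scheme.

```python
# Hypothetical scoring rule for the Reverse method: deduct a small penalty
# per minor fix and a larger one per major fix. Weights are assumptions.

MINOR_PENALTY = 5.0    # e.g. a missing semicolon or off-by-one in a message
MAJOR_PENALTY = 20.0   # e.g. a logic error that changes program behavior

def score_from_fixes(minor: int, major: int, max_score: float = 100.0) -> float:
    """Compute a 100-point score from the number of minor and major fixes
    the model had to make, floored at zero."""
    return max(0.0, max_score - minor * MINOR_PENALTY - major * MAJOR_PENALTY)

def to_ten_point(score_100: float) -> float:
    """Scale a 100-point score down to the 10-point scale used by tutors."""
    return round(score_100 / 10.0, 2)
```

For example, a submission needing two minor fixes and one major fix would score 70 out of 100 under these weights, or 7.0 on the 10-point scale.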
Results from the study, detailed in figures and tables in the paper, indicate that AI grading aligns well with human scores, particularly for higher-quality submissions. Box plots in the paper show that for Good-quality submissions, AI methods, especially Reverse, closely matched tutor grading, while for Poor-quality submissions, Reverse scores tended to be higher, suggesting potential under-penalization. The data also revealed that using a 100-point rubric allowed for more granular scoring, helping both AI methods better approximate tutor scores, with Reverse grading often coming within 0.5 points on average. Syntax errors were consistently detected across approaches, but logic errors proved more challenging, with Reverse grading generally performing better at identifying and correcting these flaws before scoring.
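One simple way to quantify this alignment is the mean absolute gap between AI and tutor averages. The sketch below applies it to the three category averages quoted earlier in this article (a coarse summary; the paper's own per-submission comparisons are finer-grained).

```python
# Category averages quoted earlier (100-point rubric scaled to 10 points).
reverse_avg = {"Poor": 5.74, "Moderate": 7.37, "Good": 9.05}
tutor_avg   = {"Poor": 2.27, "Moderate": 5.68, "Good": 7.83}

def mean_abs_gap(a: dict, b: dict) -> float:
    """Mean absolute difference between two score dictionaries
    sharing the same keys."""
    return sum(abs(a[k] - b[k]) for k in a) / len(a)
```

On these three averages the gap is largest for Poor submissions (5.74 vs. 2.27), consistent with the under-penalization the box plots show.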
The implications of this research are significant for educational settings, where scalable grading systems could reduce instructor workloads and improve consistency. The Reverse method, in particular, offers constructive feedback by showing students how to fix their code, which may enhance learning outcomes. However, the study notes that practical implementation requires careful prompt engineering and clear rubrics to avoid inconsistencies. A hybrid workflow, where AI handles first-pass grading and humans review uncertain cases, is suggested as a viable approach to maintain fairness and efficiency in large courses.
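The suggested hybrid workflow needs a triage rule for deciding which submissions a human should see. A minimal sketch, assuming one plausible rule (flag submissions where the two AI methods disagree; the paper does not prescribe a specific criterion):

```python
# Hypothetical triage rule for a hybrid AI/human grading workflow:
# route a submission to human review when the Direct and Reverse methods
# disagree by more than `tolerance` points on the 10-point scale.
# The tolerance value is an assumption, not from the paper.

def needs_human_review(direct_score: float, reverse_score: float,
                       tolerance: float = 1.0) -> bool:
    """Return True when the two AI methods diverge enough that a
    human tutor should double-check the grade."""
    return abs(direct_score - reverse_score) > tolerance
```

Under this rule, a submission scored 7.0 by Direct and 9.0 by Reverse would be flagged, while one scored 8.0 and 8.5 would be auto-accepted.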
Despite these promising results, the study acknowledges several limitations. The code samples used were short and focused on single-function problems, which may not generalize to more complex programs involving multiple files or object-oriented design. Additionally, the dataset was synthetic, created to simulate student submissions, and thus does not capture the full range of real student mistakes and styles. The AI models were guided by prompts rather than trained on specific rubrics, and their output could vary with different instructions or model versions. Other unaddressed concerns include the cost of running large models, potential response delays, and fairness issues related to bias in AI-generated feedback, indicating that further testing is needed before real-world deployment.
About the Author
Guilherme A.
Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.