As artificial intelligence tools become ubiquitous in classrooms, a critical question emerges: can these systems truly teach, or are they merely advanced answer machines? A new study evaluates three leading large language models (ChatGPT, Gemini, and DeepSeek) across three established pedagogical strategies: providing examples, using explanations and analogies, and employing the Socratic Method. The research, conducted by a team from institutions including the Federal Institute of São Paulo and the University of Pennsylvania, involved six human judges interacting with the models over 270 hours, simulating beginner programming students working on C language exercises. The results indicate that while AI can follow instructional prompts, its effectiveness varies significantly across models and strategies, with implications for how these tools are integrated into educational settings.
The study's key finding is that ChatGPT and Gemini generally outperformed DeepSeek in pedagogical skills, but all models showed notable limitations. For instance, in the Examples approach, where models were prompted to offer code snippets without full solutions, all three received low scores for variety, with means around 3.2 out of 5, indicating a lack of contextual diversity in their examples. ChatGPT scored highest in relevance at 4.6, while Gemini led in abstract-concrete connections at 4.2. However, DeepSeek was more likely to provide immediate solutions, an undesirable behavior, doing so in 70.7% of cases compared to Gemini's 2%. This suggests that some models struggle to resist giving direct answers, which could hinder student learning by reducing opportunities for independent problem-solving.
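To illustrate what such a constraint looks like in practice, here is a minimal sketch of a role prompt for the Examples approach. The prompt wording, the `ask_model` helper, and the message format are assumptions for illustration; the article does not reproduce the authors' exact instruction.

```python
# Hypothetical role prompt for the Examples approach. The study's actual
# instruction text is not quoted in the article; this wording is illustrative.
TEACHER_PROMPT = """You are an encouraging teacher for beginner C programming
students. When a student asks for help with an exercise:
- Show short, partial code snippets that illustrate one concept at a time.
- Never provide the complete solution to the exercise itself.
- Draw your examples from varied contexts, not only the exercise at hand."""

def ask_model(messages: list[dict]) -> str:
    """Placeholder for a call to ChatGPT, Gemini, or DeepSeek via its SDK."""
    raise NotImplementedError

messages = [
    {"role": "system", "content": TEACHER_PROMPT},
    {"role": "user", "content": "How do I sum the elements of an int array?"},
]
# reply = ask_model(messages)  # judges would then score the reply, e.g. for
# variety and relevance, and flag any full solutions as undesirable behavior
```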
Methodologically, the researchers developed a rigorous evaluation protocol to assess pedagogical skills beyond mere correctness. They used role prompting, where each model was given an initial instruction to act as an encouraging teacher for beginner C programming students, with specific rules against providing immediate solutions. The judges, all with advanced programming knowledge, interacted with each model 25 times per pedagogical approach, following a structured workflow that included clearing model memory and simulating student responses such as wrong answers or requests for clarification. Evaluations were based on criteria like relevance, correctness, and adaptability, scored on a 0-5 scale, with statistical analysis including Kruskal-Wallis and Dunn's tests to compare model performance. This human-in-the-loop approach aimed to mimic real-world educational scenarios, though it relied on expert simulations rather than actual learners.
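As a rough sketch of how such a protocol could be organized in code, the snippet below models one judging session. The `Judgment` record, the criterion list, and the `judge` helper methods are hypothetical stand-ins inferred from the article's description, not the authors' actual tooling.

```python
from dataclasses import dataclass, field

# Criteria named in the article; the paper's full rubric may differ.
CRITERIA = ["relevance", "correctness", "adaptability"]

@dataclass
class Judgment:
    model: str      # "ChatGPT", "Gemini", or "DeepSeek"
    approach: str   # "Examples", "Explanations and Analogies", "Socratic Method"
    scores: dict[str, int] = field(default_factory=dict)  # criterion -> 0-5

def run_session(model: str, approach: str, judge) -> Judgment:
    """One of the 25 interactions each judge performs per model and approach."""
    judge.clear_model_memory(model)          # fresh context for every session
    judge.send_role_prompt(model, approach)  # "encouraging teacher" setup
    # The judge simulates a beginner: wrong answers, clarification requests, etc.
    transcript = judge.interact(model)
    return Judgment(model, approach,
                    {c: judge.score(transcript, c) for c in CRITERIA})
```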
The analysis, detailed across 1,350 evaluations, reveals nuanced differences. In the Explanations and Analogies approach, ChatGPT achieved a final average score of 4.59, significantly higher than DeepSeek's 4.36, with strengths in clarity and focus on critical parts. However, all models scored lowest in connecting to previous knowledge, with means around 3.9, highlighting a gap in personalizing instruction. For the Socratic Method, ChatGPT excelled with a final average of 4.63, while DeepSeek lagged at 4.05, particularly in providing counterexamples and promoting critical thinking. Statistical tests confirmed these differences, such as in the Socratic Method's critical thinking promotion criterion, where ChatGPT scored 4.67 compared to DeepSeek's 3.96. These data points, illustrated in figures like Figure 13, underscore that model performance is highly sensitive to pedagogical strategy, with the Socratic Method showing the greatest variability.
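The statistical comparison can be reproduced with standard scientific Python tools. The sketch below runs a Kruskal-Wallis test followed by Dunn's post-hoc test on illustrative, made-up score samples; the article does not specify which p-value adjustment the authors used, so the Bonferroni choice here is an assumption.

```python
from scipy import stats
import scikit_posthocs as sp
import pandas as pd

# Illustrative scores for one criterion (0-5 scale); the real data comes
# from the 1,350 judge evaluations.
scores = {
    "ChatGPT":  [5, 4, 5, 5, 4, 5, 4, 5],
    "Gemini":   [4, 4, 5, 4, 4, 5, 4, 4],
    "DeepSeek": [4, 3, 4, 4, 3, 4, 4, 3],
}

# Kruskal-Wallis: do the three models differ on this criterion overall?
h, p = stats.kruskal(*scores.values())
print(f"H = {h:.2f}, p = {p:.4f}")

# Dunn's post-hoc test locates which pairs of models differ.
df = pd.DataFrame([(m, s) for m, vals in scores.items() for s in vals],
                  columns=["model", "score"])
print(sp.posthoc_dunn(df, val_col="score", group_col="model",
                      p_adjust="bonferroni"))
```

A significant Kruskal-Wallis result justifies inspecting the pairwise Dunn matrix, which mirrors how the paper pinpoints differences such as ChatGPT versus DeepSeek on critical thinking promotion.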
The implications of this research are significant for educators and policymakers considering AI integration in schools. The study suggests that AI models like ChatGPT and Gemini can serve as supplementary teaching partners, especially in resource-limited settings where human instructors are scarce, as noted in the paper's discussion of AIED Unplugged. However, their effectiveness depends on careful prompt design and model selection; for example, Gemini's tendency to start with overly basic concepts in the Socratic Method might frustrate students. Ethical considerations are also paramount, as the paper warns that reliance on proprietary, cloud-based models could exacerbate digital divides. This work moves beyond accuracy benchmarks to evaluate pedagogical utility, offering a framework for future assessments that prioritize educational value over mere correctness.
Limitations of the study, as outlined in the paper, include potential cultural biases, since Western judges evaluated models of both Western and Eastern origin, and the use of expert simulations rather than real students, which may not capture authentic learner interactions. Additionally, model updates during the testing period, such as ChatGPT's transition to version 4.1, required adjustments that could have influenced the results. The judges were not blinded to model identities, possibly introducing brand bias, and the focus on beginner-level C programming exercises may have restricted the range of pedagogical responses. Despite these constraints, the research provides a foundational protocol for evaluating AI teaching capabilities, with future work planned to include more models, inter-rater reliability analysis, and real classroom trials to assess learning gains.
About the Author
Guilherme A.
Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.