Descriptive Assessment of Student Code by LLMs: An Empirical Study
Costagliola G.; De Rosa M.; Fuccella V.; Piscitelli A.
2026
Abstract
The improved capabilities of Large Language Models (LLMs) enable their use in various fields, including education. Teachers and students already use LLMs to support teaching and learning. In this paper, we analyse 300 comments generated by an LLM-based assessment system built on three LLMs (GPT-4o, GPT-3.5, and claude-sonnet-20241022) and measure the level of agreement between each LLM's insight and the teacher’s textual evaluation. The results showed an average agreement level of 2.5 out of 3, with Claude Sonnet achieving the highest agreement (2.7, SD=0.5) using the zero-shot prompting strategy, followed by GPT-4o and GPT-3.5. Zero-shot prompting also resulted in the highest rate of full agreement (level 3) with the teacher, peaking at 69% for Claude Sonnet. A qualitative analysis of the remaining disagreements revealed that most inconsistencies were due to the LLMs overlooking critical logical errors or focusing on stylistic aspects instead of functionality. These findings highlight the potential and current limits of LLMs in providing pedagogically aligned feedback.

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.


