Descriptive Assessment of Student Code by LLMs: An Empirical Study

Costagliola, G.; De Rosa, M.; Fuccella, V.; Piscitelli, A.

doi:10.1007/978-3-032-13184-3_20

The improved capabilities of Large Language Models (LLMs) enable their use in various fields, including education, where teachers and students already use them to support teaching and learning. In this paper, we analyse 300 comments generated by an LLM-based assessment system built on three LLMs (GPT-4o, GPT-3.5, and claude-sonnet-20241022) and measure the level of agreement between each LLM comment and the teacher’s textual evaluation. The results show an average agreement level of 2.5 out of 3, with Claude Sonnet achieving the highest agreement (2.7, SD=0.5) under the zero-shot prompting strategy, followed by GPT-4o and GPT-3.5. Zero-shot prompting also yielded the highest rate of full agreement with the teacher (level 3), peaking at 69% for Claude Sonnet. A qualitative analysis of the remaining disagreements revealed that most inconsistencies arose because the LLMs overlooked critical logical errors or focused on stylistic aspects instead of functionality. These findings highlight both the potential and the current limits of LLMs in providing pedagogically aligned feedback.

2026

Publication type: 2.1.1 Article in book with DOI

Use this identifier to cite or link to this document: https://hdl.handle.net/11386/4952095

UniSa - IRIS Institutional Research Information System
