Adaptive Probabilistic Operational Testing for Large Language Models Evaluation
Guerriero A.
2025
Abstract
Large Language Models (LLMs) power many modern software systems and are required to be highly accurate and reliable. Evaluating LLMs is challenging due to the high cost of manually labeling data and of validating those labels. This study investigates the suitability of probabilistic operational testing for the effective and efficient evaluation of LLMs, focusing on a case study with DistilBERT. To this end, we adopt an existing framework for Deep Neural Network (DNN) testing, DeepSample, and adapt it to the LLM domain by introducing auxiliary variables tailored to LLMs and to classification tasks. Through a comprehensive evaluation, we demonstrate how sampling-based operational testing can yield reliable LLM accuracy estimates and effectively expose failures, or, under testing budget constraints, strike a trade-off between accuracy estimation and failure exposure. Experimental results with DistilBERT on three sentiment analysis datasets show that sampling-based methods can provide cost-effective and reliable operational accuracy assessment for LLMs. These findings offer practical insights for testers and help address critical gaps in current LLM evaluation practices.
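To illustrate the core idea behind sampling-based operational accuracy assessment, the Python sketch below (hypothetical function and variable names; a minimal stand-in, not DeepSample's actual samplers) draws test inputs with probability proportional to an auxiliary variable such as model confidence, then corrects the resulting sampling bias with inverse-probability weights so the accuracy estimate remains unbiased under a fixed labeling budget.

import numpy as np

def sampling_based_accuracy_estimate(aux, is_correct, budget, rng=None):
    """Sketch: unequal-probability sampling with a Hansen-Hurwitz-style
    correction for operational accuracy estimation.

    aux        -- auxiliary variable per operational input (e.g., the model's
                  softmax confidence); assumed strictly positive
    is_correct -- oracle verdict per input (1 = correct, 0 = failure); in a
                  real campaign only the sampled inputs would be labeled
    budget     -- number of inputs the tester can afford to label
    """
    rng = rng or np.random.default_rng()
    aux = np.asarray(aux, dtype=float)
    y = np.asarray(is_correct, dtype=float)
    n = len(aux)
    # Draw the test sample with probability proportional to the auxiliary
    # variable, concentrating the budget on inputs it deems informative.
    p = aux / aux.sum()
    idx = rng.choice(n, size=budget, replace=True, p=p)
    # Inverse-probability weights undo the sampling bias, keeping the
    # estimate unbiased for accuracy over the whole operational dataset.
    weights = 1.0 / (n * p[idx])
    return float(np.mean(weights * y[idx]))

# Example usage on synthetic data (all values hypothetical): 10,000
# operational inputs with true accuracy ~0.9, a budget of 200 labels.
rng = np.random.default_rng(0)
confidence = rng.uniform(0.5, 1.0, size=10_000)
oracle = (rng.uniform(size=10_000) < 0.9).astype(int)
print(sampling_based_accuracy_estimate(confidence, oracle, 200, rng))

The design choice this sketch highlights is the trade-off named in the abstract: skewing the sampling probabilities toward low-confidence inputs exposes more failures, while the reweighting step preserves the statistical validity of the accuracy estimate.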