Evaluating Large Language Models for Vulnerability Detection Under Realistic Conditions

Vincenzo Carletti; Pasquale Foggia; Carlo Mazzocca; Giuseppe Parrella; Mario Vento
2025

Abstract

Vulnerability Detection (VD) in source code is a critical task for ensuring the security of software systems, particularly in C/C++, languages that are extensively adopted in safety-critical applications. The recent widespread adoption of Large Language Models (LLMs) for software engineering tasks has led to specialized open-source Code-LLMs, tailored to handle programming languages and code-specific challenges. Although these models have achieved promising results for VD through prompt engineering and fine-tuning strategies, existing studies often evaluate them in unrealistic settings, where the test data is drawn from a distribution similar to that of the training data. In this work, we present a comprehensive evaluation of open-source Code-LLMs for VD in C/C++ code, employing both prompt engineering and fine-tuning approaches. We introduce a novel benchmark dataset composed exclusively of functions extracted from real-world, production-level open-source projects, with the aim of conducting a more realistic analysis. Our results highlight the limitations of current Code-LLMs for VD when evaluated under a realistic setup, emphasizing the need for more robust and generalizable solutions for secure software development.
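To make the prompt-engineering setup mentioned in the abstract concrete, the sketch below shows one plausible way to frame function-level VD as a zero-shot binary classification prompt for a Code-LLM. This is an illustrative assumption, not the authors' actual prompt or parsing logic: the prompt wording, the `build_vd_prompt`/`parse_verdict` helpers, and the answer-parsing heuristic are all hypothetical.

```python
# Hedged sketch: zero-shot binary vulnerability-detection prompting for a
# Code-LLM. Prompt wording and parsing heuristic are illustrative
# assumptions, not the paper's actual evaluation setup.

def build_vd_prompt(function_source: str) -> str:
    """Wrap a C/C++ function in a zero-shot classification prompt."""
    return (
        "You are a security analyst. Answer with exactly one word: "
        "VULNERABLE or SAFE.\n\n"
        "Is the following C/C++ function vulnerable?\n\n"
        f"```c\n{function_source}\n```\n"
        "Answer:"
    )

def parse_verdict(model_output: str) -> bool:
    """Map the model's free-text answer to a label (True = vulnerable)."""
    return "VULNERABLE" in model_output.strip().upper().split()

# Example: a classic unbounded strcpy, the kind of real-world C pattern
# such a benchmark would contain.
snippet = "void copy(char *dst, const char *src) { strcpy(dst, src); }"
prompt = build_vd_prompt(snippet)
print(parse_verdict("VULNERABLE"))  # True
print(parse_verdict("safe"))       # False
```

In an actual evaluation, `parse_verdict` would be applied to the text generated by the model under test, and the resulting labels compared against ground truth; constraining the model to a one-word answer simplifies parsing but is only one of several possible output-formatting choices.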

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11386/4915856
